Given my recent involvement with the design of a somewhat complex trial centered around a Bayesian data analysis, I am appreciating more and more that Bayesian approaches are a very real option for clinical trial design. A key element of any study design is sample size. While some would argue that sample size considerations are not critical to the Bayesian design (since Bayesian inference is agnostic to any pre-specified sample size and is not really affected by how frequently you look at the data along the way), it might be a bit of a challenge to submit a grant without telling the potential funders how many subjects you plan on recruiting (since that could have a rather big effect on the level of resources - financial and time - required.)

Earlier, I touched a bit on these issues while discussing the frequentist properties of Bayesian models, but I didn’t really get directly into sample size considerations. I’ve been doing some more exploring and simulating, so I am sharing some of that here.

### Bayesian inference

In the Bayesian framework, all statistical inference is based on the estimated posterior probability distribution for the parameter(s) of interest (say \(\theta\)) once we have observed the data: \(P(\theta | \text{data})\). In addition to extracting the mean or median of the distribution as a point estimate, we can get a measure of uncertainty by extracting quantiles from the distribution (a 95% interval comes to mind, though there is no reason to be limited by that convention).

Alternatively, we can make a probability statement about the parameter being above or below a threshold of effectiveness. For example, if we are estimating a log-odds ratio for an intervention that prevents a bad outcome, we might be interested in \(P(\log(OR) < 0).\) We may even pre-specify that the trial will be considered a success if \(P(\log(OR) < 0) > 0.95.\)
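As a minimal sketch of these summaries, suppose we already have a vector of posterior draws for the log-odds ratio. The draws below are simulated from a normal distribution purely as a stand-in for MCMC output, and `draws` and `summarize_posterior` are hypothetical names used only for illustration:

```r
# Stand-in posterior draws for log(OR); in practice these come from the MCMC fit
set.seed(123)
draws <- rnorm(16000, mean = -0.6, sd = 0.5)

summarize_posterior <- function(draws, threshold = 0, cutoff = 0.95) {
  p_below <- mean(draws < threshold)   # estimate of P(log(OR) < threshold | data)
  list(
    median   = median(draws),                      # point estimate
    interval = quantile(draws, c(0.025, 0.975)),   # 95% interval from quantiles
    p_below  = p_below,
    success  = p_below > cutoff                    # pre-specified success rule
  )
}

post <- summarize_posterior(draws)
post
```

Both the interval and the success rule are just summaries of the same set of draws, so changing the convention (a 90% interval, a different threshold) only changes the arguments.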

```r
library(simstudy)
library(data.table)
library(ggplot2)
library(cmdstanr)
library(posterior)
library(bayesplot)
```

### Data generation

To investigate, I will use a simple binary outcome \(Y\) that is changed by exposure or intervention \(A\). In this first case, I will randomly select a log-odds ratio from \(N(\mu = -1, \sigma = 0.5).\)

```r
defB <- defDataAdd(varname = "Y", formula = "-2 + ..lor * A", 
  dist = "binary", link = "logit")

set.seed(21)
lor <- rnorm(1, -1, 0.5)

dT <- genData(200)
dT <- trtAssign(dT, grpName = "A")
dT <- addColumns(defB, dT)
```
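To make the data generating process explicit, here is a rough base `R` sketch of what the definition implies, assuming balanced 1:1 treatment assignment. It will not reproduce the exact data set generated by `simstudy` (the random number streams are used differently), and the variable names `n`, `p`, and `event_rates` are only illustrative:

```r
set.seed(21)
lor <- rnorm(1, mean = -1, sd = 0.5)      # one draw from N(-1, 0.5)

n <- 200
A <- sample(rep(c(0, 1), each = n / 2))   # assumed balanced 1:1 assignment
p <- plogis(-2 + lor * A)                 # inverse logit of the linear predictor
Y <- rbinom(n, size = 1, prob = p)        # binary outcome

control_risk <- plogis(-2)                # event probability when A = 0
event_rates <- tapply(Y, A, mean)         # observed event proportions by arm

control_risk
event_rates
```

The intercept of -2 on the logit scale corresponds to an event probability of roughly 12% in the control arm, so a modest trial will see relatively few events in either arm.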

### Model fitting

I am primarily interested in recovering the log-odds ratio used to generate the data using a simple Bayesian model, written here in `Stan`. The parameter of interest in the `Stan` model is \(\beta\), the log-odds ratio. The prior distribution is \(t_{student}(df=3, \mu=0, \sigma=5).\)

```stan
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
  vector[N] x;
  real mu;
  real s;
}

parameters {
  real alpha;
  real beta;
}

model {
  beta ~ student_t(3, mu, s);
  y ~ bernoulli_logit(alpha + beta * x);
}
```
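To get a feel for how diffuse this prior is on the log-odds scale, here is a small base `R` sketch that treats the prior as a location-scale \(t\) distribution with 3 degrees of freedom, location 0, and scale 5. The object names are illustrative only:

```r
df <- 3
mu <- 0
sigma <- 5

# Central 95% prior interval for beta: shift and scale the standard t quantiles
prior_interval <- mu + sigma * qt(c(0.025, 0.975), df = df)

# Prior probability that |beta| > 2, i.e. an odds ratio outside (exp(-2), exp(2));
# this symmetric shortcut relies on mu = 0
p_large <- 2 * pt(-2 / sigma, df = df)

prior_interval
p_large
```

The interval spans roughly \(\pm 16\) on the log-odds scale, so with `s = 5` the prior does very little to pull the estimate towards 0.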

To estimate the posterior distribution, I am using the `R` package `cmdstanr`:

```r
mod <- cmdstan_model("code/bayes_logistic.stan")

fit <- mod$sample(
  data = list(N = nrow(dT), y = dT$Y, x = dT$A, mu = 0, s = 5),
  refresh = 0,
  chains = 4L,
  parallel_chains = 4L,
  iter_warmup = 1000,
  iter_sampling = 4000,
  step_size = 0.1,
  show_messages = FALSE
)
```

```
## Running MCMC with 4 parallel chains...
## 
## Chain 1 finished in 0.2 seconds.
## Chain 2 finished in 0.2 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.2 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.2 seconds.
## Total execution time: 0.4 seconds.
```

(If you’re impressed at how fast that model ran, it is because it is on my new MacBook Pro with the new Apple M1 chip - 4 or 5 times faster than my previous MacBook Pro with an Intel chip. It took me a while to get `R`, `RStudio`, and particularly, `cmdstan` up and running, but once I did, it has been totally worth it.)

First thing to check, of course, is whether the sampling from the posterior distribution was well-behaved. Here is a trace plot for the parameter \(\beta\):

```r
draws_array <- as_draws_array(fit$draws())
mcmc_trace(draws_array, pars = "beta")
```

Here are the summary statistics of the posterior distribution. Based on these data, the median log-odds ratio is \(-0.61\) and \(P(lor < 0) = 89\%\):

```r
res <- data.table(fit$summary(variables = "beta"))[, 
  .(median, sd, q95, len = q95 - q5)]

betas <- data.table(beta = as.matrix(draws_array[,,"beta"]))
res$p0 <- mean(betas$beta.V1 < 0)

res
```

```
##        median       sd       q95      len      p0
## 1: -0.6050845 0.511862 0.2103548 1.673138 0.88875
```

A plot of the posterior distribution is the best way to fully assess the state of knowledge about the parameter having observed this data set. The density plot includes a vertical dashed line at the median, and the dark shading indicates the lowest \(95\%\) of the density. The fact that the cutoff point \(0\) lies within the bottom \(95\%\) makes it clear that the threshold was not met: less than \(95\%\) of the posterior probability falls below \(0\).

```r
d <- density(draws_array[,,"beta"], n = 1024)
plot_points <- as.data.table(d[c("x", "y")])
median_xy <- plot_points[findInterval(res$median, plot_points$x)]

ggplot(data = plot_points, aes(x = x, y = y)) +
  geom_area(aes(fill = (x < res$q95))) +
  geom_segment(x = median_xy$x, xend = median_xy$x, y = 0, yend = median_xy$y,
               size = 0.2, color = "white", lty = 3) +
  scale_fill_manual(values = c("#adc3f2", "#5886e5")) +
  theme(panel.grid = element_blank(),
        legend.position = "none")
```

### Bayesian power

If we want to assess what kind of sample sizes we might want to target in a study based on this relatively simple design (binary outcome, two-armed trial), we can conduct a Bayesian power analysis that has a somewhat different flavor from the more typical frequentist power analysis that I usually do with simulation. There are a few resources I’ve found very useful here: this book by Spiegelhalter et al and these two papers, one by Wang & Gelfand and another by De Santis & Gubbiotti.

When I conduct a power analysis within a frequentist framework, I usually assume a set of *fixed/known* effect sizes, and the hypothesis tests are centered around the frequentist p-value at a specified level of \(\alpha\). The Bayesian power analysis differs with respect to these two key elements: a distribution of effect sizes replaces the single fixed effect size to accommodate uncertainty, and the posterior distribution probability threshold (or another criterion such as the variance of the posterior distribution or the length of the 95% credible interval) replaces the frequentist hypothesis test.

We have a prior distribution of effect sizes. De Santis and Gubbiotti suggest it is not necessary (and perhaps less desirable) to use the same prior used in the model fitting. That means you could use a skeptical (conservative) prior centered around 0 in the analysis, but use a prior for data generation that is consistent with a clinically meaningful effect size. In the example above the *analysis prior* was

\[ \beta \sim t_{student}(df = 3, \mu = 0, \sigma = 5) \]

and the *data generation prior* was

\[ \beta \sim N(\mu = -1, \sigma = 0.5).\]

To conduct the Bayesian power analysis, I replicated the simulation and model fitting shown above 1000 times for each of seven different sample sizes ranging from 100 to 400. (Even though my laptop is quite speedy, I used the NYU Langone Health high performance cluster Big Purple to do this, because I wanted to save a few hours.) I’m not showing the parallelized code in this post, but take a look here for an example similar to this. (I’m happy to share with anyone if you’d like to have the code. Updated 7/1/2021: code has been added in the Addendum below.)
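The structure of that replication can be sketched in a few lines of base `R`. To keep the sketch fast and self-contained, it does **not** fit the `Stan` model; instead it swaps in a normal approximation to the posterior of \(\beta\) based on the `glm` estimate and standard error, which is only a rough stand-in when the analysis prior is diffuse. The function names `one_rep` and `bayes_power`, the reduced number of replications, and the three sample sizes are all assumptions made for illustration:

```r
# One replicate: draw an effect size, simulate a trial, approximate P(beta < 0 | data)
one_rep <- function(nobs, mu.lor = -1, sigma.lor = 0.5) {
  lor <- rnorm(1, mu.lor, sigma.lor)          # draw from the data generation prior
  A <- rep(c(0, 1), length.out = nobs)        # assumed balanced 1:1 assignment
  Y <- rbinom(nobs, 1, plogis(-2 + lor * A))  # binary outcome
  fit <- suppressWarnings(glm(Y ~ A, family = binomial))
  est <- coef(summary(fit))["A", ]
  # Normal approximation to the posterior (stand-in for the MCMC draws)
  unname(pnorm(0, mean = est["Estimate"], sd = est["Std. Error"]))
}

# Proportion of replicates in which the success criterion P(beta < 0) > 0.95 is met
bayes_power <- function(nobs, nsim = 200) {
  p0 <- replicate(nsim, one_rep(nobs))
  mean(p0 > 0.95)
}

set.seed(2021)
sizes <- c(100, 200, 400)
power_curve <- data.frame(nobs = sizes, p_95 = sapply(sizes, bayes_power))
power_curve
```

Each replicate uses a different effect size, so the resulting proportion averages over the uncertainty expressed in the data generation prior rather than conditioning on a single assumed effect.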

The plots below show a sample of 20 posterior distributions taken from the 1000 generated for each of three sample sizes. As in the frequentist context, an increase in sample size appears to reduce the variance of the posterior distribution estimated in a Bayesian model. We can see visually that as the sample size increases, the distribution collapses towards the mean or median, which has a direct impact on how confident we are in drawing conclusions from the data; in this case, it is apparent that as sample size increases, the proportion of posterior distributions that meet the 95% threshold increases.

Here is a curve that summarizes the probability of a posterior distribution meeting the 95% threshold at each sample size level. At a size of 400, 80% of the posterior distributions (which are themselves based on data generated from varying effect sizes drawn from the *data generation prior*, and on the *analysis prior*) would lead us to conclude that the trial is a success.

References:

Wang, Fei, and Alan E. Gelfand. “A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models.” *Statistical Science* 17, no. 2 (2002): 193-208.

Spiegelhalter, David J., Keith R. Abrams, and Jonathan P. Myles. *Bayesian approaches to clinical trials and health-care evaluation*. Vol. 13. John Wiley & Sons, 2004.

De Santis, Fulvio, and Stefania Gubbiotti. “Sample Size Requirements for Calibrated Approximate Credible Intervals for Proportions in Clinical Trials.” *International Journal of Environmental Research and Public Health* 18, no. 2 (2021): 595.

## Addendum

Here is the full R code for the Bayesian power analysis using simulation. I am including the `slurmR` code that I used to execute on the HPC:

```r
library(simstudy)
library(data.table)
library(ggplot2)
library(bayesplot)
library(posterior)
library(cmdstanr)
library(slurmR)
library(collapse)

s_define <- function() {
  
  defB <- defDataAdd(varname = "Y", formula = "-2 + ..lor * rx", 
    dist = "binary", link = "logit")
  
  return(list(defB = defB)) # list_of_defs is a list of simstudy data definitions
}

s_generate <- function(list_of_defs, argsvec) {
  
  list2env(list_of_defs, envir = environment())
  list2env(as.list(argsvec), envir = environment())
  
  #--- add data generation code ---#
  
  lor <- rnorm(1, mu.lor, sigma.lor)
  
  dT <- genData(nobs)
  dT <- trtAssign(dT, grpName = "rx")
  dT <- addColumns(defB, dT)
  
  return(dT[])
}

s_model <- function(generated_data, mod, argsvec) {
  
  list2env(as.list(argsvec), envir = environment())
  
  dt_to_list <- function(dx) {
    
    N <- nrow(dx)  ## number of observations 
    y <- dx$Y      ## individual outcome 
    x <- dx$rx     ## treatment arm for individual 
    s <- t_sigma
    mu <- 0 # can be mu.lor
    
    list(N = N, y = y, x = x, s = s, mu = mu)
  }
  
  fit <- mod$sample(
    data = dt_to_list(generated_data),
    refresh = 0,
    chains = 4L,
    parallel_chains = 4L,
    iter_warmup = 1000,
    iter_sampling = 4000,
    step_size = 0.1,
    show_messages = FALSE
  )
  
  res <- data.table(fit$summary(variables = "beta"))[, 
    .(median, sd, q95, len = q95 - q5)]
  
  draws_array <- as_draws_array(fit$draws())
  betas <- data.table(beta = as.matrix(draws_array[,,"beta"]))
  res$p0 <- mean(betas$beta.V1 < 0)
  
  return(res) # model_results is a data.table
}

s_single_rep <- function(list_of_defs, argsvec, mod) {
  
  set_cmdstan_path(path = "/gpfs/share/apps/cmdstan/2.25.0")
  
  list_of_defs <- s_define()
  generated_data <- s_generate(list_of_defs, argsvec)
  model_results <- s_model(generated_data, mod, argsvec)
  
  return(model_results)
}

s_replicate <- function(argsvec, nsim, mod) {
  
  list_of_defs <- s_define()
  
  model_results <- lapply(
    X = 1 : nsim, 
    FUN = function(x) s_single_rep(list_of_defs, argsvec, mod)
  )
  
  #--- add summary statistics code ---#
  
  model_sums <- unlist2d(lapply(model_results, function(x) x), 
    idcols = "replicate", DT = TRUE)
  
  summary_stats <- model_sums[ , 
    .(p_95 = mean(p0 >= 0.95), 
      p_len = mean(len <= 2),
      p_sd = mean(sd <= 0.5))
  ]
  
  model_ests <- data.table(t(argsvec), summary_stats)
  
  return(model_ests)
}

###

scenario_list <- function(...) {
  argmat <- expand.grid(...)
  return(asplit(argmat, MARGIN = 1))
}

mu.lor <- c(0, -0.5, -1.0, -1.5)
sigma.lor <- c(0.25)
nobs <- c(100, 150, 200, 250, 300, 350, 400)
t_sigma <- c(1, 5, 10)

scenarios <- scenario_list(mu.lor = mu.lor, sigma.lor = sigma.lor, 
  nobs = nobs, t_sigma = t_sigma)

set_cmdstan_path(path = ".../cmdstan/2.25.0")
mod <- cmdstan_model("present.stan")

job <- Slurm_lapply(
  X = scenarios, 
  FUN = s_replicate, 
  mod = mod,
  nsim = 1200,
  njobs = min(length(scenarios), 90L), 
  mc.cores = 4L,
  job_name = "i_bp",
  tmp_path = "/gpfs/data/troxellab/ksg/scratch",
  plan = "wait",
  sbatch_opt = list(time = "03:00:00", partition = "cpu_short"),
  export = c("s_single_rep", "s_define", "s_generate", "s_model"),
  overwrite = TRUE
)

summary_stats <- Slurm_collect(job)
final_tab <- rbindlist(summary_stats)

save(final_tab, file = ".../bp.rda")
```
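As a quick check of how the scenario grid in this code is laid out, the sketch below re-runs only the `scenario_list` helper with the same argument values; it needs nothing beyond base `R` (version 3.6 or later for `asplit`). The object `first` is an illustrative name:

```r
scenario_list <- function(...) {
  argmat <- expand.grid(...)        # one row per combination of arguments
  asplit(argmat, MARGIN = 1)        # split the grid into a list, one entry per row
}

scenarios <- scenario_list(
  mu.lor = c(0, -0.5, -1.0, -1.5),
  sigma.lor = 0.25,
  nobs = c(100, 150, 200, 250, 300, 350, 400),
  t_sigma = c(1, 5, 10)
)

length(scenarios)                   # 4 * 1 * 7 * 3 = 84 scenarios

first <- scenarios[[1]]             # named values: mu.lor, sigma.lor, nobs, t_sigma
as.list(first)
```

Each element is what gets passed to `s_replicate` as `argsvec`, which is why the functions above call `list2env(as.list(argsvec), ...)` to turn the named values into local variables.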
