Activity 1: Sampling from a population

Introduction

Welcome to Module 1: Sampling from a population.

In this module, we’ll explore how to relate samples to populations, calculate confidence intervals, and understand the impact of sample size on study results. To introduce these statistical concepts, we’ll start with a simple example focused on determining the prevalence of malaria. This approach provides a gentle introduction to thinking about these ideas in the context of study design. These same concepts are directly applicable to malaria molecular surveillance (MMS) studies. For instance, in MMS studies, we might design our research to estimate the prevalence of molecular markers, such as drug resistance markers, rather than the prevalence of malaria infection itself.

Learning Outcomes

By the end of this tutorial, you will be able to:

Define the target population for a study.
Differentiate between a population and a sample.
Calculate the 95% confidence interval.
Assess how sampling variability impacts the representation of the population.
Understand the effect of sample size on confidence intervals.

What is my population?

In any study, clearly defining the target population is crucial. Let’s consider a few examples of made-up studies.

QUIZ - Target population

Relating the sample to the population

Sampling the entire population is often impractical, whether due to cost constraints or the challenges of reaching everyone. Fortunately, we can design our study to ensure that the sample we collect is a representative subset of the population.

When designing epidemiological studies, various data sources can offer valuable insights into the population. Examples include Demographic and Health Surveys (DHS) or population censuses, which provide comprehensive information to help guide study design.

Using the population census

Let’s go back to our made-up study.

We now have access to a census of the entire village population (N = 10,000), with information on each resident’s age and sex. For the purposes of this tutorial, 2,678 people in the village have a malaria infection, and we “know” the true infection status of every individual. This will help us understand how our sample relates to the entire population.

This is the information we have in the census (here we show the first 6 residents in the census):

id	age	sex	malaria_infection
1	27	Male	Not infected
2	79	Female	Not infected
3	21	Female	Infected
4	8	Female	Not infected
5	4	Male	Not infected
6	37	Male	Infected
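To make the structure of the census concrete, here is a minimal sketch of a toy data frame with the same columns and the same number of infected residents. The name toy_census, the age range, and the randomly generated ages and sexes are assumptions for illustration only; this is not the tutorial's actual census object.

```r
# Hypothetical stand-in for the tutorial's census (ages and sexes are placeholders)
set.seed(1)
N <- 10000          # population size stated in the text
n_infected <- 2678  # number of infected residents stated in the text

toy_census <- data.frame(
  id  = seq_len(N),
  age = sample(0:90, N, replace = TRUE),                 # assumed age range
  sex = sample(c("Male", "Female"), N, replace = TRUE),  # assumed equal chance
  # Shuffle a vector that contains exactly n_infected "Infected" labels
  malaria_infection = sample(rep(c("Infected", "Not infected"),
                                 times = c(n_infected, N - n_infected)))
)

# First 6 residents, in the same layout as the table above
head(toy_census)

# True prevalence: proportion of residents recorded as infected
true_prevalence_toy <- mean(toy_census$malaria_infection == "Infected")
true_prevalence_toy
```

Because the number of infected residents is fixed at 2,678 out of 10,000, the proportion computed in the last line is the true prevalence of this toy population.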

Population demographics

Below we can see a breakdown of malaria-infected individuals in this population.

Let’s look at the age and sex distribution.

Sampling from the population

Suppose we have resources to determine the infection status of 200 individuals. Let’s randomly sample 200 individuals from the population and see how they compare with the overall population. We can explore demographic differences by plotting the age and sex distribution of the sample compared with the population.

Below we use the function sampleFromPopulation(), specifying the sample size and the data to sample from (in our case, the census data). We then use compareSampleToPopulation() to compare the sample with the census, and the plotAgeSexDistribution() and plotInfectedProportion() functions to visualize the results. Click on “Run Code”.

sample <- sampleFromPopulation(sample_size = 200, census)
comparison <- compareSampleToPopulation(sample, census)

plotAgeSexDistribution(comparison)

plotInfectedProportion(comparison)
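The internals of sampleFromPopulation() are not shown in this tutorial. As a rough sketch of the underlying idea, simple random sampling without replacement can be done in base R with sample(). The helper draw_simple_random_sample() and the toy_population data frame below are hypothetical names used only for illustration.

```r
# Toy population with the tutorial's infection count (hypothetical stand-in for census)
set.seed(42)
toy_population <- data.frame(
  id = 1:10000,
  malaria_infection = rep(c("Infected", "Not infected"), times = c(2678, 7322))
)

# Hypothetical helper: pick n rows at random, each resident at most once
draw_simple_random_sample <- function(population, n) {
  rows <- sample(nrow(population), size = n, replace = FALSE)
  population[rows, , drop = FALSE]
}

toy_sample <- draw_simple_random_sample(toy_population, 200)

# Proportion infected in the sample versus in the whole population
mean(toy_sample$malaria_infection == "Infected")
mean(toy_population$malaria_infection == "Infected")
```

The two proportions will usually be close but not identical, which is the sampling variability explored in the rest of this section.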

Reflection:

Do we see a similar age and sex distribution? What about the proportion of infected and not infected individuals in the population?

Sampling many times from the population

Now run this a few times with a sample size of 200 to see how the results change with every random sample. Click “Start Over” and then “Run Code”.

sample <- sampleFromPopulation(sample_size = 200, census)
comparison <- compareSampleToPopulation(sample, census)

plotAgeSexDistribution(comparison)
plotInfectedProportion(comparison)

QUIZ - Sampling from the population

Reflection:

Smaller sample sizes are more susceptible to sampling variability. With a limited number of individuals, the likelihood of the sample deviating from the population characteristics increases. Think about how this may or may not impact your results.
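To see this effect numerically, the sketch below draws many random samples at three sample sizes and measures how much the sample prevalence varies. It assumes simple random sampling without replacement from a population of 10,000 with 2,678 infected, so the number of infected people in a sample follows a hypergeometric distribution and can be simulated with rhyper(). The sample sizes 50 and 1000 and the number of repetitions are illustrative choices, not values from the tutorial.

```r
set.seed(123)
n_infected     <- 2678   # infected residents in the census
n_not_infected <- 7322   # 10000 - 2678
sample_sizes   <- c(50, 200, 1000)

# For each sample size, simulate 2000 samples and record the sample prevalence.
# rhyper() returns the number of infected people in a sample drawn without replacement.
spread <- sapply(sample_sizes, function(n) {
  prevalence <- rhyper(2000, n_infected, n_not_infected, n) / n
  sd(prevalence)  # spread of the sample prevalence across repeated samples
})

data.frame(sample_size = sample_sizes,
           sd_of_sample_prevalence = round(spread, 3))
```

The standard deviation of the sample prevalence shrinks as the sample size grows, which is why larger samples tend to sit closer to the population value.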

✨ BONUS QUESTION ✨

You will have noticed from our exploration above that the sample differs from the population and it doesn’t always have the same age and sex distribution.

QUIZ - Sampling bias

Estimating prevalence in our sample and calculating the 95% confidence interval

Our next topic focuses on calculating the 95% confidence interval (CI) using the Wald method. When estimating the prevalence of malaria in a sample, we start with a point estimate. However, it’s equally important to calculate the 95% CI to capture the variability around this estimate. The CI provides an interval with defined lower and upper bounds, representing the range within which we are 95% confident the true population prevalence lies. In other words, if we were to repeat the sampling process many times and calculate a confidence interval for each sample, approximately 95 out of 100 of these intervals would contain the true population prevalence.

Estimating prevalence in our sample

For this exercise we have already pre-calculated some useful parameters:

Defined sample_size to be 200
Defined infected_count to be the number of infected individuals in our sample (in our example it is 69 individuals)
Defined the function sampleFromPopulation() to select 200 individuals at random from our census

Below is the code we ran for reference, but you don’t have to run it yourself as everything is already loaded.

# set the sample size
sample_size <- 200

# sample from the population
sample_data <- sampleFromPopulation(sample_size, census) 

# Count number of infected individuals in the sample
infected_count <- sum(sample_data$malaria_infection == "Infected")

What is the estimated prevalence of malaria in our sample?

We can calculate this by dividing the number of individuals infected with malaria by our sample size.

Try coding it yourself, or click on the solution. Note: In R when we want to divide two things we can use /.

69 / 200

# Or you can use the stored variables:
infected_count / sample_size

Click to see the answer

Our estimated prevalence is 0.345 or 34.5%.

Calculating the 95% CI

Next, we need to calculate the 95% CI around our point estimate.

This is the formula for the Wald confidence interval:

\[ CI = \hat{p} \pm z_{1 - \alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Where:

\(\hat{p}\) = sample proportion
\(n\) = sample size
\(z_{1 - \alpha/2}\) = critical value of the standard normal distribution at significance level \(\alpha\) (two-sided)

We will now go through this formula step-by-step!

1. Defining our sample proportion, \(\hat{p}\)

The sample proportion refers to the proportion of infected individuals in our sample. We just calculated this above by dividing the number of infected individuals in the sample by the total sample size. Let’s do it again for good measure, and record it as p_hat.

p_hat <-

# Sample proportion
p_hat <- infected_count / sample_size
p_hat

2. Sample size, \(n\)

Above, we defined our sample size to be n=200 and recorded it as sample_size.

3. Calculating the standard error

We can calculate the standard error from p_hat and sample_size using the following formula:

\[ SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

Note: In R the function to take the square root is sqrt() and when we want to multiply two things we can use *.

SE <-

# Standard error
SE <- sqrt((p_hat * (1 - p_hat)) / sample_size)
SE

Click to see the answer

Our standard error is 0.034.

4. Calculating the confidence interval

Now that we know our standard error, we can calculate the lower and upper bounds of our 95% confidence interval (CI). To do this, we multiply the standard error by \(z_{1-\alpha/2}\) , which is approximately 1.96 for a 95% CI. This factor is derived from the normal distribution, where 95% of the probability lies within 1.96 standard deviations of the mean, ensuring our interval reflects this range of uncertainty.
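As a quick check on where 1.96 comes from, the sketch below uses base R's qnorm() and pnorm() for the standard normal distribution. The variable names alpha and z_exact are introduced here for illustration.

```r
# Significance level for a 95% confidence interval
alpha <- 0.05

# Two-sided critical value: the 97.5th percentile of the standard normal
z_exact <- qnorm(1 - alpha / 2)
z_exact   # approximately 1.96

# Probability that a standard normal value lies between -1.96 and 1.96
pnorm(1.96) - pnorm(-1.96)   # approximately 0.95
```

The same approach gives the critical value for other confidence levels, for example qnorm(0.95) for a 90% interval.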

Let’s start with the lower bound. Click on the solution if you need help.

# Critical value for 95% confidence
z_alpha <- 1.96

# Lower bound

# Critical value for 95% confidence
z_alpha <- 1.96

# Lower bound
p_hat - z_alpha * SE

Click to see the answer

Our lower bound is 0.279 or 27.9%.

Now let’s calculate the upper bound. Remember now we need to add instead of subtract.

# Critical value for 95% confidence
z_alpha <- 1.96

# Upper bound

# Critical value for 95% confidence
z_alpha <- 1.96

# Upper bound
p_hat + z_alpha * SE

Click to see the answer

Our upper bound is 0.411 or 41.1%.

Let’s put it all together!

infected_count <- 69
sample_size <- 200
z_alpha <- 1.96

p_hat <- infected_count / sample_size
SE <- sqrt((p_hat * (1 - p_hat)) / sample_size)
CI_lower <- p_hat - z_alpha * SE
CI_upper <- p_hat + z_alpha * SE

# Print our values
p_hat
CI_lower
CI_upper

So, putting it all together, our estimated prevalence is 0.345 or 34.5% and our 95% CI is 27.9% to 41.1%.
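If you expect to repeat this calculation, the steps can be wrapped in a small function. This is a minimal sketch: wald_ci() is a hypothetical helper, not one of the tutorial's preloaded functions, and it uses qnorm() for the critical value instead of the rounded 1.96.

```r
# Hypothetical helper: Wald confidence interval for a proportion
wald_ci <- function(infected_count, sample_size, conf_level = 0.95) {
  p_hat <- infected_count / sample_size              # point estimate
  z     <- qnorm(1 - (1 - conf_level) / 2)           # two-sided critical value
  se    <- sqrt(p_hat * (1 - p_hat) / sample_size)   # standard error
  c(estimate = p_hat, lower = p_hat - z * se, upper = p_hat + z * se)
}

# Our sample: 69 infected out of 200
ci <- wald_ci(infected_count = 69, sample_size = 200)
round(ci, 3)
```

Rounded to three decimal places, this reproduces the values calculated step by step above (0.345, 0.279 and 0.411).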

Comparing to the true prevalence

You may remember from our earlier exploration of the census data that 26.78% of our population (2,678 of 10,000 residents) was infected with malaria. In other words, this is the true prevalence.

QUIZ - 95% confidence interval

Reflection:

Do you always expect the true prevalence to fall within the 95% CI?

How often does true prevalence fall within the 95% CI?

Above, we explored what happens when we randomly sample 200 individuals. In that first example, the true prevalence (26.78%) fell just below the lower bound of the 95% CI (27.9%). But this was just one sample. Now we want to see what happens if we repeat the sampling many times. Let’s run a simulation in which we draw 1000 samples and count how many times the true prevalence falls within the 95% CI.

Reflection:

Before you run the code below, think about the intuition behind this: how often do you expect the true prevalence to be within the 95% CI?

Now run the code and see if you were correct!

n_simulations <- 1000
sample_size <- 200

results <- replicate(n_simulations, simulate_CI(census, sample_size, true_prevalence))

plotCISimulationResults(results, n_simulations)

QUIZ - Confidence Intervals and Simulation

After running the simulation, you should notice that the true prevalence falls within the 95% confidence intervals in approximately 95% of the simulations. This outcome aligns with the definition of a 95% confidence interval: if we were to repeat the sampling process many times, we would expect the true parameter to lie within the calculated confidence interval about 95 out of 100 times.
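The internals of simulate_CI() are not shown in this tutorial. As a rough sketch of what such a simulation could look like, the code below repeatedly draws a sample, computes the Wald interval, and records whether it contains the true prevalence. The helper ci_covers_truth() and the 0/1 vector infection_status are hypothetical stand-ins for the tutorial's own function and census.

```r
set.seed(2024)

# Toy population: 1 = infected, 0 = not infected (2678 of 10000 infected)
infection_status <- rep(c(1, 0), times = c(2678, 7322))
true_prev <- mean(infection_status)

# Hypothetical helper: draw one sample and report whether its Wald CI covers the truth
ci_covers_truth <- function(status, n, truth, z = 1.96) {
  p_hat <- mean(sample(status, size = n))   # sample without replacement
  se    <- sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - z * se) <= truth && truth <= (p_hat + z * se)
}

# Repeat 1000 times and compute the proportion of intervals that cover the truth
covered <- replicate(1000, ci_covers_truth(infection_status, 200, true_prev))
mean(covered)
```

The proportion in the last line is the empirical coverage. It should land near 0.95, although it will vary slightly from run to run and with the random seed.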

Reflection:

This simulation demonstrates the concept of confidence level in statistical inference. It shows that the method we use to calculate confidence intervals is reliable in the long run. However, in any single sample (like the one we initially took), there’s still a chance (about 5%) that the true prevalence will not be captured within the interval. This is why it’s important to interpret confidence intervals correctly and understand that they provide a measure of the uncertainty associated with our estimates.

✨ BONUS QUESTIONS ✨

Note: This section is optional and requires more coding than the previous exercise.

Let’s repeat this exercise for a sample size of 500. Try coding it yourself from scratch using the functions that we used above. Click on the solution if you get stuck!

sample_size <- 500

sample_size <- 500
sample_data <- sampleFromPopulation(sample_size, census)
comparison <- compareSampleToPopulation(sample_data, census)

plotAgeSexDistribution(comparison)
plotInfectedProportion(comparison)

infected_count <- sum(sample_data$malaria_infection == "Infected")

# Critical value for 95% confidence
z_alpha <- 1.96

p_hat <- infected_count / sample_size
SE <- sqrt((p_hat * (1 - p_hat)) / sample_size)
CI_lower <- p_hat - z_alpha * SE
CI_upper <- p_hat + z_alpha * SE

checkPrevalenceCI(true_prevalence, CI_lower, CI_upper)

n_simulations <- 1000
results <- replicate(n_simulations, simulate_CI(census, sample_size, true_prevalence))
plotCISimulationResults(results, n_simulations)

Reflection:

Does the true prevalence (26.78%) fall within our 95% CI? What do you notice about the simulation results?
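One way to reason about the effect of sample size is to compare the margin of error (half the width of the Wald interval) at the two sample sizes used in this activity. The sketch below plugs the true prevalence into the standard error formula for illustration; in a real study you would use the sample estimate p_hat, since the true prevalence is unknown. The variable names are introduced here and are not part of the tutorial's preloaded code.

```r
p            <- 0.2678       # true prevalence from the census (illustrative choice)
z_alpha      <- 1.96         # critical value for 95% confidence
sample_sizes <- c(200, 500)

# Margin of error = z * standard error, computed for each sample size
margin_of_error <- z_alpha * sqrt(p * (1 - p) / sample_sizes)

data.frame(sample_size     = sample_sizes,
           margin_of_error = round(margin_of_error, 3),
           ci_width        = round(2 * margin_of_error, 3))
```

The margin of error falls from roughly 6 percentage points at n = 200 to roughly 4 percentage points at n = 500. Because it scales with one over the square root of the sample size, halving the width of the interval requires about four times as many participants.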