Introduction
Welcome to Activity 2: Sample Size Calculation Base on Margin of Error
In this activity, we’ll explore how margin of error (MOE) arguments can guide sample size determination. This approach relies solely on the concept of precision; larger sample sizes lead to more precise estimates. This method is invaluable not only in Malaria Molecular Surveillance (MMS) but also in epidemiology more broadly, where estimating prevalence is often a primary goal.
Learning Outcomes
By the end of this tutorial, you will be able to:
- Calculate a 95% confidence interval from prevalence data.
- Calculate a minimum sample size using assumptions about prevalence and margin of error.
- Account for drop-out and malaria positive fraction.
Disclaimer: The scenarios in this document are entirely fictitious. While real place names are used, the data itself is artificial and designed for teaching purposes only. It does not necessarily represent the real epidemiological situation in these locations.
Background
You have been recruited by the National Malaria Control Programme (NMCP) of the Democratic Republic of the Congo (DRC) to assist with study design. The NMCP is concerned about the potential spread of mutations that confer partial resistance to the drug combination Sulfadoxine-Pyrimethamine (SP). The dhps K540E mutation, known to be associated with high level SP resistance when found alongside other common mutations, has recently been found at 72% prevalence in neighbouring Uganda. In the last few weeks there have been anecdotal reports of SP failure in Rutshuru town, which lies in Eastern DRC close to the border with Uganda. The NMCP is concerned that drug resistant parasites may be flowing over the border.
The NMCP plans to conduct a cross-sectional study to estimate the prevalence of the dhps K540E mutation in Rutshuru town. Your job is to work out the appropriate sample size for this study.
Results of a pilot study
Fortunately, a pilot study has already been conducted in Rutshuru town. This pilot study included 100 participants chosen at random from households within the town, who were tested for malaria via rapid diagnostic test (RDT). 23 people tested positive for malaria, and these samples were sent away for genetic sequencing. 19 samples were successfully sequenced, and 5 of these were found to carry the K540E mutation.
Recall that we can use the following formula to calculate a 95% confidence interval on our prevalence estimate:
\[ \hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \] Complete the following R code to compute this interval:
# estimate the prevalence
x <-
n <-
p <- x / n
# calculate the margin of error (MOE)
z_alpha <- 1.96
MOE <- z_alpha * sqrt(p*(1 - p) / n)
# compute lower and upper 95% limits
p - MOE
p + MOE
# estimate the prevalence
x <- 5
n <- 19
p <- x / n
# calculate the margin of error (MOE)
z_alpha <- 1.96
MOE <- z_alpha * sqrt(p*(1 - p) / N)
# compute lower and upper 95% limits
CI_lower <- p - MOE
CI_upper <- p + MOE
c(CI_lower, CI_upper)
Click to see the answer
The 95% interval goes from 0.065 to 0.461, or in other words from 6.5% to 46.1%.
The 95% confidence interval is very wide, revealing there is considerable uncertainty about the true prevalence of K540E mutations. While our best estimate is 26.3%, the plausible range spans from 6.5% to 46.1%, which the NMCP finds too broad to be practically useful. They therefore plan to conduct a follow-up study to obtain a more precise estimate.
Calculating the appropriate sample size
When designing the new study, we will calculate the exact sample size needed to achieve a target margin of error (MOE). We can do this by rearranging the MOE formula to isolate the sample size (\(n\)) on the left side. If you’re comfortable with the math and would like to see the steps, follow along below. Otherwise, feel free to skip to Step 3 for the final formula.
Step 1: Write down the formula for the MOE
We will use the mathematical symbol \(m\) for the MOE:
\[ m = z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}} \] Step 2: Square both sides
\[ m^2 = z_{1-\alpha/2}^2 \frac{p(1-p)}{n} \] Step 3: Multiply both sides by \(n\) and divide by \(m^2\)
\[ n = z_{1-\alpha/2}^2 \frac{p(1-p)}{m^2} \]
This leads us to a new formula that can be used to determine the appropriate sample size based on assumed values of \(p\) and \(m\). Our reason for working through the derivation of this formula was to show how closely connected it is to the formula for the confidence interval. In fact, it is the same mathematical expression, just reverse engineered to be in terms of \(n\)!
Now we need to decide what values to assume for \(p\) and \(m\).
In this tutorial we will assume a value of \(p=0.26\) to match the pilot data. The NMCP has decided that a MOE of 5% is acceptable. Complete the following R code to compute the resulting sample size:
# enter assumed values
p <-
m <-
# calculate the raw sample size
z <- 1.96
z^2*p*(1 - p) / m^2
# enter assumed values
p <- 0.26
m <- 0.05
# calculate the raw sample size
z <- 1.96
z^2*p*(1 - p) / m^2
Click to see the answer
We obtain a value of 295.65. We would round this up to \(n=296\) to give a whole number.
One of the nice things about sample size determination is that we can easily check that our calculation is correct.
Optional exercise: Try entering the values \(p=0.26\) and \(n=296\) into the 95% CI formula that we used in the pilot data analysis. If our calculations were correct, you should find that the resulting MOE is very close to 5%.
# Copy over the Wald formula code from the previous section, and edit to reflect
# the new assumed prevalence of 26% and sample size of 296
# enter assumed values
p <- 0.26
n <- 296
# calculate the margin of error (MOE)
z <- 1.96
z*sqrt(p*(1 - p) / n)
Click to see the answer
You should obtain a margin of error of 4.997%, which is very close to the target 5%. Notice that our MOE will always be equal or smaller than the target MOE. This is because we rounded the sample size up from 295.65 to 296.
Buffering
Buffering refers to increasing a sample size to allow for drop-out (loss of samples). Some ways that drop-out can occur are through:
- Participants withdrawing consent
- Participants dying or leaving the area
- Samples being lost during transportation
- Samples becoming contaminated
- Samples failing amplification or sequencing, resulting in a lack of genetic data
- Data being lost due to computational errors
We cannot completely eliminate the risk of drop-out, but by buffering sample sizes we can be robust to it. If we expect a proportion \(d\) of samples to be lost, then the formula for buffered sample size is:
\[ n_{\text{buffered}} = \frac{n_{\text{original}}}{1 - d} \]
Through consulting with lab technicians and the study team, you estimate that 10% of samples may be lost to drop-out. Complete the following R code to come up with a buffered sample size:
# enter sample size and estimated drop-out
n <- 296
d <- 0.1
# calculate the buffered sample size
n_buffered <-
print(n_buffered)
# enter sample size and estimated drop-out
n <- 296
d <- 0.1
# calculate the buffered sample size
n_buffered <- n / (1 - d)
print(n_buffered)
Click to see the answer
You should obtain a buffered sample size of 328.89, which we would round up to 329.
Accounting for positive fraction
So far, we have focused on working out how many confirmed malaria cases we need in our study. However, this will be a cross-sectional study with individuals being sampled at random from households within Rutshuru town, and many of these people will test negative for malaria. It would be useful for the study team to know how many individuals they need to test as part of this study, which may be considerably higher than the number of confirmed malaria cases.
The NMCP estimates that 25% of the population of Rutshuru will be positive for malaria by RDT. We can use the same buffering formula as before to account for this, but now using the positive fraction (\(f\)) to inflate our sample size:
\[ n_{\text{test}} = \frac{n_{\text{confirmed}}}{f} \]
Note that this inflating of the sample size to account for positive fraction is on top of buffering for drop-out. We can imagine a chain of events where we can lose samples at each stage; we want to ensure that in the end we still have enough samples remaining.
Complete the following R code to work out the number of people that need to be tested to achieve the final target sample size:
# enter buffered sample size and positive fraction
n <-
f <-
# calculate the testing sample size
n_test <-
print(n_test)
# enter buffered sample size and positive fraction
n <- 329
f <- 0.25
# calculate the testing sample size
N_test <- n / f
print(n_test)
Click to see the answer
You should find that 1316 people need to be tested.
You have now completed your study design exercise. Your recommendation to the NMCP is as follows:
Assuming a prevalence of K540E mutations of 26% based on pilot data, a sample size of 329 confirmed malaria cases will be needed to estimate prevalence to within 5% margin of error. This number is buffered to allow for 10% drop-out.
Assuming that malaria prevalence is 25% by RDT in Rutshuru town, this translates to 1316 individuals who will need to be tested in the cross-sectional study design.
Bonus questions
Our study design is carefully thought through, and based on strong statistical principles. However, it is worth testing how robust these numbers are to changes in our assumptions.
Under the chosen sample size of 296 (after drop-out), what would be your margin of error under the worst case scenario that the true prevalence of the K540E mutation was actually 50%?
# enter assumed values
p <- 0.5
n <- 296
# calculate the margin of error (MOE)
z <- 1.96
z*sqrt(p*(1 - p) / n)
Click to see the answer
The MOE would increase to 5.7% in the most pessimistic scenario. This is not terribly far from the target 5%.
We estimated that we will need to test 1316 people in order to achieve this final sample size, based on an assumed 25% prevalence of malaria in Rutshuru town. But what if during the study we found that malaria prevalence was only 15%? What would be your expected final sample size, and what would be your resulting MOE?
# enter assumed values
p <- 0.26
n_test <- 1316
# work out final sample size assuming 15% prevalence and 10% drop-out
n <- round(n_test * 0.15 * 0.9)
print(n)
# calculate the margin of error (MOE)
z <- 1.96
z*sqrt(p*(1 - p) / n)
Click to see the answer
Assuming 15% prevalence and 10% drop-out, our final sample size of successfully sequenced malaria cases would be around 178. This would result in a MOE of 6.4%.