Confidence Intervals for Proportions: A Step-by-Step Guide Using R and ggplot2

Introduction to Confidence Intervals for Proportions

A confidence interval is a statistical tool for estimating a population parameter from sample data while also conveying how uncertain that estimate is. In this article, we will explore how to plot a 95% confidence interval graph for a one-sample proportion using R and ggplot2.

What is a Sample Proportion?

A sample proportion is the fraction of successes observed in a random sample, and it serves as an estimate of the probability of success in the population the sample was drawn from. For example, suppose you are trying to determine the proportion of people who own a smartphone in your city. You take a random sample of 100 people and count how many of them own a smartphone. If 80 of them do, the sample proportion is 80/100 = 0.8, or 80%.
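As a minimal sketch of the smartphone example, the sample proportion can be computed either from the counts or as the mean of a 0/1 indicator vector. The variable names below (n_sampled, n_owners, owns) are made up for illustration and the data are the hypothetical survey counts from the text.

```r
# Hypothetical survey counts from the smartphone example
n_sampled <- 100   # people surveyed
n_owners  <- 80    # of those, the number who own a smartphone

# Sample proportion from the counts
p_hat <- n_owners / n_sampled

# The same value as the mean of a 0/1 indicator (1 = owns a smartphone)
owns <- c(rep(1, n_owners), rep(0, n_sampled - n_owners))
p_hat_from_indicator <- mean(owns)

p_hat
p_hat_from_indicator
```

The indicator form is the one used later in the article, where each row of the data is a single success or failure.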

What is a Confidence Interval?

A confidence interval is a range of values within which we believe that the true population parameter lies with a certain level of confidence. In this case, our goal is to estimate the true population proportion. A 95% confidence interval means that if we were to take many samples from the same population and compute a confidence interval for each one, approximately 95% of those intervals would contain the true population proportion.
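The repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a "true" proportion of 0.8 and samples of size 100 (both chosen only to match the smartphone example), and it uses the simple normal-approximation (Wald) interval, which is not the method plotted later in this article.

```r
set.seed(42)        # arbitrary seed so the run is reproducible

true_p <- 0.8       # assumed true population proportion
n      <- 100       # size of each simulated sample
n_sims <- 10000     # number of simulated samples

# Number of successes in each simulated sample
successes <- rbinom(n_sims, n, true_p)

# Wald 95% interval for each sample: p_hat +/- z * standard error
p_hat <- successes / n
se    <- sqrt(p_hat * (1 - p_hat) / n)
z     <- qnorm(0.975)
lower <- p_hat - z * se
upper <- p_hat + z * se

# Fraction of intervals that contain the true proportion
coverage <- mean(lower <= true_p & true_p <= upper)
coverage
```

The printed coverage should land near 0.95. With the Wald interval it is often slightly below the nominal level, which is one reason other interval constructions exist.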

Types of Confidence Intervals

Two related constructions are worth distinguishing: confidence intervals and confidence bands. A confidence interval, usually reported as a point estimate plus or minus a margin of error, covers a single parameter such as a proportion. A confidence band covers an entire curve or function, such as a fitted regression line, at many points at once. Neither is guaranteed to contain the true value; the confidence level describes how often the procedure captures it over repeated sampling. This article deals only with a confidence interval for a single proportion.

How to Plot a 95% Confidence Interval Graph

To plot a 95% confidence interval graph for one sample proportion, you can use either the logistic regression approach or the binomial distribution approach. In this article, we will explore both approaches and provide examples using R and ggplot2.

Logistic Regression Approach

The logistic regression approach involves fitting a logistic regression model to the data where the only parameter being estimated is the intercept. The estimate of the intercept represents the log odds of success in the population.

Step 1: Fit the Logistic Regression Model

mod <- coef(summary(glm(x ~ 1, family = binomial, data = df)))

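This code fits an intercept-only logistic regression to df, a data frame with one row per observation, where x is an indicator variable equal to 1 for a success and 0 for a failure. Wrapping the fit in coef(summary()) returns the coefficient table, so mod[1] is the estimated intercept (the log odds of success) and mod[2] is its standard error.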
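The model call assumes a data frame df that the article does not construct. One possible way to build it, assuming the same counts used later in the article (105 successes out of 1500 observations), is sketched below; the names n_trials, n_successes, log_odds and std_err are illustrative only.

```r
n_trials    <- 1500   # sample size
n_successes <- 105    # observed successes

# One row per observation: 1 = success, 0 = failure
df <- data.frame(x = rep(c(1, 0), times = c(n_successes, n_trials - n_successes)))

# Intercept-only logistic regression; keep only the coefficient table
mod <- coef(summary(glm(x ~ 1, family = binomial, data = df)))

log_odds <- mod[1]   # "Estimate" column
std_err  <- mod[2]   # "Std. Error" column

# Back-transforming the log odds recovers the sample proportion
plogis(log_odds)
n_successes / n_trials
```

Both printed values should agree, because the fitted intercept of an intercept-only logistic model is the logit of the sample proportion.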

Step 2: Compute the Density Values

xvals <- seq(mod[1] - 3 * mod[2], mod[1] + 3 * mod[2], 0.01)
yvals <- dnorm(xvals, mod[1], mod[2])

This code builds a grid of x values on the log-odds scale, running from 3 standard errors below the estimate to 3 standard errors above it in steps of 0.01, and then calculates the corresponding normal density values using dnorm().

Step 3: Compute the Proportion Values

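pvals <- plogis(xvals)
probs <- pnorm(xvals, mod[1], mod[2])

Because xvals are on the log-odds scale, plogis() converts each one back to a proportion between 0 and 1 for plotting. pnorm() then gives the cumulative probability of the fitted normal distribution at each x value from Step 2. These cumulative probabilities, not the binomial CDF, are what separate the tails from the central 95% in this approach.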

Step 4: Label the Proportions

label <- ifelse(probs < 0.025, "low", ifelse(probs > 0.975, "high", "CI"))

This code labels each x value according to its cumulative probability: "low" below 2.5%, "high" above 97.5%, and "CI" for the central 95% in between.
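The shaded central region corresponds to a numeric interval that can also be computed directly. The following sketch, which assumes the same 105 successes out of 1500 used elsewhere in the article, forms the 95% interval on the log-odds scale and back-transforms it with plogis(); the names ci_log_odds and ci_prop are illustrative.

```r
# Assumed data: 105 successes and 1395 failures, one row per observation
df  <- data.frame(x = rep(c(1, 0), times = c(105, 1395)))
mod <- coef(summary(glm(x ~ 1, family = binomial, data = df)))

# 95% interval on the log-odds scale: estimate +/- z * standard error
z           <- qnorm(0.975)
ci_log_odds <- mod[1] + c(-1, 1) * z * mod[2]

# Back-transform both endpoints to the proportion scale
ci_prop <- plogis(ci_log_odds)
ci_prop
```

For these counts the interval comes out at about 0.058 to 0.084. It is not symmetric around the observed proportion of 0.07, because the symmetry holds on the log-odds scale rather than on the proportion scale.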

Binomial Distribution Approach

The binomial distribution approach involves using the dbinom() and pbinom() functions to compute the density and cumulative distribution function of the binomial distribution.

Step 1: Define the Parameters

population <- 1500
actual_successes <- 105
test_successes <- 1:300

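This code defines the sample size (stored in the variable population, which is the number of trials, not the size of the whole population), the observed number of successes, and a vector of candidate success counts from 1 to 300 at which the distribution will be evaluated.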

Step 2: Compute the Density Values

density <- dbinom(test_successes, population, actual_successes/population)
probs   <- pbinom(test_successes, population, actual_successes/population)
label   <- ifelse(probs < 0.025, "low", ifelse(probs > 0.975, "high", "CI"))

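This code evaluates the binomial probability mass at each candidate count with dbinom() and the cumulative probability with pbinom(), using the observed proportion as the success probability. It then labels each count "low", "CI" or "high" according to whether its cumulative probability is below 2.5%, within the central 95%, or above 97.5%.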
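To see which counts end up in the central region, the labels can be turned back into numbers. The sketch below repeats the definitions from the steps above so that it runs on its own, then compares the result with the Clopper-Pearson interval from binom.test(), which is a different construction and is shown only for reference.

```r
population       <- 1500   # sample size
actual_successes <- 105    # observed successes
test_successes   <- 1:300  # candidate counts

probs <- pbinom(test_successes, population, actual_successes / population)
label <- ifelse(probs < 0.025, "low", ifelse(probs > 0.975, "high", "CI"))

# Smallest and largest counts labelled "CI", as counts and as proportions
ci_counts <- range(test_successes[label == "CI"])
ci_counts
ci_counts / population

# For comparison: exact (Clopper-Pearson) 95% interval from base R
ci_exact <- binom.test(actual_successes, population)$conf.int
ci_exact
```

The two intervals should be close but not identical: the first is the central 95% of a binomial distribution centred on the observed proportion, while binom.test() inverts the binomial test.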

Example Code

Here is an example of how to plot a 95% confidence interval graph using both approaches:

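# Logistic Regression Approach

library(ggplot2)

population <- 1500        # sample size (number of trials)
actual_successes <- 105   # observed successes

# One row per trial: 1 = success, 0 = failure
df <- data.frame(x = rep(c(1, 0), c(actual_successes, population - actual_successes)))

# mod[1] is the intercept (log odds); mod[2] is its standard error
mod <- coef(summary(glm(x ~ 1, family = binomial, data = df)))

# Grid of log-odds values and the normal density on the log-odds scale
xvals <- seq(mod[1] - 3 * mod[2], mod[1] + 3 * mod[2], 0.01)
yvals <- dnorm(xvals, mod[1], mod[2])

# Back-transform to proportions for the x axis; cumulative probability marks the tails
pvals <- plogis(xvals)
probs <- pnorm(xvals, mod[1], mod[2])

label <- ifelse(probs < 0.025, "low", ifelse(probs > 0.975, "high", "CI"))

ggplot(data.frame(probability = pvals, density = yvals, label), aes(probability, density, fill = label)) +
  geom_area(alpha = 0.5) +
  geom_vline(xintercept = actual_successes/population, linetype = 2) +
  scale_fill_manual(values = c("gray70", "deepskyblue4", "deepskyblue4"),
                    guide = guide_none()) +
  scale_x_continuous(limits = c(0.03, 0.13), breaks = 3:12/100,
                     name = "probability") +
  theme_bw()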

# Binomial Distribution Approach

population <- 1500
actual_successes <- 105
test_successes <- 1:300

density <- dbinom(test_successes, population, actual_successes/population)
probs   <- pbinom(test_successes, population, actual_successes/population)
label   <- ifelse(probs < 0.025, "low", ifelse(probs > 0.975, "high", "CI"))

ggplot(data.frame(probability = test_successes/population, density, label),
       aes(probability, density, fill = label)) +
  geom_area(alpha = 0.5) +
  geom_vline(xintercept = actual_successes/population, linetype = 2) +
  scale_fill_manual(values = c("gray70", "deepskyblue4", "deepskyblue4"),
                    guide = guide_none()) +
  scale_x_continuous(limits = c(0.03, 0.13), breaks = 3:12/100,
                     name = "probability") +
  theme_bw()

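Each plot shows the estimated sampling distribution of the proportion. The central 95% region is shaded gray, the two 2.5% tails are shaded blue, and a dashed vertical line marks the observed proportion of 105/1500 = 0.07.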

Conclusion

Confidence intervals are a useful statistical tool for estimating population parameters. In this article, we have explored how to plot a 95% confidence interval graph for one sample proportion using both logistic regression and binomial distribution approaches. By understanding how to compute and visualize these plots, you can make more informed decisions about your data and estimates.

Last modified on 2024-04-26