Calculating Difference in Proportion of Three Different Categories Between Two Groups Using gtsummary in R
In this article, we will explore how to calculate the difference in proportions between two groups (male and female) for each of three categories (“low”, “middle”, and “high”) of a categorical variable using the gtsummary package in R. We will provide an example with a sample dataset and demonstrate how to extract the desired information from the model summary.
Understanding the Problem
The problem statement is as follows:
- We have a binary grouping variable named gender (categories: “male”/“female”) and another categorical variable named ses_status (categories: “low”, “middle”, and “high”).
- We want to calculate the difference in the proportion of each ses_status category between the two groups (male and female) using the gtsummary package in R.
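To make the target quantity concrete, here is a minimal base R sketch that computes the share of each ses_status category within each gender and then subtracts the two. The toy vectors are hypothetical and are not the article's dataset; the male-minus-female direction is also an assumption.

```r
# Hypothetical toy vectors (an assumption, for illustration only)
gender <- c("M", "M", "M", "M", "F", "F", "F", "F", "F", "F")
ses_status <- c("low", "low", "middle", "high",
                "low", "middle", "middle", "high", "high", "high")

# Column-wise proportions: share of each ses_status category within each gender
tab <- table(ses_status, gender)
props <- prop.table(tab, margin = 2)

# Difference in proportion (male minus female) for each category
prop_diff <- props[, "M"] - props[, "F"]
print(round(prop_diff, 3))
```

Because the proportions within each gender sum to 1, the three differences always sum to 0.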
Solution Overview
To solve this problem, we will follow these steps:
- Create a sample dataset.
- Run a generalized linear model (GLM) with a binomial family to fit our data.
- Extract the desired information from the model summary, using exp() to convert log odds ratios into odds ratios, and manipulate it manually.
Creating a Sample Dataset
Let’s create a simple dataset with the specified variables:
# Toy data: 20 rows, alternating male/female.
# ses_status includes "middle" so that all three categories are present.
dt <- structure(list(
  gender = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F",
             "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
  smoke = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No",
            "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No",
            "Yes", "No", "Yes", "No"),
  ses_status = c("low", "high", "low", "high", "low", "high",
                 "low", "high", "low", "high", "middle", "middle", "middle",
                 "middle", "middle", "middle", "high", "low", "high", "low")
), class = "data.frame", row.names = c(NA, -20L))
Running a Generalized Linear Model (GLM)
Next, we will run a GLM with a binomial family to fit our data. A binomial GLM needs a binary outcome, so we model an indicator for one ses_status category (here “low”) as a function of gender:
model <- glm(I(ses_status == "low") ~ gender, family = binomial, data = dt)
summary(model)
The outcome I(ses_status == "low") is a binary indicator that is TRUE when an observation falls in the “low” category, and gender is the two-level predictor. The family = binomial argument specifies a binomial distribution for the outcome with the default logit link, so the gender coefficient is a log odds ratio. The same model can be refit for “middle” and “high”.
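To see how the coefficients of a binomial GLM relate to proportions, here is a minimal sketch. The data frame toy, the indicator is_low, and the counts are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data (an assumption, not the article's dataset)
toy <- data.frame(
  gender = factor(rep(c("F", "M"), each = 10), levels = c("F", "M")),
  is_low = c(rep(1, 2), rep(0, 8),   # 2 of 10 females are "low"
             rep(1, 5), rep(0, 5))   # 5 of 10 males are "low"
)

fit <- glm(is_low ~ gender, family = binomial, data = toy)
b <- coef(fit)

# Intercept = log odds of "low" among females (the reference level)
p_female <- plogis(b[["(Intercept)"]])
# Intercept + genderM = log odds of "low" among males
p_male <- plogis(b[["(Intercept)"]] + b[["genderM"]])

odds_ratio <- exp(b[["genderM"]])  # an odds ratio, not a proportion difference
prop_diff <- p_male - p_female     # difference in proportions (male minus female)
```

With a single binary predictor the model is saturated, so the fitted proportions reproduce the raw shares (0.2 and 0.5 here). Exponentiating the gender coefficient gives an odds ratio; the proportion difference has to be computed from the back-transformed probabilities.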
Calculating Proportion Difference
From the model summary, we can extract the log odds ratio (LOR) comparing males with females for each level of ses_status. Let’s calculate the proportion difference using the following steps:
- Extract the standard error (se) of the log odds ratio from the model summary.
- Compute the z-score corresponding to our desired confidence level (95% in this case) using the quantile function (inverse cumulative distribution function) of the normal distribution, qnorm().
- Combine the estimate, standard error, and z-score into confidence limits on the log odds scale, and exponentiate them with exp() to obtain limits for the odds ratio.
# Levels of ses_status that are present in the data, in a meaningful order
ses_levels <- intersect(c("low", "middle", "high"), unique(dt$ses_status))

# Define z-score for 95% CI
z_score_95 <- qnorm(0.975)

# For one ses_status level: fit a binomial GLM on its indicator and summarise it
summarise_level <- function(lvl) {
  d <- data.frame(
    y = as.integer(dt$ses_status == lvl),
    gender = factor(dt$gender, levels = c("F", "M"))
  )
  fit <- glm(y ~ gender, family = binomial, data = d)
  coefs <- summary(fit)$coefficients

  # Log odds ratio (male vs. female) and its standard error
  lor <- coefs["genderM", "Estimate"]
  se <- coefs["genderM", "Std. Error"]

  # Predicted proportion of this level within each gender (female, then male)
  new_d <- data.frame(gender = factor(c("F", "M"), levels = c("F", "M")))
  p <- predict(fit, newdata = new_d, type = "response")

  data.frame(
    ses_status = lvl,
    prop_female = p[[1]],
    prop_male = p[[2]],
    prop_diff = p[[2]] - p[[1]],            # male minus female
    odds_ratio = exp(lor),
    or_lower = exp(lor - z_score_95 * se),  # 95% CI for the odds ratio
    or_upper = exp(lor + z_score_95 * se)
  )
}

results <- do.call(rbind, lapply(ses_levels, summarise_level))

# Print results, rounded to four decimals
num_cols <- sapply(results, is.numeric)
results[num_cols] <- round(results[num_cols], 4)
print(results, row.names = FALSE)
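The code above calculates the log odds ratio and its standard error for each level of ses_status, then computes 95% confidence intervals using the quantile function (inverse cumulative distribution function) of the normal distribution, qnorm(). Note that exponentiating these limits gives a confidence interval for the odds ratio, not for the difference in proportions. The results are printed in a readable format.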
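As a cross-check, a confidence interval for the difference in proportions itself can be obtained without a GLM. The sketch below uses base R's prop.test() on hypothetical counts (an assumption, not taken from the sample dataset) and reproduces the same Wald interval by hand.

```r
# Hypothetical counts (an assumption for illustration):
# 50 of 100 males and 20 of 100 females fall in the "low" category
x <- c(male = 50, female = 20)
n <- c(male = 100, female = 100)

# Two-sample test of proportions without continuity correction
res <- prop.test(x, n, correct = FALSE)
diff_hat <- unname(res$estimate[1] - res$estimate[2])
ci <- res$conf.int  # 95% CI for the male-minus-female difference

# The same Wald interval computed by hand
p <- x / n
se <- sqrt(sum(p * (1 - p) / n))
ci_manual <- diff_hat + c(-1, 1) * qnorm(0.975) * se
```

Repeating this for each ses_status category gives one difference and one interval per category, which is the quantity described in the problem statement.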
Conclusion
In this article, we demonstrated how to calculate the difference in proportions between two groups (male and female) for each of three categories (“low”, “middle”, and “high”) of a categorical variable using the gtsummary package in R. We created a sample dataset, ran a generalized linear model with a binomial family, and extracted the desired information from the model summary to compute the proportion difference manually. This approach allows users to customize their output and visualize the results as needed.
Last modified on 2023-11-05