Calculating Difference in Proportion of Three Different Categories Between Two Groups Using gtsummary in R

In this article, we explore how to calculate the difference in proportions between two groups (male and female) for each of three categories (“low”, “middle”, and “high”) of a categorical variable in R, in the context of the gtsummary package. We work through an example with a sample dataset and show how to extract the relevant quantities from a fitted model summary.

Understanding the Problem

The problem statement is as follows:

  • We have a binary grouping variable named gender (categories: “male”/“female”) and a three-level categorical variable named ses_status (categories: “low”, “middle”, and “high”).
  • For each category of ses_status, we want the difference in its proportion between the two groups (male and female), in the context of the gtsummary package in R.

Solution Overview

To solve this problem, we will follow these steps:

  1. Create a sample dataset.
  2. Fit a generalized linear model (GLM) with a binomial family to the data.
  3. Extract the estimates from the model summary, back-transform them with exp(), and compute the remaining quantities manually.

Creating a Sample Dataset

Let’s create a simple dataset with the specified variables:

dt <- structure(list(
    gender = c("M", "F", "M", "F", "M", "F", "M", "F", "M", "F",
              "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"),
    smoke = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No",
              "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No",
              "Yes", "No", "Yes", "No"),
    # All three categories appear in both genders, so every comparison is estimable
    ses_status = c("low", "low", "low", "low", "low", "middle",
                   "low", "middle", "middle", "middle", "middle", "high",
                   "middle", "high", "high", "high", "high", "high",
                   "high", "high")
), class = "data.frame", row.names = c(NA, -20L))
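Before any modelling, it can help to look at the raw proportions directly. The following is a minimal sketch using base R's table() and prop.table() on a small hypothetical data frame named toy (its name and values are assumptions for illustration, separate from the dataset above). It computes the share of each ses_status category within each gender and the male-minus-female difference.

```r
# Hypothetical toy data (an assumption for illustration only)
toy <- data.frame(
    gender = rep(c("M", "F"), each = 10),
    ses_status = c(rep("low", 4), rep("middle", 3), rep("high", 3),  # males
                   rep("low", 2), rep("middle", 3), rep("high", 5))  # females
)
toy$ses_status <- factor(toy$ses_status, levels = c("low", "middle", "high"))

# Row proportions: share of each ses_status category within each gender
props <- prop.table(table(toy$gender, toy$ses_status), margin = 1)

# Difference in proportions (male minus female) for each category
prop_diff <- props["M", ] - props["F", ]
print(round(prop_diff, 2))
```

These raw differences are the quantities the model-based steps aim to reproduce, with confidence intervals added.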

Running a Generalized Linear Model (GLM)

Next, we fit a GLM with a binomial family. Because ses_status has three categories, we model one category at a time, starting with “low”:

model <- glm(I(ses_status == "low") ~ gender, family = binomial, data = dt)
summary(model)

The outcome I(ses_status == "low") is a binary indicator (in the category or not), and gender is the two-level predictor. The family = binomial argument specifies a binomial distribution for the outcome with the default logit link, so the gender coefficient is a log odds ratio. Repeating the fit for “middle” and “high” gives one model per category.
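To make the meaning of a binomial GLM coefficient concrete, here is a small sketch on hypothetical data (the toy data frame and its is_low column are assumptions for illustration). With a single binary predictor, the exponentiated coefficient should match the cross-product odds ratio of the 2x2 table, and plogis() converts the fitted log odds back to the group proportions.

```r
# Hypothetical data: 4 of 10 males and 2 of 10 females fall in the category
toy <- data.frame(
    gender = rep(c("M", "F"), each = 10),
    is_low = c(rep(TRUE, 4), rep(FALSE, 6),
               rep(TRUE, 2), rep(FALSE, 8))
)

fit <- glm(is_low ~ gender, family = binomial, data = toy)

# Log odds ratio for males vs females (the reference level is "F")
lor <- unname(coef(fit)["genderM"])

# Odds ratio computed by hand from the 2x2 counts
or_table <- (4 / 6) / (2 / 8)

# Fitted proportions recovered with the inverse logit
p_female <- unname(plogis(coef(fit)["(Intercept)"]))
p_male <- unname(plogis(sum(coef(fit))))

print(c(or_model = exp(lor), or_table = or_table))
print(c(p_male = p_male, p_female = p_female, diff = p_male - p_female))
```

The odds ratio and the difference in proportions answer different questions, which is why the fitted log odds have to be converted back to proportions before subtracting.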

Calculating Proportion Difference

From the model summary, we can extract the log odds ratio (LOR) comparing the two genders for each category of ses_status. We then proceed as follows:

  1. Take the standard error (se) of the log odds ratio from the coefficient table.
  2. Compute the z-score corresponding to the desired confidence level (95% in this case) with the inverse cumulative distribution function of the normal distribution (qnorm()).
  3. Build the confidence interval on the log-odds scale and back-transform it with exp().
  4. Convert the fitted log odds back to proportions for each gender and take their difference.

# Define z-score for 95% CI
z_score_95 <- qnorm(0.975)

# Categories of ses_status to compare between genders
ses_levels <- c("low", "middle", "high")

# Fit one logistic model per category and collect the estimates
ses_status_diff <- lapply(ses_levels, function(lev) {
    # Binary outcome: is the observation in category `lev`?
    fit <- glm(I(ses_status == lev) ~ gender, family = binomial, data = dt)
    coefs <- summary(fit)$coefficients

    # Log odds ratio (M vs F, with F as the reference level) and its standard error
    lor <- coefs["genderM", "Estimate"]
    se <- coefs["genderM", "Std. Error"]

    # Fitted proportion in each gender (inverse logit of the linear predictor)
    p_female <- plogis(coefs["(Intercept)", "Estimate"])
    p_male <- plogis(coefs["(Intercept)", "Estimate"] + lor)

    list(
        diff = p_male - p_female,            # difference in proportions
        or = exp(lor),                       # odds ratio
        lower = exp(lor - z_score_95 * se),  # 95% CI for the odds ratio
        upper = exp(lor + z_score_95 * se)
    )
})
names(ses_status_diff) <- ses_levels

# Print results
for (lev in ses_levels) {
    res <- ses_status_diff[[lev]]
    print(paste("Difference in Proportion of", lev, "between Male and Female:"))
    print(paste0(
        round(res$diff, 4), "; odds ratio ", round(res$or, 4),
        " (", round(res$lower, 4), ", ", round(res$upper, 4), ")"
    ))
}

The code above extracts the log odds ratio and its standard error for each category of ses_status, then computes 95% confidence intervals using the z-score from the inverse cumulative distribution function of the normal distribution (qnorm()). The results are printed in a readable format.
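As a cross-check on the model-based numbers, the difference in proportions and its confidence interval can also be obtained directly. The sketch below swaps in base R's prop.test() rather than a GLM or gtsummary, and the counts are hypothetical (100 people per gender, chosen only for illustration). With correct = FALSE, the interval is the usual Wald interval for a difference of two proportions.

```r
# Hypothetical counts per category (assumptions for illustration only)
counts_m <- c(low = 40, middle = 30, high = 30)  # males
counts_f <- c(low = 20, middle = 30, high = 50)  # females
n_m <- sum(counts_m)
n_f <- sum(counts_f)

# One two-sample test of proportions per category
results <- t(sapply(names(counts_m), function(lev) {
    res <- prop.test(
        x = c(counts_m[[lev]], counts_f[[lev]]),
        n = c(n_m, n_f),
        correct = FALSE  # no continuity correction: plain Wald interval
    )
    c(
        diff = unname(res$estimate[1] - res$estimate[2]),  # male minus female
        lower = res$conf.int[1],
        upper = res$conf.int[2]
    )
}))

print(round(results, 3))
```

Each row holds the male-minus-female difference for one category together with its 95% confidence interval. An interval that excludes zero suggests the category's share differs between the two groups.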

Conclusion

In this article, we demonstrated how to calculate the difference in proportions between two groups (male and female) for each of three categories (“low”, “middle”, and “high”) of a categorical variable in R, in the context of the gtsummary package. We created a sample dataset, fitted a generalized linear model with a binomial family, and extracted the estimates from the model summary to compute the proportion difference manually. This approach allows users to customize their output and visualize the results as needed.


Last modified on 2023-11-05