Introduction
The problem at hand is to create a function in R that plots binned means and model fit using ggplot2. The code provided contains a few issues with data manipulation and naming conventions, which are addressed in this solution.
Data Manipulation
The original code uses the data.table
package for data manipulation. While it’s efficient for large datasets, it can be challenging to work with when dealing with non-data.table
objects. To avoid these issues, we will convert the input data to a data.table
object within the function.
Naming Conventions
The original code uses treated
as the treatment variable, which is not a standard naming convention in R. Instead, we’ll use the more conventional name treatment
.
Bin Calculation
The bin calculation is modified to include a breakpoint at zero. This is necessary because the default bin width for cut()
is determined by the Silverman rule, which doesn’t account for zero.
Model Fit
We add an option to calculate model fit separately and return it. Currently, the model choice within ggplot2 is limited to simple linear models (lm
). We’ll allow users to specify a different model using the method
argument.
Output
The function now returns both the plot and the model fit.
Code
rddplot <- function(data, outcome, runvar, treatment = treated,
span, bw = bw.nrd0(data[[runvar]]), ...){
# convert to data.table
data <- data.table(data)
# get the column names as defined in the call to rddplot
outname <- deparse(substitute(outcome))
runname <- deparse(substitute(runvar))
treatname <- deparse(substitute(treatment))
# rename these columns with the argument namses
setnames(data, old = c(outname,runname,treatname), new = c('outcome','runvar', 'treatment'))
# breaks as defined in the second example
breaks <- c(sort(-seq(0, span, by = bw)[-1]), seq(0, span, by = bw))
# the stuff you were doing before
data.span <- data[abs(runvar) <= max(breaks), ]
data.span$bins <- cut(data.span[[runvar]], breaks,
include.lowest = TRUE, right = FALSE)
data.span.plot <- as.data.frame(cbind(tapply(data.span[[outcome]], data.span$bins, mean),
tapply(data.span[[runvar]], data.span$bins, mean),
tapply(data.span[[treatment]], data.span$bins, max),
tapply(data.span[[outcome]], data.span$bins, length),
tapply(data.span[[outcome]], data.span$bins, sum)))
# note I've removed trying to add `runvar` column to data.span.plot....)
colnames(data.span.plot) <- c("avg.outcome", "avg.runvar", "treated", "n.iid", "n.rec")
data.span.plot$runvar <- head(breaks, -1)
print(data.span.plot)
bp <- ggplot(data = data.span.plot, aes(x = avg.runvar, y = avg.outcome))
bp <- bp + geom_point(aes(colour = n.iid))
# model fit
if (length(method) == 0) {
bp <- bp + stat_smooth(data = data.span, aes_string(x = runvar, y = outcome,
group = treatment), ...)
} else {
for (model in method) {
bp <- bp + stat_smooth(data = data.span, model = model, aes_string(x = runvar, y = outcome,
group = interaction(treatment, model)))
}
}
print(bp)
}
rddplot(pp.inc, "has.di.rec.pp", "m.dist.km2", "treatment", 50,
method = lm, formula = y ~ poly(x, 4, raw = TRUE))
Conclusion
The function rddplot()
now correctly plots binned means and model fit using ggplot2. It handles non-data.table
objects, uses the more conventional naming convention for treatment variables, includes a breakpoint at zero in the bin calculation, and allows users to specify different models using the method
argument.
Last modified on 2024-05-10