Understanding the Issue with Size Variables in ggplot2
=====================================================
In this article, we explore an issue with size variables in ggplot2 and walk step by step through how to get from the `size` variable in `p.data` back to the original `size` variable.
Problem Statement
-----------------
The problem arises when using ggplot2 to create a scatter plot in which a continuous variable is mapped to the `size` aesthetic. The `size` column returned by `layer_data()` no longer matches the original values: it holds the sizes ggplot2 actually draws, not the data that was passed in. The example below demonstrates this.
Example Code
------------
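library(ggplot2)
library(dplyr)
set.seed(1234)
# Simulated data: size is a continuous variable in [0, 1]
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()
# layer_data() returns the data as drawn, after the scales have been applied
p.data <- p.out %>% layer_data() %>% arrange(x)
# Sort the original data the same way so the rows line up
d2 <- d %>% arrange(x)
head(d2)
## x y size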
## 1 -2.345698 -0.50247778 0.7757949
## 2 -2.180040 -0.31611833 0.3802893
## 3 -1.806031 -0.37723765 0.2547007
## 4 -1.629093 -1.65010093 0.2722072
## 5 -1.448205 0.08005964 0.1999333
## 6 -1.390701 -1.12376279 0.5117742
p.data %>% select(size, x, y) %>% head()
## size x y
## 1 5.407443 -2.345698 -0.50247778
## 2 4.084550 -2.180040 -0.31611833
## 3 3.523348 -1.806031 -0.37723765
## 4 3.608829 -1.629093 -1.65010093
## 5 3.234916 -1.448205 0.08005964
## 6 4.579018 -1.390701 -1.12376279
lm(y ~ x, p.data)
## Call:
## lm(formula = y ~ x, data = p.data)
##
## Coefficients:
## (Intercept) x
## 0.03715 -0.02608
lm(y ~ x, d)
## Call:
## lm(formula = y ~ x, data = d)
##
## Coefficients:
## (Intercept) x
## 0.03715 -0.02608
cor(p.data$size, d2$size)
## [1] 0.9783827
lm(y ~ x, data = d, weights = size)
## Call:
## lm(formula = y ~ x, data = d, weights = size)
##
## Coefficients:
## (Intercept) x
## -0.02586 -0.11537
lm(y ~ x, p.data, weights = size)
## Call:
## lm(formula = y ~ x, data = p.data, weights = size)
##
## Coefficients:
## (Intercept) x
## 0.009372 -0.065445
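As the unweighted `lm()` calls show, the coefficients for `(Intercept)` and `x` are identical whether the model is fitted to the original data (`d`) or to the layer data (`p.data`), because `x` and `y` pass through unchanged. The `size` column is a different story. The values shown in `p.data` lie between about 3 and 5.5 rather than between 0 and 1, their correlation with the original `size` is high (0.978) but not exactly 1, and the two weighted fits disagree. Together these point to a monotone but nonlinear transformation of `size`.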
Solution
--------
To get back the original `size` variable, we can read the data from the ggplot object directly using `p.out$data`, which holds the data frame that was passed to `ggplot()`.
# Create the plot
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()
# Get the data from the ggplot object, sorted by x to line up with d2 and p.data
data <- p.out$data %>% arrange(x)
# Print the data
head(data)
## x y size
## 1 -2.345698 -0.50247778 0.7757949
## 2 -2.180040 -0.31611833 0.3802893
## 3 -1.806031 -0.37723765 0.2547007
## 4 -1.629093 -1.65010093 0.2722072
## 5 -1.448205 0.08005964 0.1999333
## 6 -1.390701 -1.12376279 0.5117742
As we can see, the `size` variable in `data` is identical to the original `size` variable in `d`.
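If the other computed columns from `layer_data()` are still needed, one option is to attach the original `size` to the layer data. The sketch below assumes that `layer_data()` returns rows in the same order as the input data, which it checks before copying. The names `p.layer`, `size_orig`, `fit.layer` and `fit.orig` are illustrative and do not appear in the example above.

```r
library(ggplot2)

set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()

# Layer data in its original (unsorted) row order
p.layer <- layer_data(p.out)

# Assumption: rows of the layer data line up with rows of the input data.
# Stop early if x or y disagree, rather than silently misaligning sizes.
stopifnot(isTRUE(all.equal(p.layer$x, p.out$data$x)),
          isTRUE(all.equal(p.layer$y, p.out$data$y)))

# Keep the drawn size and add the original values alongside it
p.layer$size_orig <- p.out$data$size

# Weighted fits now agree, because both use the original size values
fit.layer <- lm(y ~ x, data = p.layer, weights = size_orig)
fit.orig  <- lm(y ~ x, data = d, weights = size)
all.equal(coef(fit.layer), coef(fit.orig))
```

With the original values restored as weights, the fit on the layer data should match the weighted fit on `d`.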
Conclusion
----------
In conclusion, when a variable is mapped to the `size` aesthetic in a ggplot2 scatter plot, the `size` column in the layer data holds rescaled values rather than the original ones. Reading the data from the ggplot object directly with `p.out$data` gives back the original `size` variable.
Further Discussion
------------------
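The change in `size` has nothing to do with weights or model fitting; ggplot2 does not fit the `lm()` models above. It comes from the size scale. A continuous variable mapped to `size` goes through `scale_size_continuous()`, which by default rescales the values to [0, 1] using the data limits, takes the square root so that point area is proportional to the value, and maps the result onto the output range `c(1, 6)`. This explains the high but imperfect correlation with the original values. No information is lost: the mapping is monotone and can be inverted when the limits and the range are known.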
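The default size scale can be reproduced, and inverted, by hand. This is a minimal sketch assuming ggplot2 defaults: `scale_size_continuous()` with `range = c(1, 6)`, limits equal to the range of the data, and square-root (area) scaling. The helper names `to_plot_size()` and `to_data_size()` are illustrative and not part of ggplot2.

```r
library(ggplot2)

set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()
p.data <- layer_data(p.out)

# Assumed defaults: limits are the data range, output range is c(1, 6)
lims <- range(d$size)
rng  <- c(1, 6)

# Forward: rescale to [0, 1], take the square root, map onto the output range
to_plot_size <- function(s) {
  rng[1] + diff(rng) * sqrt((s - lims[1]) / diff(lims))
}

# Inverse: undo the same three steps in reverse order
to_data_size <- function(p) {
  lims[1] + diff(lims) * ((p - rng[1]) / diff(rng))^2
}

# Compare the hand-computed values with what ggplot2 produced
all.equal(to_plot_size(d$size), p.data$size)
all.equal(to_data_size(p.data$size), d$size)
```

The inverse needs the original limits, so it only helps when `range(d$size)` is known. If the scale was customised (different `range`, `limits` or `trans`), the two helpers would have to change accordingly.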
If the plotted sizes must equal the data values, add `scale_size_identity()` to the plot; to change only the output range, pass `range` to `scale_size()`. For modelling, use the original data frame (`d` or `p.out$data`) rather than the layer data.
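As a hedged illustration of the identity scale, the sketch below rebuilds the plot with `scale_size_identity()` and checks that the layer data then carries the original values. The names `p.id` and `p.id.data` are illustrative.

```r
library(ggplot2)

set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))

# scale_size_identity() passes the data values through without rescaling
p.id <- ggplot(d, aes(x, y, size = size)) +
  geom_point() +
  scale_size_identity()

p.id.data <- layer_data(p.id)

# The size column in the layer data should now equal the original size
all.equal(p.id.data$size, d$size)
```

Note that the values are then used as point sizes directly, so data between 0 and 1 will draw very small points. This is a trade-off between faithful layer data and a readable plot.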
By understanding how ggplot2 applies scales to aesthetics, we can tell plotted values apart from data values and avoid feeding rescaled sizes into models as weights.
Last modified on 2023-09-15