Transforming Size Variables in ggplot2: A Step-by-Step Guide

Understanding the Issue with Size Variables in ggplot2
=====================================================

In this article, we look at why the size values returned by layer_data() differ from the size variable originally passed to ggplot2, and walk step by step through how to get the original values back.

Problem Statement


The problem arises when using ggplot2 to create a scatter plot where a continuous variable is mapped to the size aesthetic. The size values returned by layer_data() no longer match the original variable, as if they had been mutated while the plot was built. We will demonstrate this with an example code snippet.

Example Code


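library(ggplot2)
library(dplyr)

# Simulate data with a continuous variable mapped to point size
set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()

# Extract the data ggplot2 computed for the layer; sort both by x to compare
p.data <- p.out %>% layer_data() %>% arrange(x)
d2 <- d %>% arrange(x)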

head(d2)
##           x           y      size
## 1 -2.345698 -0.50247778 0.7757949
## 2 -2.180040 -0.31611833 0.3802893
## 3 -1.806031 -0.37723765 0.2547007
## 4 -1.629093 -1.65010093 0.2722072
## 5 -1.448205  0.08005964 0.1999333
## 6 -1.390701 -1.12376279 0.5117742

p.data %>% select(size, x, y) %>% head()

##       size         x           y
## 1 5.407443 -2.345698 -0.50247778
## 2 4.084550 -2.180040 -0.31611833
## 3 3.523348 -1.806031 -0.37723765
## 4 3.608829 -1.629093 -1.65010093
## 5 3.234916 -1.448205  0.08005964
## 6 4.579018 -1.390701 -1.12376279

lm(y ~ x, p.data)

## Call:
## lm(formula = y ~ x, data = p.data)
## 
## Coefficients:
## (Intercept)            x  
##     0.03715     -0.02608  

lm(y ~ x, d)

## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##     0.03715     -0.02608  

cor(p.data$size, d2$size)

## [1] 0.9783827

lm(y ~ x, data = d, weights = size)

## Call:
## lm(formula = y ~ x, data = d, weights = size)
## 
## Coefficients:
## (Intercept)            x  
##    -0.02586     -0.11537  

lm(y ~ x, p.data, weights = size)

## Call:
## lm(formula = y ~ x, data = p.data, weights = size)
## 
## Coefficients:
## (Intercept)            x  
##     0.009372    -0.065445  

As the unweighted lm() calls show, the coefficients for (Intercept) and x are identical for the original data (d) and the layer data (p.data), so x and y pass through unchanged. The size column does not: its values differ from the originals, and the weighted fits give different coefficients. The correlation of about 0.978 rather than 1 shows that the size values have been transformed, and not by a simple linear rescaling.
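
The values in p.data are consistent with ggplot2's default continuous size scale, which rescales the variable to [0, 1], takes the square root so that point area rather than radius tracks the value, and maps the result onto the range c(1, 6). The following is a minimal sketch, assuming those defaults are in effect (no custom scale_size_*() call has been added); map_size is a hypothetical helper written for illustration, not a ggplot2 function.

```r
library(ggplot2)

set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()

# Hypothetical helper mimicking the default size scale (assumed range c(1, 6)):
# rescale to [0, 1], take the square root, then stretch to the output range.
map_size <- function(s, range = c(1, 6)) {
  unit <- (s - min(s)) / (max(s) - min(s))
  range[1] + diff(range) * sqrt(unit)
}

# Without sorting, the layer data is in the same row order as d,
# so the two vectors can be compared element by element.
mapped <- layer_data(p.out)$size
all.equal(mapped, map_size(d$size))
```

If the comparison holds on your ggplot2 version, the "mutation" is simply this scale being applied to the size column.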

Solution


To get back the original size variable, we can read the data stored in the ggplot object directly with p.out$data. Unlike layer_data(), this returns the data frame as it was passed to ggplot(), before any scale is applied. It is in the original row order, so we sort it by x to compare it with d2.

# Create the plot
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()

# Get the original data from the ggplot object, sorted by x to match d2
orig.data <- p.out$data %>% arrange(x)

# Print the data
head(orig.data)

##           x           y      size
## 1 -2.345698 -0.50247778 0.7757949
## 2 -2.180040 -0.31611833 0.3802893
## 3 -1.806031 -0.37723765 0.2547007
## 4 -1.629093 -1.65010093 0.2722072
## 5 -1.448205  0.08005964 0.1999333
## 6 -1.390701 -1.12376279 0.5117742

As we can see, the size variable in orig.data is identical to the original size variable in d.
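
If only the layer data is at hand, the scaled sizes can also be converted back, provided the minimum and maximum of the original variable are known. This is a sketch under the assumption that the default size scale with range c(1, 6) was used; unmap_size is a hypothetical helper, not part of ggplot2.

```r
library(ggplot2)

set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))
p.out <- ggplot(d, aes(x, y, size = size)) + geom_point()
p.data <- layer_data(p.out)

# Hypothetical inverse of the default size scale (assumed range c(1, 6)):
# undo the stretch, square to undo the square root, then restore the limits.
unmap_size <- function(mapped, limits, range = c(1, 6)) {
  unit <- ((mapped - range[1]) / diff(range))^2
  limits[1] + unit * diff(limits)
}

# The limits are taken from the original data stored in the plot object
recovered <- unmap_size(p.data$size, limits = range(p.out$data$size))
all.equal(recovered, d$size)
```

In practice, reading p.out$data is simpler and does not depend on the scale settings; the inverse is mainly useful for confirming what the scale did.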

Conclusion


In conclusion, when a continuous variable is mapped to the size aesthetic in a ggplot2 scatter plot, the size values returned by layer_data() are the scaled values used for drawing, not the original ones. Reading the data from the ggplot object directly with p.out$data gives back the original size variable.

Further Discussion


The change in the size variable comes from how ggplot2 scales aesthetics, not from any modelling of weights. A continuous variable mapped to size goes through the default size scale, which rescales the values to [0, 1], takes the square root so that point area rather than radius tracks the value, and maps the result onto the range c(1, 6). layer_data() returns these scaled values, while p.out$data still holds the originals, so no information is lost.

To control this behaviour, set the scale explicitly: scale_size(range = ...) changes the output range, and scale_size_identity() uses the values exactly as supplied.

Understanding how ggplot2 scales aesthetics makes it easier to tell the plotting values apart from the underlying data, for example when choosing weights for a model.
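
As an illustration of leaving the values untouched, the sketch below adds scale_size_identity() to the same plot and compares the layer data with the input. The object name p.id is chosen here for illustration only.

```r
library(ggplot2)

set.seed(1234)
d <- data.frame(x = rnorm(100), y = rnorm(100), size = runif(100))

# scale_size_identity() passes the mapped values through without rescaling
p.id <- ggplot(d, aes(x, y, size = size)) +
  geom_point() +
  scale_size_identity()

# The size column of the layer data can now be compared with the input
all.equal(layer_data(p.id)$size, d$size)
```

Note that with an identity scale the raw values are used directly as point sizes, so values between 0 and 1 produce very small points; this is a trade-off between readable plots and unmodified layer data.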

Additional Resources


We hope this article has helped clarify why the size values in p.data differ from the original size variable and how to get the original values back.


Last modified on 2023-09-15