Introduction to R: Visualizations with ggplot2

Malte Bonart

Basics

Plotting systems in R

  • base graphics
  • grid and lattice
  • ggplot2

Background

  • system for declaratively creating graphics, based on the “grammar of graphics”
  • special syntax: components are added using the + operator
  • relevant data to plot should be collected in a data.frame
  • produces high quality plots and takes care of many details

Workflow

  • initialize a plot with the ggplot() function
  • supply a dataset
  • map aesthetics to variables: x, y, color, group
  • add layers (geoms): points, lines, histograms
  • add scales, faceting specification, coordinate systems, themes
  • save the plot
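
The workflow steps above can be sketched end to end as follows. This is a minimal sketch using the built-in faithful dataset; the output file name and the plot dimensions passed to ggsave() are arbitrary assumptions, not part of the original slides.

```r
library(ggplot2)

# 1. initialize the plot, supply a dataset, map aesthetics to variables
p <- ggplot(faithful, aes(x = waiting, y = eruptions))

# 2. add a layer (geom)
p <- p + geom_point()

# 3. add a theme
p <- p + theme_minimal()

# 4. save the plot (assumed output path: a PDF in the session's temp directory)
out_file <- file.path(tempdir(), "faithful.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```

Storing the plot in a variable (here p) makes it possible to add components step by step and to pass the finished object to ggsave() explicitly.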

Installation of the ggplot2 package

install.packages("ggplot2")

Step by step guide

The data

use the built-in dataset faithful

?faithful
head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

Initialize the plot

variables from the source data frame must be referenced inside the aes() function

library(ggplot2)
ggplot(faithful, aes(x = waiting, y = eruptions))

Add some points

additional layers have to be “added” with the + operator

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

Add a linear trend line

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() + 
  geom_smooth(method = 'lm')

Change the color and size

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(color = "steelblue", size = 0.9) + 
  geom_smooth(method = 'lm', color = "black")
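
The color above is set to a constant outside aes(). The contrast with mapping a variable inside aes() can be sketched as follows. The column long_wait and the threshold of 70 minutes are hypothetical and only serve as an illustration.

```r
library(ggplot2)

# setting: a constant outside aes() applies to every point, no legend
p_set <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(color = "steelblue")

# mapping: a variable inside aes() drives the color and creates a legend
# (long_wait is a hypothetical helper column, not part of faithful)
faithful2 <- transform(faithful, long_wait = waiting > 70)
p_map <- ggplot(faithful2, aes(x = waiting, y = eruptions, color = long_wait)) +
  geom_point()
```

Printing p_set and p_map side by side shows one uniformly colored point cloud and one colored by group.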

Add some labels

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(color = "steelblue", size = 0.9) + 
  geom_smooth(method = 'lm', color = "black") + 
  labs(title = "Old Faithful Geyser Data", 
       subtitle = "Waiting time between eruptions and the duration of the eruption", 
       x = "waiting time in mins", y = "eruption time in mins"
       )

Change the theme

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(color = "steelblue", size = 0.9) + 
  geom_smooth(method = 'lm', color = "black") + 
  labs(title = "Old Faithful Geyser Data", 
       subtitle = "Waiting time between eruptions and the duration of the eruption", 
       x = "waiting time in mins", y = "eruption time in mins"
       ) + 
  theme_minimal()

Simple Linear Regression

m <- lm(eruptions ~ waiting, data = faithful)
m <- summary(m)
m

Call:
lm(formula = eruptions ~ waiting, data = faithful)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.29917 -0.37689  0.03508  0.34909  1.19329 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4965 on 270 degrees of freedom
Multiple R-squared:  0.8115,    Adjusted R-squared:  0.8108 
F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
rSquared <- round(m$r.squared, 2)
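
To connect the summary output with the trend line, the fitted coefficients can be used directly. The following is a minimal sketch; the object names fit, pred_80 and r2_manual and the waiting time of 80 minutes are chosen for illustration only.

```r
# fit the same model as above and keep the model object
fit <- lm(eruptions ~ waiting, data = faithful)
coefs <- coef(fit)

# predicted eruption duration for a waiting time of 80 minutes:
# intercept + slope * 80, about 4.18 minutes with the estimates shown above
pred_80 <- unname(coefs["(Intercept)"] + coefs["waiting"] * 80)

# R^2 computed by hand as 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((faithful$eruptions - mean(faithful$eruptions))^2)
r2_manual <- 1 - rss / tss
```

r2_manual should agree with the "Multiple R-squared" value reported by summary().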

Add R^2 to the plot

ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point(color = "steelblue", size = 0.9) + 
  geom_smooth(method = 'lm', color = "black") + 
  labs(title = "Old Faithful Geyser Data", 
       subtitle = "Waiting time between eruptions and the duration of the eruption", 
       x = "waiting time in mins", y = "eruption time in mins"
       ) + 
  # annotate() draws the label once; geom_label() would repeat it for every row
  annotate("label", x = 90, y = 1.5, size = 4,
           label = paste("R-squared:", rSquared))

Basic examples

Read in the data

titanic <- read.csv("./www/titanic.csv", stringsAsFactors = FALSE)
soccer <- read.csv2("./www/football.csv", stringsAsFactors = FALSE)
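
The two calls above use different readers: read.csv() expects comma-separated fields with a decimal point, read.csv2() expects semicolon-separated fields with a decimal comma. A minimal sketch with made-up inline text (the column names and values are hypothetical and not taken from football.csv):

```r
# hypothetical semicolon-separated data with decimal commas
csv_text <- "team;goals;avg\nA;3;1,5\nB;1;0,5"

# read.csv2() uses sep = ";" and dec = "," by default
d <- read.csv2(text = csv_text, stringsAsFactors = FALSE)
str(d)
```

With read.csv2() the avg column is parsed as numeric; read.csv() would read each line of this text into a single column.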

Histogram with subplots

ggplot(titanic, aes(x = age)) + 
  geom_histogram(binwidth = 5, na.rm = TRUE) +
  facet_grid(~ sex)

Bar chart by groups

ggplot(titanic, aes(x = pclass, y = age, fill = sex)) + 
  geom_bar(stat = "summary", fun = "mean", position = "dodge")  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
Warning: Removed 263 rows containing non-finite values (stat_summary).
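
The bar heights are group means, and rows with a missing age are dropped before averaging, which is what the warning reports. A minimal sketch of the same computation in base R, using a hypothetical stand-in data frame (toy, with invented values) instead of the titanic file:

```r
# hypothetical stand-in for the titanic data (values invented for illustration)
toy <- data.frame(
  pclass = c(1, 1, 1, 1, 2, 2, 2, 2),
  sex    = rep(c("female", "male"), each = 2, times = 2),
  age    = c(30, 34, 40, NA, 20, 24, 28, 32)
)

# mean age per class and sex; the formula interface omits rows with NA
means <- aggregate(age ~ pclass + sex, data = toy, FUN = mean)
means
```

Each row of means corresponds to one bar in a dodged bar chart of the toy data.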

Boxplot by groups

titanic$pclass <- as.character(titanic$pclass)
ggplot(titanic, aes(y = age, x = pclass)) + 
  geom_boxplot()
Warning: Removed 263 rows containing non-finite values (stat_boxplot).

Visualize contingency tables

ggplot(titanic, aes(y = sex, x = pclass)) + 
  geom_count()
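
geom_count() scales each point by the number of observations in that cell, so the plot mirrors a contingency table. A minimal sketch using the built-in mtcars dataset as a stand-in (an assumption, since the titanic file is not bundled with R):

```r
library(ggplot2)

# contingency table of cylinders by gears
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
tab

# the same counts as point sizes; factor() treats the numbers as categories
p <- ggplot(mtcars, aes(x = factor(gear), y = factor(cyl))) +
  geom_count()
```

The largest point in p corresponds to the largest cell of tab.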

Count entries by year

library(lubridate)
soccer$date <- as_date(soccer$date)
soccer$year <- year(soccer$date)
ggplot(soccer, aes(x = year)) + 
  geom_line(stat = "count")
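
stat = "count" lets ggplot2 count the rows per year. The same numbers can be computed beforehand and plotted directly. A minimal sketch in base R with a few invented dates (the toy data frame is hypothetical and does not come from football.csv):

```r
library(ggplot2)

# hypothetical match dates
toy <- data.frame(date = as.Date(c("2015-08-14", "2015-12-01",
                                   "2016-03-05", "2016-09-10", "2016-11-20")))
toy$year <- as.integer(format(toy$date, "%Y"))

# count entries per year explicitly
counts <- as.data.frame(table(year = toy$year), stringsAsFactors = FALSE)
counts$year <- as.integer(counts$year)

# plot the precomputed counts; equivalent to geom_line(stat = "count") on toy
p <- ggplot(counts, aes(x = year, y = Freq)) +
  geom_line()
```

Precomputing the counts is useful when the table itself is needed, for example to report the numbers next to the plot.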
