Introduction to R: Further topics

Malte Bonart

How to continue?

Two additional important data structures in R
- list
- matrix
Iteration over data structures with the lapply function
Aggregation and grouping with the dplyr package
Linear regression models with the lm function
Time and datetime values in R with the lubridate package
String manipulation with the stringr package

Save datasets

titanic <- read.csv(file = "./www/titanic.csv", stringsAsFactors = FALSE)

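# text format (row.names = FALSE avoids writing an extra index column)
write.csv(titanic, file = "./www/titanic.csv", row.names = FALSE)

# semicolon-separated variant; read it back with read.csv(..., sep = ";")
write.table(titanic, file = "./www/titanic.csv",
            sep = ";", row.names = FALSE)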


# binary 'R' format
saveRDS(titanic, file = "./www/titanic.rds") 

save(titanic, file = "./www/titanic.RData")
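
The binary files can be read back with readRDS() and load(). The following is a minimal sketch; the small data frame `df` and the temporary file paths are stand-ins chosen for illustration and do not come from the slides.

```r
# Stand-in data frame (hypothetical; used here instead of the titanic data)
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

rds_path <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# saveRDS() stores a single object without its name;
# readRDS() returns the object, so it can be assigned to any variable
saveRDS(df, file = rds_path)
df_rds <- readRDS(rds_path)

# save() stores one or more objects together with their names;
# load() restores them under those names and returns the names
save(df, file = rdata_path)
rm(df)
restored <- load(rdata_path)

identical(df, df_rds)
restored
```

readRDS() is usually the safer choice for a single dataset, because load() silently overwrites any existing object with the same name.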

Overview of important data structures

	Homogeneous	Heterogeneous
One-dimensional	vector c()	list list()
Two-dimensional	matrix matrix()	data.frame data.frame()
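
The four structures in the table can be created as follows. This is an illustrative sketch with made-up values; the variable names are arbitrary.

```r
v  <- c(1, 2, 3)                            # one-dimensional, homogeneous
l  <- list(1, "a", TRUE)                    # one-dimensional, heterogeneous
m  <- matrix(1:6, nrow = 2)                 # two-dimensional, homogeneous
df <- data.frame(x = 1:2, y = c("a", "b"))  # two-dimensional, heterogeneous

# Mixing types in a vector coerces all entries to one common type
class(c(1, "a"))

# A list keeps the type of each entry
sapply(l, class)

# A data frame is internally a list of equally long columns
is.list(df)
```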

Lists

Introduction

One-dimensional data structure
Each entry can be an arbitrary data structure (vector, matrix, data frame, function, list)
“Lists are like freight trains”
Useful inside functions to return several results with different data structures
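
As a sketch of the last point, a function can bundle several results of different types into one list. The helper `describe()` and its element names are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: summarise a numeric vector and return several results
describe <- function(x) {
  list(
    n = length(x),               # a single number
    mean = mean(x),              # a single number
    range = range(x),            # a vector of length two
    above_mean = x[x > mean(x)]  # a vector of varying length
  )
}

res <- describe(c(2, 4, 6, 8))
res$n
res$range
res$above_mean
```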

Example

l <- list(values = 1:10, fun = sum, innerlist = list(1, 2))
l

$values
 [1]  1  2  3  4  5  6  7  8  9 10

$fun
function (..., na.rm = FALSE)  .Primitive("sum")

$innerlist
$innerlist[[1]]
[1] 1

$innerlist[[2]]
[1] 2

length(l)

[1] 3

List element extraction: `L[...]`

Result of an extraction is always a list
Methods of extraction are the same as for vectors or data.frames
- logical vector
- positive (inclusive) or negative (exclusive) indices
- character vector (if list has names)

l[c(TRUE, FALSE, FALSE)]

$values
 [1]  1  2  3  4  5  6  7  8  9 10

l[c("innerlist", "fun")]

$innerlist
$innerlist[[1]]
[1] 1

$innerlist[[2]]
[1] 2


$fun
function (..., na.rm = FALSE)  .Primitive("sum")

Data extraction: `L[[...]]`

Extracts the content inside one list element
Only one element can be extracted
$-sign extraction possible (if list has names)

l[[1]]

 [1]  1  2  3  4  5  6  7  8  9 10

l$values

 [1]  1  2  3  4  5  6  7  8  9 10

l[["fun"]]

function (..., na.rm = FALSE)  .Primitive("sum")

l[["fun"]](c(1, 2, 3)) # directly apply the 'sum' function

[1] 6
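
The difference between the two extraction operators can be checked directly. The sketch below reuses the list `l` from the example slide.

```r
l <- list(values = 1:10, fun = sum, innerlist = list(1, 2))

class(l["values"])    # "list": still a list, now of length one
class(l[["values"]])  # "integer": the vector stored inside the element

# Negative indices drop elements; the result is again a list
names(l[-1])

# [[ ]] can be chained to reach into nested lists
l[["innerlist"]][[2]]
```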

Matrices

Introduction

Two-dimensional data structure where each entry has the same data type
- Like a vector but two-dimensional
- Like a data.frame but only one data type allowed (numeric, logical, character)
Special operators and functions for matrix algebra
- Inversion: solve()
- Multiplication: %*%
- Transpose: t()
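
A short sketch of the matrix algebra functions listed above. The matrix `A` and the vector `b` are arbitrary example values.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # symmetric 2 x 2 matrix
b <- c(1, 2)

A_inv <- solve(A)  # inverse of A
A %*% A_inv        # matrix product: identity matrix up to rounding error
t(A)               # transpose

# Note: * multiplies element by element, it is not the matrix product
A * A

# solve(A, b) solves the linear system A x = b without forming the inverse
x <- solve(A, b)
A %*% x
```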

Creation

matrix() function with at least two arguments:
- a vector c() of elements
- nrow or ncol specification
The matrix is filled column by column unless byrow = TRUE

Example

mat <- matrix(c(1, 3, 2, 6, 4, 5), ncol = 3)
mat

     [,1] [,2] [,3]
[1,]    1    2    4
[2,]    3    6    5

matrix(c(1, 3, 2, 6, 4, 5), ncol = 3, byrow = TRUE)

     [,1] [,2] [,3]
[1,]    1    3    2
[2,]    6    4    5

c(ncol(mat), nrow(mat))

[1] 3 2

Extraction `M[i, j]`

Methods of extraction are the same as for vectors or data.frames
- logical vector
- positive (inclusive) or negative (exclusive) indices
- character vector (if matrix has rownames or colnames)

mat[c(1, 2), c(1, 2)]

     [,1] [,2]
[1,]    1    2
[2,]    3    6
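
The other extraction methods work analogously. The sketch below reuses `mat` from the example slide; the row and column names are added here for illustration only.

```r
mat <- matrix(c(1, 3, 2, 6, 4, 5), ncol = 3)
rownames(mat) <- c("r1", "r2")  # names assumed for this example
colnames(mat) <- c("a", "b", "c")

mat["r1", c("a", "c")]  # character vectors select by name
mat[, -1]               # negative index drops the first column
mat[mat[, "a"] > 1, ]   # logical vector selects rows where column a exceeds 1

# Selecting a single row or column returns a vector;
# drop = FALSE keeps the result a matrix
dim(mat[1, , drop = FALSE])
```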

The lapply family

Idea

apply: iteration over the columns or rows of a matrix
lapply, sapply: iteration over a vector, a list or the columns of a data.frame
avoids code repetition and errors
code becomes easier to read
alternative to for loops

The sapply function

sapply(X, FUN, ...)

X is a list, a vector or a data frame
FUN is a function which is applied to each column of a data frame, each entry of a list or each value of a vector
... are additional arguments that are passed on to FUN
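
A minimal sketch of how additional arguments are passed through `...`. The list `scores` is made-up example data.

```r
scores <- list(a = c(1, 2, NA), b = c(4, NA, 6), c = c(7, 8, 9))

# Without further arguments, mean() returns NA for entries with missing values
sapply(scores, mean)

# na.rm = TRUE is handed on to mean() for every entry
sapply(scores, mean, na.rm = TRUE)
```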

Example I

titanic <- read.csv("./www/titanic.csv", stringsAsFactors = FALSE)
countNA <- function(x){
  return(sum(is.na(x)))
}
sapply(titanic, countNA)

     X.1        X   pclass survived     name      sex      age embarked 
       0        0        0        0        0        0      263        0

Example II

l <- list(rnorm(100), rnorm(100, mean = 3), rnorm(100, mean = -3))
lapply(l, mean)

[[1]]
[1] -0.04727389

[[2]]
[1] 2.999733

[[3]]
[1] -2.786998
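
lapply always returns a list, while sapply tries to simplify the result to a vector or matrix. The sketch below uses a small deterministic list instead of random numbers so that the results are easy to check; vapply is shown as a stricter base R alternative that is not covered on the slides.

```r
l <- list(a = 1:3, b = 4:6, c = 7:9)

res_list <- lapply(l, mean)  # list with one element per entry
res_vec  <- sapply(l, mean)  # simplified to a named numeric vector

# vapply() additionally checks that every result has the declared type and length
res_safe <- vapply(l, mean, FUN.VALUE = numeric(1))

# If FUN returns vectors of equal length, sapply() builds a matrix
res_mat <- sapply(l, range)
dim(res_mat)
```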

`for` loop

result <- rep(NA, 100)
for(i in 1:100){
  result[i] <- mean(rnorm(n = 1000))
}

`sapply` loop

result <- sapply(1:100, function(i) {
  mean(rnorm(n = 1000))
})
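
Both variants draw the same random numbers when the random seed is fixed beforehand, so they can be compared directly. This is a sketch; the seed value is arbitrary, and replicate() is an additional base R shortcut not mentioned on the slides.

```r
set.seed(42)
res_for <- numeric(100)
for (i in seq_len(100)) {
  res_for[i] <- mean(rnorm(n = 1000))
}

set.seed(42)
res_sapply <- sapply(1:100, function(i) mean(rnorm(n = 1000)))

# replicate() repeats an expression without needing an index variable
set.seed(42)
res_rep <- replicate(100, mean(rnorm(n = 1000)))

all.equal(res_for, res_sapply)
all.equal(res_for, res_rep)
```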

`dplyr` package: Grouping and aggregation of data frames

Introduction

The dplyr package provides a powerful grammar of data manipulation
Commands are chained together with the piping operator %>%
Easier and cleaner way for a lot of common data.frame operations:
- grouping
- sorting
- filtering
- aggregation
- conversion between long and wide format (with the tidyr package)

Example: Grouping and aggregation

library(dplyr)
titanic <- read.csv("./www/titanic.csv", stringsAsFactors = FALSE)
titanic %>%
  group_by(pclass, sex) %>%
  summarize(survived = mean(survived), meanAge = mean(age, na.rm = TRUE))

# A tibble: 6 x 4
# Groups:   pclass [?]
  pclass sex    survived meanAge
   <int> <chr>     <dbl>   <dbl>
1      1 female    0.965    37.0
2      1 male      0.341    41.0
3      2 female    0.887    27.5
4      2 male      0.146    30.8
5      3 female    0.491    22.2
6      3 male      0.152    26.0
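
Filtering and sorting fit into the same chain. The following sketch assumes that dplyr is installed and uses a small made-up data frame `passengers`, not the titanic data.

```r
library(dplyr)

# Hypothetical toy data (not the titanic dataset)
passengers <- data.frame(
  pclass = c(1, 1, 2, 3, 3, 3),
  sex = c("female", "male", "female", "male", "male", "female"),
  age = c(29, 40, NA, 22, 35, 18),
  stringsAsFactors = FALSE
)

result <- passengers %>%
  filter(!is.na(age)) %>%                     # keep rows with a known age
  group_by(pclass) %>%                        # one group per passenger class
  summarize(n = n(), meanAge = mean(age)) %>% # aggregate each group
  arrange(desc(meanAge))                      # sort by mean age, descending

result
```

Because the only passenger in class 2 has a missing age, that class does not appear in the result.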

Linear regression models

Introduction

Linear regression is used in many areas
Models the linear relationship between an outcome $y$ and $k$ independent variables $x_j, j = 1, \dots, k$
$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \epsilon_i$
$\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
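
The model can be illustrated by simulating data from it and recovering the parameters with lm. This is a sketch with one independent variable; the true values ($\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 0.5$), the sample size and the seed are arbitrary assumptions.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)

# y_i = beta_0 + beta_1 * x_i + epsilon_i with epsilon_i ~ N(0, 0.5^2)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)
coef(fit)   # estimates of beta_0 and beta_1, close to 1 and 2
sigma(fit)  # residual standard error, an estimate of sigma
```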

Simple linear regression in R

titanic <- read.csv("./www/titanic.csv", stringsAsFactors = FALSE)
model <- lm(survived ~ sex, data = titanic)
summary(model)


Call:
lm(formula = survived ~ sex, data = titanic)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7275 -0.1910 -0.1910  0.2725  0.8090 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.72747    0.01912   38.05   <2e-16 ***
sexmale     -0.53648    0.02382  -22.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4127 on 1307 degrees of freedom
Multiple R-squared:  0.2795,    Adjusted R-squared:  0.279 
F-statistic: 507.1 on 1 and 1307 DF,  p-value: < 2.2e-16
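
With a single binary predictor, the coefficients are group means: in the output above the intercept 0.72747 is the survival rate of women, and 0.72747 - 0.53648 = 0.19099 is the survival rate of men. The sketch below verifies this relationship on a small made-up data frame `toy`.

```r
# Hypothetical toy data (not the titanic dataset)
toy <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 0, 1),
  sex = c("female", "female", "female", "female",
          "male", "male", "male", "male")
)

fit <- lm(survived ~ sex, data = toy)
group_means <- tapply(toy$survived, toy$sex, mean)

coef(fit)    # intercept and difference between the groups
group_means  # mean of survived per group

# Intercept = mean of the reference group (female),
# intercept + sexmale = mean of the other group (male)
coef(fit)[["(Intercept)"]] + coef(fit)[["sexmale"]]
```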

Categorical variables

titanic$pclass <- as.factor(titanic$pclass)
model <- lm(survived ~ pclass, data = titanic)
summary(model)


Call:
lm(formula = survived ~ pclass, data = titanic)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6192 -0.2553 -0.2553  0.3808  0.7447 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.61920    0.02571  24.084  < 2e-16 ***
pclass2     -0.18959    0.03784  -5.011 6.17e-07 ***
pclass3     -0.36391    0.03102 -11.732  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4621 on 1306 degrees of freedom
Multiple R-squared:  0.09768,   Adjusted R-squared:  0.0963 
F-statistic: 70.69 on 2 and 1306 DF,  p-value: < 2.2e-16
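
A factor enters the model as a set of dummy variables, with the first level as the reference category. In the output above, first class is the reference, and the coefficients pclass2 and pclass3 are differences from it. The sketch below shows the dummy coding on a short made-up factor and how relevel() changes the reference category.

```r
pclass <- factor(c(1, 2, 3, 1, 3))
levels(pclass)  # the first level is the reference category

# Design matrix used by lm(): an intercept plus one dummy per non-reference level
model.matrix(~ pclass)

# relevel() makes another level the reference, here the third class
pclass3ref <- relevel(pclass, ref = "3")
levels(pclass3ref)
colnames(model.matrix(~ pclass3ref))
```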

Several variables

titanic$embarked <- as.factor(titanic$embarked)
model <- lm(survived ~ pclass + embarked, data = titanic)
summary(model)


Call:
lm(formula = survived ~ pclass + embarked, data = titanic)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6963 -0.3590 -0.2152  0.4475  0.7848 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.00000    0.32405   3.086 0.002072 ** 
pclass2     -0.14055    0.03923  -3.583 0.000352 ***
pclass3     -0.33732    0.03283 -10.276  < 2e-16 ***
embarkedC   -0.30369    0.32559  -0.933 0.351138    
embarkedQ   -0.32438    0.32819  -0.988 0.323141    
embarkedS   -0.44750    0.32539  -1.375 0.169283    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4583 on 1303 degrees of freedom
Multiple R-squared:  0.1145,    Adjusted R-squared:  0.1111 
F-statistic: 33.68 on 5 and 1303 DF,  p-value: < 2.2e-16
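
The coefficient table lists embarkedC, embarkedQ and embarkedS, so a fourth level must be serving as the reference category. A plausible explanation, stated here as an assumption about the data, is that missing ports are stored as empty strings, which as.factor turns into a level of its own. The sketch below shows the effect on a made-up vector and one way to recode empty strings as NA before building the factor.

```r
# Hypothetical values; "" stands for a missing port of embarkation
embarked <- c("S", "C", "", "Q", "S", "")

# The empty string becomes a regular level and sorts first
levels(factor(embarked))

# Recode empty strings as NA, then convert to a factor
embarked[embarked == ""] <- NA
embarked_f <- factor(embarked)
levels(embarked_f)
table(embarked_f, useNA = "ifany")
```

Rows with NA in a model variable are dropped by lm by default, so the reference category would then be the first remaining level.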

Predictions

pred <- predict(model)        # fitted values for the training data
survived_pred <- pred > 0.5   # classify as survived if the fitted value exceeds 0.5
table(titanic$survived, survived_pred)

   survived_pred
    FALSE TRUE
  0   669  140
  1   282  218
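
Summary measures can be computed from this table. The sketch below re-enters the four counts from the output above by hand; the dimension names are chosen for illustration.

```r
# Counts copied from the table above (rows: actual, columns: predicted)
conf <- matrix(c(669, 282, 140, 218), nrow = 2,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("FALSE", "TRUE")))

# Share of correct predictions: (669 + 218) / 1309
accuracy <- sum(diag(conf)) / sum(conf)

# Share of survivors predicted as survivors: 218 / 500
sensitivity <- conf["1", "TRUE"] / sum(conf["1", ])

# Share of non-survivors predicted as non-survivors: 669 / 809
specificity <- conf["0", "FALSE"] / sum(conf["0", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```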

Out of sample predictions

model <- lm(survived ~ pclass + sex, data = titanic)
new <- data.frame(pclass = c("1", "2"), sex = c("male", "male"))
predict(model, newdata = new, interval = "prediction", level = 0.90)

        fit        lwr      upr
1 0.3941135 -0.2571518 1.045379
2 0.2364034 -0.4149712 0.887778
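
The bounds above fall outside the range from 0 to 1 because a linear model does not restrict predictions for a binary outcome to that range. The difference between the two interval types offered by predict can be seen on a numeric outcome. The sketch below uses simulated data; all values and names are assumptions made for illustration.

```r
set.seed(2)
toy <- data.frame(x = runif(50, 0, 10))
toy$y <- 3 + 0.5 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)
new <- data.frame(x = c(2, 8))

# Confidence interval: uncertainty about the mean outcome at the given x
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.90)

# Prediction interval: uncertainty about a single new observation,
# which adds the error variance and is therefore wider
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.90)

conf_int
pred_int
```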
