Introduction to R: Data import and data frames

Malte Bonart

Data import

Table data

  • most of the available data is structured in a table representation
    • rows: observations, cases, subjects
    • columns: variables, features
  • table data can be stored
    • in plain text: .txt, .csv, .tsv, …
    • in a binary format: Excel, Stata, SPSS, …
    • in a relational database: SQL, …

Text table data

  • central function is read.table() to read table data from the hard drive
  • questions to answer before importing table data:
    • has the data a header line?
    • how are values separated (e.g. "," or ";")
    • what’s the encoding (e.g. UTF-8 or latin-1)
  • use the argument stringsAsFactors = FALSE

Binary table data

  • there exists various packages for importing binary table data
    • haven package for Stata, SAS and SPSS files
    • readxl package for Excel files
  • usually one can always export the data to text format and then import it using read.table

Other data formats

  • Web APIs often represent data in json or xml (text-)format
    • packages XML and jsonlite
  • R uses the binary .RData or .rds format to save objects persistently
  • Spatial data is often stored in a GIS file format (shapefile)
    • packages maptools

Data frames

Introduction

  • when importing table data, the natural R representation is a data.frame
  • two-dimensional data structure with rows and columns
  • each column is a vector of some data-type (e.g. numeric, logical, character, …)

Example: creation

myDF <- data.frame(1:10, rep(TRUE, 10), letters[1:10])
myDF
   X1.10 rep.TRUE..10. letters.1.10.
1      1          TRUE             a
2      2          TRUE             b
3      3          TRUE             c
4      4          TRUE             d
5      5          TRUE             e
6      6          TRUE             f
7      7          TRUE             g
8      8          TRUE             h
9      9          TRUE             i
10    10          TRUE             j

Example: data import

titanic <- read.csv("./www/titanic.csv", stringsAsFactors = FALSE)
titanic <- read.table("./www/titanic.csv", header = TRUE, 
                      sep = ",", stringsAsFactors = FALSE)
head(titanic)
  X.1 X pclass survived                                            name
1   1 1      1        1                   Allen, Miss. Elisabeth Walton
2   2 2      1        1                  Allison, Master. Hudson Trevor
3   3 3      1        0                    Allison, Miss. Helen Loraine
4   4 4      1        0            Allison, Mr. Hudson Joshua Creighton
5   5 5      1        0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
6   6 6      1        1                             Anderson, Mr. Harry
     sex     age embarked
1 female 29.0000        S
2   male  0.9167        S
3 female  2.0000        S
4   male 30.0000        S
5 female 25.0000        S
6   male 48.0000        S

Example: summary function

summary(titanic)
      X.1             X            pclass         survived    
 Min.   :   1   Min.   :   1   Min.   :1.000   Min.   :0.000  
 1st Qu.: 328   1st Qu.: 328   1st Qu.:2.000   1st Qu.:0.000  
 Median : 655   Median : 655   Median :3.000   Median :0.000  
 Mean   : 655   Mean   : 655   Mean   :2.295   Mean   :0.382  
 3rd Qu.: 982   3rd Qu.: 982   3rd Qu.:3.000   3rd Qu.:1.000  
 Max.   :1309   Max.   :1309   Max.   :3.000   Max.   :1.000  
                                                              
     name               sex                 age         
 Length:1309        Length:1309        Min.   : 0.1667  
 Class :character   Class :character   1st Qu.:21.0000  
 Mode  :character   Mode  :character   Median :28.0000  
                                       Mean   :29.8811  
                                       3rd Qu.:39.0000  
                                       Max.   :80.0000  
                                       NA's   :263      
   embarked        
 Length:1309       
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Length

  • use the functions nrows(), ncols()
  • length() also gives the number of columns
nrow(titanic)
[1] 1309
ncol(titanic)
[1] 8
length(titanic)
[1] 8

Names

  • function colnames() or names() for the column names
  • function rownames() for the row names (used less frequently)
colnames(titanic)
[1] "X.1"      "X"        "pclass"   "survived" "name"     "sex"     
[7] "age"      "embarked"

Extraction and adding of variables

Column extraction

  • each column of a data.frame is a vector of some data type
  • columns can be extracted using the $ sign
class <- titanic$pclass
class
   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [69] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [103] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [137] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [171] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [239] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [307] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [341] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [375] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [409] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [443] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [477] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [511] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [545] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [579] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3
 [613] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [647] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [681] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [715] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [749] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [783] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [817] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [851] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [885] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [919] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [953] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [987] 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [ reached getOption("max.print") -- omitted 309 entries ]

Column creation

titanic$isOld <- titanic$age > 60
titanic$isOld
   [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
  [12] FALSE FALSE FALSE  TRUE    NA FALSE FALSE FALSE FALSE FALSE FALSE
  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [34] FALSE FALSE FALSE FALSE    NA FALSE FALSE    NA FALSE FALSE FALSE
  [45] FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  [56] FALSE FALSE FALSE FALSE    NA FALSE  TRUE FALSE FALSE FALSE FALSE
  [67] FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE FALSE
  [78] FALSE  TRUE FALSE    NA  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA    NA FALSE
 [111] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE    NA FALSE FALSE
 [122]    NA FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE
 [133] FALSE FALSE    NA  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [144] FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE    NA FALSE
 [155] FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [166] FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [177]    NA FALSE FALSE    NA FALSE FALSE FALSE FALSE    NA FALSE FALSE
 [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE
 [199] FALSE FALSE FALSE FALSE FALSE FALSE    NA  TRUE FALSE FALSE FALSE
 [210] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE    NA
 [221] FALSE  TRUE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [232] FALSE FALSE FALSE FALSE    NA FALSE    NA FALSE FALSE FALSE    NA
 [243] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [254] FALSE    NA FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [265] FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE
 [276] FALSE FALSE    NA FALSE  TRUE FALSE FALSE FALSE    NA  TRUE  TRUE
 [287]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE
 [298]    NA FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [309] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA
 [320] FALSE    NA  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [331] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [342] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [353] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [364]    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [375] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE    NA
 [386] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [408] FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [419] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [430] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [441] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [452] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [463] FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE
 [474]    NA FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE    NA
 [485] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE    NA FALSE FALSE FALSE
 [496]    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [507]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [518] FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE
 [529]    NA FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [540] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [551] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [562] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [573] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE
 [584] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [595]  TRUE    NA FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [606] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [617] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [628] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [639] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [650] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [661] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [672] FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA
 [683]    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [694] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [705] FALSE    NA    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [716] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [727] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [738] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [749] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA FALSE
 [760] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA FALSE
 [771] FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE
 [782] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE
 [793] FALSE FALSE FALSE    NA FALSE FALSE    NA FALSE    NA    NA    NA
 [804] FALSE    NA    NA FALSE FALSE    NA FALSE FALSE FALSE    NA    NA
 [815] FALSE    NA    NA FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE
 [826] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA
 [837] FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE
 [848] FALSE FALSE FALSE FALSE FALSE    NA FALSE    NA FALSE    NA FALSE
 [859]    NA FALSE FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE
 [870] FALSE FALSE    NA    NA FALSE    NA FALSE    NA FALSE FALSE    NA
 [881] FALSE FALSE    NA FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE
 [892] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE    NA    NA
 [903]    NA    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [914] FALSE FALSE FALSE FALSE FALSE    NA FALSE    NA    NA    NA    NA
 [925] FALSE FALSE    NA    NA    NA    NA    NA    NA FALSE FALSE FALSE
 [936] FALSE FALSE FALSE FALSE FALSE    NA FALSE    NA FALSE    NA    NA
 [947]    NA FALSE    NA FALSE FALSE FALSE FALSE FALSE    NA    NA    NA
 [958]    NA    NA FALSE FALSE    NA    NA FALSE FALSE FALSE FALSE FALSE
 [969] FALSE FALSE FALSE    NA FALSE    NA FALSE FALSE    NA FALSE FALSE
 [980] FALSE FALSE FALSE    NA    NA    NA FALSE FALSE    NA    NA    NA
 [991] FALSE    NA FALSE    NA    NA FALSE FALSE    NA    NA    NA
 [ reached getOption("max.print") -- omitted 309 entries ]

Overriding of existing columns

titanic$age <- 12
head(titanic$age)
[1] 12 12 12 12 12 12

Deletion of columns

titanic$age <- NULL
head(titanic)
  X.1 X pclass survived                                            name
1   1 1      1        1                   Allen, Miss. Elisabeth Walton
2   2 2      1        1                  Allison, Master. Hudson Trevor
3   3 3      1        0                    Allison, Miss. Helen Loraine
4   4 4      1        0            Allison, Mr. Hudson Joshua Creighton
5   5 5      1        0 Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
6   6 6      1        1                             Anderson, Mr. Harry
     sex embarked isOld
1 female        S FALSE
2   male        S FALSE
3 female        S FALSE
4   male        S FALSE
5 female        S FALSE
6   male        S FALSE

Frequency and contingency tables

Introduction

  • used to represent the distribution of one or two categorical variables
  • total frequencies
  • relative frequencies (row, column, or absolute percentage)

Example total frequencies

table(titanic$pclass)

  1   2   3 
323 277 709 
table(titanic$sex, titanic$pclass)
        
           1   2   3
  female 144 106 216
  male   179 171 493

Example relative frequencies, row percentage

tab <- table(titanic$sex, titanic$pclass)
tab
        
           1   2   3
  female 144 106 216
  male   179 171 493
prop.table(tab, margin = 1)
        
                 1         2         3
  female 0.3090129 0.2274678 0.4635193
  male   0.2123369 0.2028470 0.5848161

Example relative frequencies, column percentage

tab <- table(titanic$sex, titanic$pclass)
tab
        
           1   2   3
  female 144 106 216
  male   179 171 493
prop.table(tab, margin = 2)
        
                 1         2         3
  female 0.4458204 0.3826715 0.3046544
  male   0.5541796 0.6173285 0.6953456

Example relative frequencies, absolute percentage

tab <- table(titanic$sex, titanic$pclass)
tab
        
           1   2   3
  female 144 106 216
  male   179 171 493
prop.table(tab, margin = NULL)
        
                  1          2          3
  female 0.11000764 0.08097785 0.16501146
  male   0.13674561 0.13063407 0.37662338