Some 99% of the time you will be working with primary or secondary data gathered in the course of a program evaluation. Often these data will have been stored in one or more Excel sheets. The file extension may be *.xlsx or the older *.xls, but I would encourage you to consider saving your data in the *.csv or *.txt formats because these do not mangle the file with needless bells and whistles. Every now and then you may be working with colleagues who lean on SPSS, Stata, or SAS and hence must work with data stored in each software's proprietary format. On the rare occasion you may even have to wrestle with flat ASCII files, those relics from the days of mainframe computing. R will read these with ease.1
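For instance, flat files often store each field in fixed-width character columns, and readr can parse these once you tell it where each column sits. A minimal sketch, assuming a hypothetical file people.txt whose first 10 characters hold a name and the next 2 an age:
library(readr)
# hypothetical fixed-width file: name in characters 1-10, age in 11-12
read_fwf(
"people.txt",
fwf_widths(c(10, 2), c("name", "age"))
) -> people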
Let us see how R reads in data from existing files, starting with the two most common Excel formats. In the commands that follow you will see me using the here library to tell R where the data file is located and the name of the file. This is useful for many reasons, including the fact that whether someone is on Windows versus Mac versus Linux or running RStudio Cloud becomes irrelevant. Otherwise you have to specify the full filepath, and this leads to more headaches than blessings!
The file popn2019.xlsx in the data folder has data on the estimated annual population in 2019 of incorporated places in Ohio. Reading it should be simple with the readxl library:
library(readxl)
library(here)
read_excel(
here("workshops/ropeg/handouts/data", "popn2019.xlsx")
) -> popn.xlsx
names(popn.xlsx) # show me the name of each column
[1] "Geography" "2019"
summary(popn.xlsx) # show me summary statistics
Geography 2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
If instead we wanted the *.xls file:
read_excel(
here("workshops/ropeg/handouts/data", "popn2019.xls")
) -> popn.xls
names(popn.xls) # show me the name of each column
[1] "Geography" "2019"
summary(popn.xls) # show me summary statistics
Geography 2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
The *.csv file can be read with readr as follows:
library(readr)
read_csv(
here("workshops/ropeg/handouts/data", "popn2019.csv")
) -> popn.csv
names(popn.csv) # show me the name of each column
[1] "Geography" "2019"
summary(popn.csv) # show me summary statistics
Geography 2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
And that *.txt file?
read_delim(
here("workshops/ropeg/handouts/data", "popn2019.txt"),
"\t"
) -> popn.txt
names(popn.txt) # show me the name of each column
[1] "Geography" "2019"
summary(popn.txt) # show me summary statistics
Geography 2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
Here the "\t" switch tells R that the columns are tab-delimited.
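The same switch handles other separators; a small sketch, assuming a hypothetical semicolon-delimited copy of the file:
read_delim(
here("workshops/ropeg/handouts/data", "popn2019_semi.txt"),
delim = ";"
) -> popn.semi
readr also offers read_csv2() for the semicolon-delimited, comma-decimal files common in Europe.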
Now we lean on the haven library, but we could have also leaned on a few other libraries.2 Let us read in the three formats (SPSS, Stata, and SAS) …
library(haven)
read_spss(
here("workshops/ropeg/handouts/data", "popn2019.sav")
) -> popn.sav
summary(popn.sav)
Geography v2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
read_stata(
here("workshops/ropeg/handouts/data", "popn2019.dta")
) -> popn.dta
summary(popn.dta)
Geography _2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
read_sas(
here("workshops/ropeg/handouts/data", "popn2019.sas7bdat")
) -> popn.sas
summary(popn.sas)
Geography v2019
Length:930 Min. : 36.0
Class :character 1st Qu.: 446.8
Mode :character Median : 1421.0
Mean : 8176.7
3rd Qu.: 5905.5
Max. :898553.0
Notice that the year column is titled v2019 in the SPSS and SAS files as read and _2019 in the Stata file as read. We will fix this in a little bit, but the lesson here is that it is always good to have column names that do not start with a number; many software suites dislike that and will automatically assume it is a character vector.
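If you wanted to fix such a name right away, one quick option is dplyr's rename(); a minimal sketch, where year2019 is just my choice of new name:
library(dplyr)
rename(popn.sav, year2019 = v2019) -> popn.sav # new_name = old_name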
R will also read files directly from the web, which is a useful facility if you have to keep pulling updated versions of the same data or you do not want to keep local copies of the data file. The file I am going to grab is the master population data file we have been reading in.
library(readxl)
"https://www2.census.gov/programs-surveys/popest/tables/2010-2019/cities/totals/SUB-IP-EST2019-ANNRES-39.xlsx" -> url
"data/SUB-IP-EST2019-ANNRES-39.xlsx" -> destfile
curl::curl_download(url, destfile) # download the file to the data folder
read_excel(destfile, skip = 3, n_max = 930) -> SUB_IP_EST2019_ANNRES_39 # skip the first three rows; read 930 rows of data
names(SUB_IP_EST2019_ANNRES_39)
[1] "...1" "Census" "Estimates Base"
[4] "2010" "2011" "2012"
[7] "2013" "2014" "2015"
[10] "2016" "2017" "2018"
[13] "2019"
Look at the column names. Terrible! Let us fix these with the colnames() command.
colnames(SUB_IP_EST2019_ANNRES_39) = c(
"Geography", "Census_2010", "Estimates_base", "year_2010",
"year_2011", "year_2012", "year_2013", "year_2014",
"year_2015", "year_2016", "year_2017", "year_2018",
"year_2019"
)
names(SUB_IP_EST2019_ANNRES_39)
[1] "Geography" "Census_2010" "Estimates_base"
[4] "year_2010" "year_2011" "year_2012"
[7] "year_2013" "year_2014" "year_2015"
[10] "year_2016" "year_2017" "year_2018"
[13] "year_2019"
The important thing is to get the order of the column names right. That is, if I put "Geography" anywhere but as the name of the first column, my data will get messed up.
You can also rename individual columns by referencing their column positions. For example, you could have done the following to rename only the first column, the second column, and the third column, respectively.
colnames(SUB_IP_EST2019_ANNRES_39)[1] = "geography"
colnames(SUB_IP_EST2019_ANNRES_39)[2] = "census2010"
colnames(SUB_IP_EST2019_ANNRES_39)[3] = "base"
names(SUB_IP_EST2019_ANNRES_39)
[1] "geography" "census2010" "base" "year_2010" "year_2011"
[6] "year_2012" "year_2013" "year_2014" "year_2015" "year_2016"
[11] "year_2017" "year_2018" "year_2019"
If you work with data you know what this entails … making sure there are no missing values where there should be none, adding value labels to categorical variables, making sure the categories are ordinal where they should be ordinal, creating grouped frequency tables for age, income, etc., maybe transforming numerical variables into z-scores, collapsing some categories, making sure strings have the proper case (sentence-case or title-case), and so on. To work through these basic steps I will lean on a small data-set that is often used for teaching statistics – the hsb2 data-set, with 200 observations on each student's scaled scores on five standardized tests, their race/ethnicity, sex, whether they attended a private or public school, and so on.
# read hsb2 directly from the web …
read.table(
'https://stats.idre.ucla.edu/stat/data/hsb2.csv',
header = TRUE,
sep = ","
) -> hsb2
# … or read a local copy
read_csv(here("workshops/ropeg/handouts/data", "hsb2.csv")) -> hsb2
library(tidyverse)
glimpse(hsb2)
Rows: 200
Columns: 11
$ id <dbl> 70, 121, 86, 141, 172, 113, 50, 11, 84, 48, 75, 60,…
$ female <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ race <dbl> 4, 4, 4, 4, 4, 4, 3, 1, 4, 3, 4, 4, 4, 4, 3, 4, 4, …
$ ses <dbl> 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 1, 1, 3, …
$ schtyp <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ prog <dbl> 1, 3, 1, 3, 2, 2, 1, 2, 1, 2, 3, 2, 2, 2, 2, 1, 2, …
$ read <dbl> 57, 68, 44, 63, 47, 44, 50, 34, 63, 57, 60, 57, 73,…
$ write <dbl> 52, 59, 33, 44, 52, 52, 59, 46, 57, 55, 46, 65, 60,…
$ math <dbl> 41, 53, 54, 47, 57, 51, 42, 45, 54, 52, 51, 51, 71,…
$ science <dbl> 47, 63, 58, 53, 53, 63, 53, 39, 58, 50, 53, 63, 61,…
$ socst <dbl> 57, 61, 31, 56, 61, 61, 61, 36, 51, 51, 61, 61, 71,…
Notice the absence of value labels. Luckily we know what these should be on the categorical variables:
female = (0 = male; 1 = female)
race = (1 = hispanic; 2 = asian; 3 = african-american; 4 = white)
ses = socioeconomic status (1 = low; 2 = middle; 3 = high)
schtyp = type of school (1 = public; 2 = private)
prog = type of program (1 = general; 2 = academic; 3 = vocational)
read = standardized reading score
write = standardized writing score
math = standardized math score
science = standardized science score
socst = standardized social studies score
Now we can add these labels to each variable. When we do this we create what R calls factors – categorical variables with value levels (0, 1, 2, etc.) and labels (male, female, etc.).
Rather than overwrite each variable I am going to assign new names to each factor I create by appending .f to each original variable name. This is a good policy to adopt because if something gets messed up you still have the original variables to rework.
Notice the sequence here: factor(dataset$variable, levels, labels). Starting with female:
factor(hsb2$female,
levels = c(0, 1),
labels = c("Male", "Female")
) -> hsb2$female.f
Let us finish the rest:
factor(hsb2$race,
levels = c(1:4),
labels = c("Hispanic", "Asian", "African American", "White")
) -> hsb2$race.f
factor(hsb2$ses,
levels = c(1:3),
labels = c("Low", "Middle", "High")
) -> hsb2$ses.f
factor(hsb2$schtyp,
levels = c(1:2),
labels = c("Public", "Private")
) -> hsb2$schtyp.f
factor(hsb2$prog,
levels = c(1:3),
labels = c("General", "Academic", "Vocational")
) -> hsb2$prog.f
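A quick check that the labels took hold never hurts; levels() lists a factor's labels in order:
levels(hsb2$prog.f) # "General" "Academic" "Vocational"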
Now we could drop the original columns that lack the value labels. How? Well, one way to do it would be to simply list the columns we want by noting and using the column numbers. Run the names(hsb2) command and you will see each column's position.
names(hsb2)
[1] "id" "female" "race" "ses" "schtyp" "prog"
[7] "read" "write" "math" "science" "socst" "female.f"
[13] "race.f" "ses.f" "schtyp.f" "prog.f"
Okay, so I do not want columns 2 through 6 and I want all the rows. I can specify this by listing the rows and columns I want (an empty row index means all rows):
hsb2[, c(1, 7:16)] -> myhsb # keep columns 1 and 7 through 16
names(myhsb)
[1] "id" "read" "write" "math" "science" "socst"
[7] "female.f" "race.f" "ses.f" "schtyp.f" "prog.f"
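If you would rather not count column positions, dplyr's select() can drop a named range of columns instead; a sketch that yields the same myhsb:
library(dplyr)
select(hsb2, -(female:prog)) -> myhsb # drop every column from female through prog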
One of the common tasks with categorical measures tends to be the need to organize the levels of the factor so that they retain their ordinal property. Let us assume there is the following hierarchy (in descending order) of the levels of prog.f: Academic > General > Vocational. We can enforce this ordinal structure as follows:
factor(myhsb$prog.f,
levels = c("Vocational", "General", "Academic"),
ordered = TRUE
) -> myhsb$prog.ordered
This is a reasonable solution when you are basically making frequency tables, cross-tabulations, charts, etc.3
Let us look at some simple frequency tables. Mind you, there are several ways to build these but I will lean on what I think is the easiest for those new to R: the janitor package.
library(janitor)
tabyl(myhsb$prog.ordered) %>%
adorn_totals(where = "row") %>% # append a Total row
adorn_pct_formatting(digits = 0, affix_sign = TRUE) # show percentages with %
myhsb$prog.ordered n percent
Vocational 50 25%
General 45 22%
Academic 105 52%
Total 200 100%
Now say I wanted a simple cross-tabulation of prog.ordered by female.f.
myhsb %>%
tabyl(female.f, prog.ordered) # the cross-tabulation
female.f Vocational General Academic
Male 23 21 47
Female 27 24 58
This is a simple table with no row or column totals or percentages. Let us add row and column totals.
myhsb %>%
tabyl(female.f, prog.ordered) %>% # the cross-tabulation
adorn_totals(where = c("row", "col")) # row & column totals
female.f Vocational General Academic Total
Male 23 21 47 91
Female 27 24 58 109
Total 50 45 105 200
Ideally you will want to show the frequencies and the percentages. Say we go with column percentages.
myhsb %>%
tabyl(female.f, prog.ordered) %>% # the cross-tabulation
adorn_totals(where = c("row", "col")) %>% # row & column totals
adorn_percentages("col") %>% # column percentages
adorn_pct_formatting() %>%
adorn_ns() %>%
adorn_title("combined")
female.f/prog.ordered Vocational General Academic Total
Male 46.0% (23) 46.7% (21) 44.8% (47) 45.5% (91)
Female 54.0% (27) 53.3% (24) 55.2% (58) 54.5% (109)
Total 100.0% (50) 100.0% (45) 100.0% (105) 100.0% (200)
Often you have age, income, standardized test scores, etc. that you want to flip into a grouped frequency table. This is easily done with the santoku package. Now, bear in mind that we can construct the groups in various ways, so let us start with fixed-width intervals with chop_evenly:
library(santoku)
chop_evenly(myhsb$read, intervals = 5) -> myhsb$read.evenly
tabyl(myhsb$read.evenly) %>% # class-width is exactly 9.6
adorn_pct_formatting()
myhsb$read.evenly n percent
[28, 37.6) 14 7.0%
[37.6, 47.2) 68 34.0%
[47.2, 56.8) 48 24.0%
[56.8, 66.4) 50 25.0%
[66.4, 76] 20 10.0%
Maybe you want equal-sized groups with chop_equally?
chop_equally(myhsb$read, groups = 5) -> myhsb$read.equally
tabyl(myhsb$read.equally) %>% # Here you have quintiles
adorn_pct_formatting()
myhsb$read.equally n percent
[0%, 20%) 39 19.5%
[20%, 40%) 16 8.0%
[40%, 60%) 62 31.0%
[60%, 80%) 37 18.5%
[80%, 100%] 46 23.0%
No? You wanted the quartiles instead?
chop_equally(myhsb$read, groups = 4) -> myhsb$read.quartiles
tabyl(myhsb$read.quartiles) %>% # Here you have quartiles
adorn_pct_formatting()
myhsb$read.quartiles n percent
[0%, 25%) 39 19.5%
[25%, 50%) 44 22.0%
[50%, 75%) 61 30.5%
[75%, 100%] 56 28.0%
Perhaps we had certain breaks in mind; for example, a reading score below 40 is “Low”, 40-60 is “Medium”, and above 60 is “High”:
chop(myhsb$read, breaks = c(40, 60)) -> myhsb$read.breaks
tabyl(myhsb$read.breaks) %>% # Here you have specified breaks
adorn_pct_formatting()
myhsb$read.breaks n percent
[28, 40) 22 11.0%
[40, 60) 122 61.0%
[60, 76] 56 28.0%
Note that when you create the groups, you have to be sure of how values are being included if they fall on the boundary. For example, if read = 60, should it be in the 40-60 group or 60-80 group? Turns out the default here is to include it in the 60-80 group.4
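If you would rather have a boundary value land in the lower group, the left = FALSE switch mentioned in the footnote flips the convention; a quick sketch (read.breaks2 is just a scratch name of mine):
# right-closed intervals: a score of exactly 60 now falls in (40, 60]
chop(myhsb$read, breaks = c(40, 60), left = FALSE) -> myhsb$read.breaks2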
You may need to operate on some variables, converting them into z-scores, multiplying them by some number, taking the natural logarithm, and so on. For example, with the math score:
myhsb$math / 2 -> myhsb$math.halved # half the score
myhsb$math^2 -> myhsb$math.squared # the score squared
log(myhsb$math) -> myhsb$math.ln # natural logarithm
sqrt(myhsb$math) -> myhsb$math.sqrt # square root
(myhsb$math - mean(myhsb$math)) / sd(myhsb$math) -> myhsb$math.zscore # z-score
You will often want means, medians, standard deviations, minimum/maximum, maybe quartiles, etc. Again, there are many ways to do this but the easiest might be with the skimr package.
library(skimr)
skim(myhsb)
Name | myhsb
Number of rows | 200
Number of columns | 21
Column type frequency: |
factor | 10
numeric | 11
Group variables | None
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
female.f | 0 | 1 | FALSE | 2 | Fem: 109, Mal: 91 |
race.f | 0 | 1 | FALSE | 4 | Whi: 145, His: 24, Afr: 20, Asi: 11 |
ses.f | 0 | 1 | FALSE | 3 | Mid: 95, Hig: 58, Low: 47 |
schtyp.f | 0 | 1 | FALSE | 2 | Pub: 168, Pri: 32 |
prog.f | 0 | 1 | FALSE | 3 | Aca: 105, Voc: 50, Gen: 45 |
prog.ordered | 0 | 1 | TRUE | 3 | Aca: 105, Voc: 50, Gen: 45 |
read.evenly | 0 | 1 | FALSE | 5 | [37: 68, [56: 50, [47: 48, [66: 20 |
read.equally | 0 | 1 | FALSE | 5 | [40: 62, [80: 46, [0%: 39, [60: 37 |
read.quartiles | 0 | 1 | FALSE | 4 | [50: 61, [75: 56, [25: 44, [0%: 39 |
read.breaks | 0 | 1 | FALSE | 3 | [40: 122, [60: 56, [28: 22 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 100.50 | 57.88 | 1.00 | 50.75 | 100.50 | 150.25 | 200.00 | ▇▇▇▇▇ |
read | 0 | 1 | 52.23 | 10.25 | 28.00 | 44.00 | 50.00 | 60.00 | 76.00 | ▂▇▆▆▂ |
write | 0 | 1 | 52.77 | 9.48 | 31.00 | 45.75 | 54.00 | 60.00 | 67.00 | ▂▅▆▇▇ |
math | 0 | 1 | 52.65 | 9.37 | 33.00 | 45.00 | 52.00 | 59.00 | 75.00 | ▃▆▇▅▂ |
science | 0 | 1 | 51.85 | 9.90 | 26.00 | 44.00 | 53.00 | 58.00 | 74.00 | ▂▅▇▇▂ |
socst | 0 | 1 | 52.41 | 10.74 | 26.00 | 46.00 | 52.00 | 61.00 | 71.00 | ▂▃▇▇▃ |
math.halved | 0 | 1 | 26.32 | 4.68 | 16.50 | 22.50 | 26.00 | 29.50 | 37.50 | ▃▆▇▅▂ |
math.squared | 0 | 1 | 2858.82 | 1016.05 | 1089.00 | 2025.00 | 2704.00 | 3481.00 | 5625.00 | ▆▇▇▃▂ |
math.ln | 0 | 1 | 3.95 | 0.18 | 3.50 | 3.81 | 3.95 | 4.08 | 4.32 | ▁▆▇▇▃ |
math.sqrt | 0 | 1 | 7.23 | 0.64 | 5.74 | 6.71 | 7.21 | 7.68 | 8.66 | ▂▆▇▆▂ |
math.zscore | 0 | 1 | 0.00 | 1.00 | -2.10 | -0.82 | -0.07 | 0.68 | 2.39 | ▃▆▇▅▂ |
Once we have cleaned and otherwise modified our data we usually want to store it in a native R-data format. There are a couple of ways to do this but the simplest is to save it as an *.RData or *.rdata file. I will use the former file-extension (my habit).
save(myhsb, file = here("workshops/ropeg/handouts/data", "myhsb.RData"))
Check your data folder and you will see this file we just saved.
Sometimes you may want to output the data or share it with someone as a csv, xlsx, or SPSS file. No problem.5
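A minimal sketch of each export, leaning on readr and haven again plus the writexl package (my pick for xlsx; openxlsx would work too):
library(writexl)
write_csv(myhsb, here("workshops/ropeg/handouts/data", "myhsb.csv")) # csv via readr
write_xlsx(myhsb, here("workshops/ropeg/handouts/data", "myhsb.xlsx")) # xlsx via writexl
write_sav(myhsb, here("workshops/ropeg/handouts/data", "myhsb.sav")) # SPSS via haven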
When we start working on data visualizations we will see how to load our myhsb.RData file.
In fact, R will also read files directly from the web, and even download and expand compressed (*.zip or *.tar.gz) files.↩︎
Check out the sas7bdat, foreign, and rio libraries if you have trouble with haven.↩︎
But if you need to use an ordinal measure in a regression-type setting, it will not work as expected because R fits polynomial contrasts for an ordered factor.↩︎
By default, intervals are left-closed, i.e., they include their left endpoints. If you want right-closed intervals, set left = FALSE.↩︎
Other file formats are possible as well, including but not limited to Stata and SAS.↩︎