We begin by installing the tools we will need – R and RStudio. Why two tools? Chester Ismay, a prominent R developer, describes it thus:
“R is like the engine of your car, and RStudio is the dashboard of your car.”
If everything installs without a hitch, look for the RStudio shortcut/icon on your machine. Every time you need to use R/RStudio, this is the shortcut/icon you will click.
Go ahead and launch RStudio. The following screenshot is how RStudio looks on my screen at the time of assembling this material.
You see three window panes. Each serves an important purpose so let us look at some core functions nested in these panes. Panes can be detached, which is very helpful when you want another application next to the pane or behind it, or when you are using multiple monitors, since you can then execute commands on one monitor and watch the output on another.
bookdown to write a “book”, and so on
You can customize the panes via Tools -> Global Options...
What are packages, or libraries as they are also called? Well, once again, Chester Ismay describes them thus:
“The latest version of R is like the latest smartphone you bought, and the libraries/packages are like the apps you installed on your smartphone to enhance its functionality.”
Now we install some packages via Tools -> Install Packages...
The initial list of packages to be installed is shown below. Other packages will be installed as needed. Copy-and-paste the following commands into the R Console prompt and press Enter.
my.pkgs <- c("devtools", "reshape2", "tidyverse", "lubridate", "Hmisc",
"gapminder", "leaflet", "DT", "data.table", "htmltools",
"scales", "ggridges", "here", "knitr", "kableExtra", "haven",
"readxl", "ggthemes", "janitor")
install.packages(my.pkgs)
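If you rerun the installation later, you may prefer to install only what is missing. The following is a minimal sketch in base R; the helper name missing_pkgs is hypothetical (it is not part of the workshop material) and the package list is shortened for illustration.

```r
# Hypothetical helper (an assumption for illustration): return the subset of
# `pkgs` that is not yet installed in the current library paths.
missing_pkgs <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

my.pkgs <- c("devtools", "reshape2", "tidyverse")  # shortened list
to_install <- missing_pkgs(my.pkgs)
to_install  # names of packages that still need installing, possibly none

# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that running the block only reports what is missing.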
It is a good idea to update packages regularly. Every now and then an update might break something, but the developer usually fixes it sooner rather than later. You can update packages via Tools -> Check for Package Updates...
Create a folder called ouir and, inside it, a sub-folder called data. The folder structure will now be as shown below
ouir/
├── session01.Rmd
├── session02.Rmd
├── session03.Rmd
└── data/
    ├── some data file
    └── another data file
All data you download or create go into the data folder. All R code files reside in the ouir folder. Open the Rmd file I sent you: ouir-day01.Rmd and save it in the ouir folder. Save the data I sent you in the data folder.
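The same folder layout can also be created from the R Console. This is only a sketch: the helper name make_workshop_dirs is hypothetical, and the example builds the folders in a temporary location, so replace demo_root with the real path where you want ouir to live.

```r
# Hypothetical helper (an assumption for illustration): create <root>/data
# if needed and report which of the two folders exist afterwards.
make_workshop_dirs <- function(root) {
  data_dir <- file.path(root, "data")
  dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
  c(root = dir.exists(root), data = dir.exists(data_dir))
}

# Illustration in a temporary location; substitute your own path.
demo_root <- file.path(tempdir(), "ouir")
make_workshop_dirs(demo_root)
```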
Now create a project via File -> New Project and choose Existing Directory. Browse to the ouir folder and click Create Project. RStudio will restart and when it does you will be in the project folder and will see a file called ouir.Rproj
RStudio projects make it straightforward to divide your work into multiple contexts, each with their own working directory, workspace, history, and source documents. There are several options that can be set on a per-project basis to customize the behavior of RStudio. You can edit these options using the Project Options command on the Project menu. From now on, when you start an RStudio session and want to work on the materials developed for this workshop, do so by clicking the ouir.Rproj file or icon. Trust me, this makes working with R/RStudio a lot easier even for advanced R users.
Of course, if you are going to work on something else, for work perhaps, you should create a folder, name it, create the data sub-folder, and then create a project, much as we did here, and use that *.Rproj file. I create projects for everything of any consequence.
Let us shut down RStudio now. As you do this, if you are asked whether you want to save the workspace, etc., always say “No”; otherwise you will end up with a very cluttered Environment and your machine will slow down as well.
Create a new R Markdown file via New File -> R Markdown ... and enter My First Rmd File as the title and your name as the author. Click OK. Then use File -> Save As.. to save it as testing_rmd in the ouir folder and click the Knit to html button.
You may see a message that says some packages need to be installed/updated. Allow these to be installed/updated.
If all goes well, and the document knits, you should see an html file that has some code, a plot and other results. As the document knits, watch for error messages and copy these verbatim since we can hunt for solutions if we know the error message word for word.
If you need to create PDF documents, then you will need a working LaTeX setup on your machine. There are other ways to set up a LaTeX system but the easiest might be to run the following code:
install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()
Now restart RStudio and this time try to knit to PDF, and then shut down RStudio once again.
Golden Rule: Give every code chunk a unique name, which can be an alphanumeric string with no whitespace. If you forget, use the namer package to assign names to every code chunk that lacks one. This can be done via
library(namer)
name_chunks("myfilename.Rmd")
You will see the code chunks have several options that could be invoked. Here are some of the more common ones we will use.
Other options can be found in the cheatsheet available here.
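As a rough illustration of how chunk options behave, they can be set once for the whole document with knitr::opts_chunk$set(), typically in a first "setup" chunk. The selection and values below are assumptions chosen for illustration, not necessarily the options used in this workshop.

```r
library(knitr)

# Assumed defaults for illustration; adjust to taste.
opts_chunk$set(
  echo       = TRUE,   # show the code in the knitted document
  eval       = TRUE,   # run the code
  warning    = FALSE,  # hide warnings
  message    = FALSE,  # hide messages (e.g., from library() calls)
  fig.width  = 6,      # figure width in inches
  fig.height = 4       # figure height in inches
)
```

Any option set this way can still be overridden in an individual chunk header.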
Data will generally mirror one of the following types: integer, numeric/double, character, logical, date, or factor
library(tibble)
library(lubridate)
tibble( # tibble() replaces the deprecated data_frame()
variable1 = c(1L, 2L, 3L, 4L),
variable2 = c(2.1, 3.4, 5.6, 7.8),
variable4 = c("Low", "Medium", "High", "Missing"),
variable5 = c(TRUE, FALSE, FALSE, TRUE),
variable6 = ymd(c("2017-05-23", "1776/07/04",
"1983-05/31", "1908/04-01")),
variable7 = as.factor(c("Male", "Female", "Trans", "Trans"))
)
## # A tibble: 4 x 6
##   variable1 variable2 variable4 variable5 variable6  variable7
##       <int>     <dbl> <chr>     <lgl>     <date>     <fct>
## 1         1       2.1 Low       TRUE      2017-05-23 Male
## 2         2       3.4 Medium    FALSE     1776-07-04 Female
## 3         3       5.6 High      FALSE     1983-05-31 Trans
## 4         4       7.8 Missing   TRUE      1908-04-01 Trans
Check out the lubridate package if you need to work with dates and time intervals. A date variable has a very specific meaning for R; the data point must reflect a year, a month, and a day before it is deemed a valid date format.
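The year-month-day requirement can be seen with a small sketch. It uses base R's as.Date() rather than lubridate, purely to keep the example free of extra packages; the object names are made up for illustration.

```r
# A complete year-month-day string converts to a Date.
full_date <- as.Date("2017-05-23")
class(full_date)

# With the day missing, the string does not match the format,
# so the conversion yields a missing value.
incomplete <- as.Date("2017-05", format = "%Y-%m-%d")
is.na(incomplete)
```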
Make sure you have the following data-sets in the data folder. If you don’t, then the commands that follow will not work. We start by reading a simple comma-separated values (CSV) format file and then a tab-delimited format file.
library(here) # loaded once per session
df.csv <- read.csv("data/ImportDataCSV.csv", sep=",", header=TRUE) # note sep = ","
df.tab <- read.csv("data/ImportDataTAB.txt", sep="\t", header=TRUE) # note sep = "\t"
If the files were read then the Environment should show objects called df.csv and df.tab. If you don’t see these then run through the following checklist:
- Are the csv/txt files in your data folder?
- Is the folder correctly named (no blank spaces before or after, all lowercase, etc)?
- Is the data folder inside the ouir folder?
- Are you in the ouir.Rproj?
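Parts of this checklist can be run from the Console. The sketch below is one possible way to do it; check_data_files is a hypothetical helper name, and the file names are the ones used in the read.csv() calls above.

```r
# Hypothetical helper (an assumption for illustration): for each expected
# file, report whether it exists inside data_dir.
check_data_files <- function(files, data_dir = "data") {
  paths <- file.path(data_dir, files)
  found <- file.exists(paths)
  names(found) <- files
  found
}

getwd()  # the working directory; it should be the ouir folder when the project is open
check_data_files(c("ImportDataCSV.csv", "ImportDataTAB.txt"))
```

A FALSE next to a file name means R cannot see that file from the current working directory.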
Excel files can be read via the readxl package.
library(readxl)
df.xls <- read_excel("data/ImportDataXLS.xls")
df.xlsx <- read_excel("data/ImportDataXLSX.xlsx")
SAS, SPSS, Stata files can be read with the haven package.
library(haven)
df.stata <- read_stata("data/ImportDataStata.dta")
df.sas <- read_sas("data/ImportDataSAS.sas7bdat")
df.spss <- read_sav("data/ImportDataSPSS.sav")
Fixed-width files: It is also common to encounter fixed-width files where the raw data are stored without any gaps between successive variables. However, these files will come with documentation that will tell you where each variable starts and ends, along with other details about each variable.
df.fw <- read.fwf("data/fwfdata.txt", widths = c(4, 9, 2, 4), header = FALSE,
col.names = c("Name", "Month", "Day", "Year"))
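To try read.fwf() without the workshop file, here is a self-contained sketch that writes two made-up records to a temporary file and reads them back with the same widths. The records, the object names, and the use of strip.white = TRUE (passed through to read.table() to trim padding) are assumptions for illustration.

```r
# Two made-up records laid out as 4 + 9 + 2 + 4 characters per line.
fw_lines <- c("JohnSeptember151990",
              "MaryJanuary  031985")
fw_file <- tempfile(fileext = ".txt")
writeLines(fw_lines, fw_file)

demo.fw <- read.fwf(fw_file, widths = c(4, 9, 2, 4), header = FALSE,
                    col.names = c("Name", "Month", "Day", "Year"),
                    strip.white = TRUE)  # drop the padding around short values
demo.fw
unlink(fw_file)  # remove the temporary file
```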
Notice we need widths = c() and col.names = c()
widths specifies how many slots each variable/field occupies
col.names indicates the names to be assigned to each variable/field
It is possible to specify the full web-path for a file and read it in, rather than storing a local copy. This is often useful when the data are updated by the source (Census Bureau, Bureau of Labor, Bureau of Economic Analysis, etc.)
fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
test <- read.table("https://stats.idre.ucla.edu/stat/data/test.txt",
header = TRUE)
test.csv <- read.csv("https://stats.idre.ucla.edu/stat/data/test.csv",
header = TRUE)
library(foreign)
hsb2.spss <- read.spss("https://stats.idre.ucla.edu/stat/data/hsb2.sav")
df.hsb2.spss <- as.data.frame(hsb2.spss)
hsb2.spss was read with the foreign package, an alternative to haven
foreign calls read.spss while haven calls read_spss
The foreign package will also read Stata and other formats. I end up defaulting to haven now. There are other packages for reading SPSS, SAS, etc. files … sas7bdat, rio, data.table, xlsx, XLConnect, gdata and others.
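temp <- tempfile()
# build the URL with paste0() so that no line break ends up inside the string
zip.url <- paste0("ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/",
                  "Datasets/NVSS/bridgepop/2016/pcen_v2016_y1016.sas7bdat.zip")
download.file(zip.url, temp, mode = "wb") # "wb" keeps the zip file intact on Windows
oursasdata <- haven::read_sas(unz(temp, "pcen_v2016_y1016.sas7bdat"))
unlink(temp) # delete the temporary file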
You can save your data in a format that R will recognize, giving it the RData or rdata extension
save(oursasdata, file = "data/oursasdata.RData")
save(oursasdata, file = "data/oursasdata.rdata")
Check your data directory to confirm both files are present.
Working with the hsb2 data: 200 students from the High School and Beyond study
hsb2 <- read.table('https://stats.idre.ucla.edu/stat/data/hsb2.csv',
header = TRUE, sep = ",")
female = (0/1)
race = (1=hispanic 2=asian 3=african-amer 4=white)
ses = socioeconomic status (1=low 2=middle 3=high)
schtyp = type of school (1=public 2=private)
prog = type of program (1=general 2=academic 3=vocational)
read = standardized reading score
write = standardized writing score
math = standardized math score
science = standardized science score
socst = standardized social studies score
There are no label values for the qualitative/categorical variables (female, race, ses, schtyp, and prog) so we create these (with base R commands).
hsb2$female.f <- factor(hsb2$female,
levels = c(0, 1),
labels = c("Male", "Female"))
hsb2$race.f <- factor(hsb2$race,
levels = c(1:4),
labels = c("Hispanic", "Asian", "African American", "White"))
hsb2$ses.f <- factor(hsb2$ses,
levels = c(1:3),
labels = c("Low", "Middle", "High"))
hsb2$schtyp.f <- factor(hsb2$schtyp,
levels = c(1:2),
labels = c("Public", "Private"))
hsb2$prog.f <- factor(hsb2$prog,
levels = c(1:3),
labels = c("General", "Academic", "Vocational"))
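A quick way to confirm that codes and labels line up is to cross-tabulate them. The sketch below uses a short made-up vector standing in for hsb2$ses, so it runs without downloading the data.

```r
# Made-up codes standing in for hsb2$ses.
ses <- c(1, 3, 2, 2, 1)
ses.f <- factor(ses, levels = c(1:3), labels = c("Low", "Middle", "High"))

levels(ses.f)      # the labels, in level order
table(ses, ses.f)  # cross-tabulate codes against labels to verify the mapping
```

Each code should fall in exactly one label column; anything else points to a mismatch between levels and labels.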
Having added labels to the factors in hsb2 we can now save the data for later use.
save(hsb2, file = "data/hsb2.RData")
You save your data via save(dataname, file = "filepath/filename.RData") or save(dataname, file = "filepath/filename.rdata")
data(mtcars)
save(mtcars, file = "data/mtcars.RData")
rm(list = ls()) # To clear the Environment
load("data/mtcars.RData")
You can also save multiple data files as follows:
data(mtcars)
library(ggplot2)
data(diamonds)
save(mtcars, diamonds, file = "data/mydata.RData")
rm(list = ls()) # To clear the Environment
load("data/mydata.RData")
If you want to save just a single object from the environment and then load it in a later session, maybe with a different name, then you should use saveRDS() and readRDS()
data(mtcars)
saveRDS(mtcars, file = "data/mydata.RDS")
rm(list = ls()) # To clear the Environment
ourdata = readRDS("data/mydata.RDS")
If instead you did the following, load() restores the object under its original name (mtcars), and ourdata holds only the character string "mtcars" rather than the data
data(mtcars)
save(mtcars, file = "data/mtcars.RData")
rm(list = ls()) # To clear the Environment
ourdata = load("data/mtcars.RData") # Note ourdata is listed as "mtcars"
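The difference between the two approaches can be checked side by side. This sketch writes to temporary files rather than the data folder; the object names from_rds and from_load are made up for illustration.

```r
rds_file   <- tempfile(fileext = ".RDS")
rdata_file <- tempfile(fileext = ".RData")

data(mtcars)
saveRDS(mtcars, file = rds_file)
save(mtcars, file = rdata_file)
rm(mtcars)

from_rds  <- readRDS(rds_file)  # the data frame itself, under a name we choose
from_load <- load(rdata_file)   # restores `mtcars`; returns the names it loaded

is.data.frame(from_rds)  # TRUE
from_load                # "mtcars"
```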
If you want to save everything you have done in the work session you can do so via save.image()
save.image(file = "mywork_jan182018.RData")
summary(dataname) will give you a snapshot of your data
glimpse(dataname) does the same if you are using the tidyverse library
dim(dataname) will give you the dimensions of the data frame
str(dataname) will give you the structure of the data frame … each variable’s type and other details
names(dataname) will give you the names of all columns as well as the column position (i.e., number)
head(dataname, x) will give you the first \(x\) rows of the data frame
tail(dataname, x) will give you the last \(x\) rows of the data frame
clean_names(dataname) from the janitor package will clean up messy column names (i.e., ensuring that all column names are lowercase and have no blank spaces, etc)
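As a brief illustration, several of these commands can be tried on mtcars, a data set that ships with R. The choice of mtcars and of 3 rows is only for illustration.

```r
data(mtcars)

dim(mtcars)      # number of rows and columns
names(mtcars)    # column names, printed with their positions
head(mtcars, 3)  # first 3 rows
tail(mtcars, 3)  # last 3 rows
str(mtcars)      # type of each column
```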
mean(varname, na.rm = TRUE) will give you the mean of a variable
median(varname, na.rm = TRUE) will give you the median of a variable
sd(varname, na.rm = TRUE) will give you the standard deviation of a variable
var(varname, na.rm = TRUE) will give you the variance of a variable
min(varname, na.rm = TRUE) will give you the minimum of a variable
max(varname, na.rm = TRUE) will give you the maximum of a variable
quantile(varname, probs = c(0.25, 0.75), na.rm = TRUE) will give you the first and third quartiles of a variable
scale(varname) will give you the z-scores of a variable (scale() has no na.rm argument; missing values are ignored when the mean and standard deviation are computed and stay missing in the result)
Note that na.rm = TRUE drops all cases with missing values before calculating quantities of interest. If you forget this switch you will get NA or, worse, see an error message (as with quantile()).
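The effect of missing values is easiest to see on a tiny made-up vector. In this sketch scale() is called without na.rm because it does not take that argument, and as.numeric() is used only to turn its one-column matrix result into a plain vector.

```r
x <- c(2, 4, 6, NA)

mean(x)                # NA, because of the missing value
mean(x, na.rm = TRUE)  # 4
sd(x, na.rm = TRUE)    # 2
quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)

z <- as.numeric(scale(x))  # scale() returns a one-column matrix
z                          # -1, 0, 1, NA
```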