The data-sets we want are for Egypt, and two in particular – (1) the 2014 DHS-VI (Standard DHS), and (2) the 2015 DHS-VII (Special). These can be downloaded in two ways, either by logging in and manually selecting and downloading the data-set you want and in the format you want (flat/SPSS/SAS/Stata), or then by using the lodown
package. The lodown
route is the more efficient way to go, and the one I take below.
library(lodown)
dhs_cat <- get_catalog("dhs",
output_dir = "data/DHS",
your_email = "ruhil@ohio.edu",
your_password = "eMg-H5G-VAr-d7H",
your_project = "Fullbright Junior Scholar Training"
)
dhs_cat14 <- subset(dhs_cat, country == 'Egypt' & year == 2014)
lodown("dhs", dhs_cat14, output_dir = "data/DHS",
your_email = "ruhil@ohio.edu",
your_password = "eMg-H5G-VAr-d7H",
your_project = "Fullbright Junior Scholar Training")
dhs_cat15 <- subset(dhs_cat, country == 'Egypt' & year == 2015)
lodown("dhs", dhs_cat15, output_dir = "data/DHS",
your_email = "ruhil@ohio.edu",
your_password = "eMg-H5G-VAr-d7H",
your_project = "Fullbright Junior Scholar Training")
If you look inside the data/DHS
folder you should see a few data-sets with the *.rds
extension. You will have to choose which one of these you want to work with, for example, the household file for 2014 is called EGHR61FL.rds
.
All of this assumes, of course, that you have gone over the recode manuals and maps for each of these survey years. If you have not done that prepwork, then you should do so now. The links to the recode manuals and recode maps are given below:
You will also want to have handy a copy of the Final Report for the country. The 2014 Final Report (English) is available here and the 2015 Final Report (English) is available here
Let us read in a specific data-set, the 2014 DHS file for women.
[1] 21762 4095
We have 21,762 observations (rows) and some 4095 variables (columns). All the more reason to really know what variables are of interest to us.
Say I am interested in the age distribution. This happens to be variable v013
but the responses have to be labeled with the age-groups each represents. This can be done as shown below, and note that I am creating a new variable called agecat
rather than overwriting the original variable.
library(sjmisc)
dhs.df14$agecat <- to_label(dhs.df14$v013) # Labeling the values
table(dhs.df14$agecat) # Check the frequencies
15-19 20-24 25-29 30-34 35-39 40-44 45-49
738 3051 4718 4133 3473 2902 2747
Let us also create a labeled variable that flags the woman’s educatonal attainment.
dhs.df14$educat <- to_label(dhs.df14$v149) # Labeling the values
table(dhs.df14$educat) # Check the frequencies
no education incomplete primary complete primary
4861 1239 940
incomplete secondary complete secondary higher
2935 8595 3192
How does educational attainment vary by age?
table(dhs.df14$agecat, dhs.df14$educat)
no education incomplete primary complete primary
15-19 70 65 46
20-24 330 99 110
25-29 705 149 125
30-34 806 234 229
35-39 841 213 209
40-44 958 219 98
45-49 1151 260 123
incomplete secondary complete secondary higher
15-19 364 185 8
20-24 606 1515 391
25-29 590 2186 963
30-34 422 1710 732
35-39 422 1273 515
40-44 356 965 306
45-49 175 761 277
Maybe you don’t want to use these expansive educational attainment categories and instead rely on fewer categories. One such recoding is shown below.
library(tidyverse)
dhs.df14 <- dhs.df14 %>%
mutate(
educat2 = case_when(
v149 == 0 ~ "No education",
v149 == 1 ~ "Some primary",
v149 == 2 ~ "Primary Complete/Some secondary",
v149 == 3 ~ "Primary Complete/Some secondary",
v149 == 4 ~ "Secondary complete/higher",
v149 == 5 ~ "Secondary complete/higher"),
educat2 = ordered(educat2, levels = c("No education", "Some primary", "Primary Complete/Some secondary", "Secondary complete/higher"))
)
Let us generate a few more variables, and clean them up a bit.
dhs.df14$urban.rural <- to_label(dhs.df14$v025) # urban-vs-rural residence
dhs.df14$wealthq <- to_label(dhs.df14$v190) # wealth quintiles
dhs.df14$governorates <- to_label(dhs.df14$sgov) # one measure of governorates
table(dhs.df14$governorates)
urban governorates cairo alexandria
0 1189 737
port said suez lower egypt
800 941 0
damietta dakahlia sharkia
986 955 1011
kalyubia kafr el-sheikh gharbia
850 945 835
menoufia behera ismailia
855 1088 859
upper egypt giza beni suef
0 1076 875
fayoum menya assuit
843 858 965
souhag qena aswan
913 1055 886
luxor frontier governorates red sea
905 0 387
new valley matroh
443 505
dhs.df14$governorates <- stringr::str_to_title(dhs.df14$governorates) # changing labels from lowercase to titlecase
dhs.df14$governorates <- factor(dhs.df14$governorates) # Making sure this variable has a unique numeric code for each unique value
dhs.df14$governorates <- droplevels(dhs.df14$governorates) # Drop levels with 0 frequencies
table(dhs.df14$governorates) # Checking frequencies
Alexandria Assuit Aswan Behera
737 965 886 1088
Beni Suef Cairo Dakahlia Damietta
875 1189 955 986
Fayoum Gharbia Giza Ismailia
843 835 1076 859
Kafr El-Sheikh Kalyubia Luxor Matroh
945 850 905 505
Menoufia Menya New Valley Port Said
855 858 443 800
Qena Red Sea Sharkia Souhag
1055 387 1011 913
Suez
941
dhs.df14$nchild <- dhs.df14$v201 # Total number of children ever born. If there are fewer than twenty births then this is the same as V224 (Number of entries in the birth history), but if there are more than twenty births then this gives the full number, while V224 will be 20
dhs.df14$currently.working <- to_label(dhs.df14$v014) # Whether the respondent is currently working
dhs.df14$region <- factor(
dhs.df14$sgov2,
levels = c(0, 1, 4, 7),
labels = c("Urban Governorates", "Lower Egypt", "Upper Egypt", "Frontier Governorates")
)
The DHS surveys have a complex survey design that is necessary to allow your survey estimates to be generalized to specific entities. This is what the documentation says:
All ever-married women age 15-49 who were usual members of the selected households and those who spent the night before the survey in the selected households were eligible to be interviewed in the survey. The sample for the 2014 EDHS was designed to provide estimates of population and health indicators including fertility and mortality rates for the country as a whole and for six major subdivisions (Urban Governorates, urban Lower Egypt, rural Lower Egypt, urban Upper Egypt, rural Upper Egypt, and the Frontier Governorates).
… Due to the non-proportional allocation of the 2014 EDHS sample to different governorates and to urban and rural areas as well as to the differences in response rates, sampling weights are required for any analysis using the survey data to ensure the actual representativeness of the survey results at national level as well as at the domain level.
Because we have to use sampling weights, we will first have to identify and define the elements that will go into the weighting. These elements will be the primary sampling unit (psu)
which is measured by v021
, the strata
indicator, measured by v022
, and the weight
assigned to each woman respondent measured by v005
.
dhs.df14$weight <- dhs.df14$v005
dhs.df14$psu <- dhs.df14$v021
dhs.df14$strata <- dhs.df14$v022
Now we tell R how to use these elements to weight the data. This is done with the srvyr
package. Note that before I specify the survey design I am retaining only a few variables for weighting.
library(survey)
library(srvyr)
dhs_egypt14 <- dhs.df14 %>%
select(
c("psu", "strata", "weight", "educat", "educat2",
"wealthq", "agecat", "urban.rural", "nchild",
"governorates", "region", "currently.working")
)
dhs14_design <- svydesign(~psu,
strata = ~strata,
weights = ~weight,
data = dhs_egypt14)
Now R knows how to weight any analysis we carry out. Let us start with a simple example. I want some weighted estimates, just so I can be confident that our code is working. One way to check this would be to look at the estimates in Table 3.1 (page 28) of the DHS 2014 Final Report.
tab.1 <- svymean(~ agecat, dhs14_design, na.rm = TRUE)
tab.1 <- as.data.frame(tab.1) * 100
tab.1 <- round(tab.1, digits = 1)
tab.1
mean SE
agecat15-19 3.5 0.2
agecat20-24 14.0 0.3
agecat25-29 21.8 0.3
agecat30-34 19.0 0.3
agecat35-39 16.1 0.3
agecat40-44 13.2 0.3
agecat45-49 12.4 0.3
tab.2 <- svymean(~ urban.rural, dhs14_design, na.rm = TRUE)
tab.2 <- as.data.frame(tab.2) * 100
tab.2 <- round(tab.2, digits = 1)
tab.2
mean SE
urban.ruralurban 35 0.8
urban.ruralrural 65 0.8
tab.3 <- svymean(~ region, dhs14_design, na.rm = TRUE)
tab.3 <- as.data.frame(tab.3) * 100
tab.3 <- round(tab.3, digits = 1)
tab.3
mean SE
regionUrban Governorates 12.7 0.7
regionLower Egypt 49.0 0.8
regionUpper Egypt 37.4 0.8
regionFrontier Governorates 0.9 0.1
tab.4 <- svymean(~ educat2, dhs14_design, na.rm = TRUE)
tab.4 <- as.data.frame(tab.4) * 100
tab.4 <- round(tab.4, digits = 1)
tab.4
mean SE
educat2No education 24.0 0.6
educat2Some primary 6.1 0.2
educat2Primary Complete/Some secondary 17.4 0.4
educat2Secondary complete/higher 52.4 0.8
tab.5 <- svymean(~ wealthq, dhs14_design, na.rm = TRUE)
tab.5 <- as.data.frame(tab.5) * 100
tab.5 <- round(tab.5, digits = 1)
tab.5
mean SE
wealthqpoorest 17.9 0.7
wealthqpoorer 19.7 0.5
wealthqmiddle 22.2 0.6
wealthqricher 20.9 0.8
wealthqrichest 19.4 0.7
These were weighted relative frequency distributions for a single variable. How about checking a few against Table 3.2? Here we have the fuller educational attainment variable (educat)
broken down by age, urban-rural, and so on. Let us do it by agecat
and wealthq
.
tab.6 <- svyby(~ educat, ~urban.rural, dhs14_design, svymean, na.rm = TRUE)
tab.6 <- tab.6[, 2:7] * 100
tab.6 <- round(tab.6, digits = 1)
tab.6
educatno education educatincomplete primary
urban 13.8 4.6
rural 29.6 7.0
educatcomplete primary educatincomplete secondary
urban 4.6 12.2
rural 3.9 13.9
educatcomplete secondary educathigher
urban 42.1 22.7
rural 36.6 9.1
tab.7 <- svyby(~ educat, ~wealthq, dhs14_design, svymean, na.rm = TRUE)
tab.7 <- tab.7[, 2:7] * 100
tab.7 <- round(tab.7, digits = 1)
tab.7
educatno education educatincomplete primary
poorest 47.7 8.6
poorer 37.4 8.9
middle 18.9 6.1
richer 14.0 5.1
richest 5.4 2.3
educatcomplete primary educatincomplete secondary
poorest 4.3 13.8
poorer 4.9 14.8
middle 3.7 14.6
richer 5.0 13.8
richest 2.8 9.4
educatcomplete secondary educathigher
poorest 22.7 3.0
poorer 30.4 3.6
middle 46.5 10.2
richer 44.8 17.3
richest 45.3 34.8
You notice how the row names and column names are, aesthetically speaking, less than desirable. For example, I would rather not see “wealthqpoorest” but rather see “Poorest”. Ditto for “Percent” and “Std. Error” in place of “mean” and “SE”. How can we do that?
Here we are mutating tab.5
by first making a column with the name of each row with names = rownames(.)
. We are then creating a new column called Quintile
after replacing the string “wealthq” with nothing with Quintile = gsub("wealthq", "", names)
. But the entries in Quintile
are all lowercase, so we end up making them titlecase with the code below.
tab.5$Quintile <- stringr::str_to_title(tab.5$Quintile)
We are almost done. What remains is to (a) make the Quintile column the first column, and then to change the names of the “mean” and “SE” columns. We do this as shown below:
tab.5 <- tab.5 %>%
select("Quintile", "mean", "SE") %>% # Choosing and ordering the columns we want to keep
rename("Percent" = mean,
"Std. Error" = SE)
tab.5
Quintile Percent Std. Error
1 Poorest 17.9 0.7
2 Poorer 19.7 0.5
3 Middle 22.2 0.6
4 Richer 20.9 0.8
5 Richest 19.4 0.7
Let us do another one, just for practice – tab.7
colnames(tab.7) <- c("No Education", "Incomplete Primary", "Completed Primary", "Incomplete Secondary", "Complete Secondary", "Higher")
tab.7 <- tab.7 %>%
mutate(
Quintile = row.names(.)
)
tab.7$Quintile <- stringr::str_to_title(tab.7$Quintile)
tab.7 <- tab.7 %>%
select("Quintile", "No Education", "Incomplete Primary", "Completed Primary", "Incomplete Secondary", "Complete Secondary", "Higher")
tab.7
Quintile No Education Incomplete Primary Completed Primary
1 Poorest 47.7 8.6 4.3
2 Poorer 37.4 8.9 4.9
3 Middle 18.9 6.1 3.7
4 Richer 14.0 5.1 5.0
5 Richest 5.4 2.3 2.8
Incomplete Secondary Complete Secondary Higher
1 13.8 22.7 3.0
2 14.8 30.4 3.6
3 14.6 46.5 10.2
4 13.8 44.8 17.3
5 9.4 45.3 34.8
Before we go any farther let us save our survey design data.