# load packages
library(tidyverse)
If you are present for the live session on Monday October 21st, please click here to take the survey.
All of the raw materials, including the R code and data, are available in the GitHub repository.
We will be using the tidyverse and validate packages in this tutorial. You may need to install them if you do not have them already. To install, run the following command in your R console: install.packages(c("tidyverse", "validate")). Note that the tidyverse package is large and may take a few minutes to install.
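If you prefer to install only what is missing, here is a minimal sketch (the two package names are the only assumption):

# Install tidyverse and validate only if they are not already present
pkgs <- c("tidyverse", "validate")
missing_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)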
Code from the live session is available on GitHub.
Introduction
What do we mean by data wrangling?
The Cambridge dictionary gives several meanings to the verb wrangle:
to argue with someone about something, especially for a long time
to take care of, control, or move animals, especially large animals such as cows or horses (mainly American English)
to move a person or thing somewhere, usually with difficulty or using force
to take care of or deal with something, usually when this is difficult
Data Wrangling
By data wrangling, we mean here the process of checking and correcting the quality and integrity of data relevant to malaria modeling, prior to any further analysis. This is also known as data validation.
Data validation involves checking various aspects of your dataset, such as missing values, data types, outliers, and adherence to specific rules or constraints.
Validating our data helps maintain its quality and integrity, ensuring that any analyses or decisions made based on the data are robust and reliable.
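As a first-pass sketch of such checks, using base R functions (on the data_cases dataset we load below):

# Quick checks: missing values, data types, value ranges
colSums(is.na(data_cases))   # missing values per column
str(data_cases)              # data type of each column
summary(data_cases)          # value ranges, to spot candidate outliers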
Why Validate Data?
Ensure Data Integrity: Validating data helps identify and rectify errors, ensuring the integrity of the dataset.
Improve Analysis Accuracy: Clean and validated data leads to more accurate analysis and modeling results.
Compliance and Standards: Data validation ensures that the data conforms to predefined rules, standards, or regulatory requirements.
Error Prevention: Early detection of errors can prevent downstream issues and save time in troubleshooting.
Getting Started
Before you begin, you might want to create a new project in RStudio. This can be done by clicking on the “New Project” button in the upper right corner of the RStudio window. You can then name the project and choose a directory to save it in.
Next, we will load the tidyverse package. This package provides a set of useful functions for data manipulation and visualization. We will use the ggplot2 package to create plots in the later sections of this tutorial.
Next, let’s download the two example datasets we will use in this tutorial. These are available in the AMMnet Hackathon GitHub repository.
I suggest creating a data folder inside your R project; then we can download the two example datasets so that they are saved to your computer.
# Create a data folder
dir.create("data")
# Download example data
<- "https://raw.githubusercontent.com/AMMnet/AMMnet-Hackathon/main/02_data-wrangle/data/"
url
download.file(paste0(url, "mockdata_cases1.csv"), destfile = "data/mockdata_cases1.csv")
download.file(paste0(url, "mosq_mock1.csv"), destfile = "data/mosq_mock1.csv")
# Load example data
data_cases <- read_csv("data/mockdata_cases1.csv")
mosq_data <- read_csv("data/mosq_mock1.csv")
The two datasets we will use are mockdata_cases1.csv and mosq_mock1.csv, mock example datasets that should be similar to malaria case surveillance and mosquito field collection data, respectively. In the following sections we will use them to introduce concepts of data cleaning and characterization in R.
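Before hunting for errors, it can help to glance at the structure of each dataset, for example with dplyr::glimpse():

# First look at the structure of both datasets
dplyr::glimpse(data_cases)
dplyr::glimpse(mosq_data)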
1. Check the data for potential errors
Prevalence is a fraction defined in [0,1]
Note: prevalence values of 0 or 1, while not statistically erroneous, need checking for accuracy.
What observations have errors?
# Erroneous values for prevalence
data_cases %>%
  dplyr::filter(prev <= 0 | prev >= 1)
# A tibble: 3 × 10
location month year ages total positive xcoord ycoord prev time_order_loc
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mordor 4 2018 15_a… 91 23 -20.0 30.5 25.3 4
2 neverwh… 2 2019 15_a… 22 -1 -20.8 29.6 -0.0455 14
3 neverwh… 3 2018 unde… 25 0 -19.8 30.2 0 3
Comment: We have two rows with nonsensical prev values, 25.3 and -0.0455, and one row with a prev of zero in a given month.
Defensive programming
Note: The use of "::" enables us to call a function from a specific R package. I have had instances where, if the base R stats package (which also defines a filter() function) was called first, an unqualified call to filter() failed.
# Erroneous values for prevalence: this call fails, because stats::filter()
# is the time-series filter, not the dplyr verb
data_cases %>%
  stats::filter(prev < 0 | prev > 1)
We correct the two prevalence values by re-calculating them from positive and total.
It is good practice to leave the original data intact (an advantage of R over Stata).
# Update erroneous values for prevalence
data_prev <- data_cases %>%
  dplyr::mutate(prev_updated = positive/total)
We have a case erroneously reported with a negative value.
What are your options?
Never delete data
Query, and have the data management team make the necessary investigations and corrections
data_prev %>%
  dplyr::filter(prev_updated <= 0 | prev_updated >= 1)
# A tibble: 2 × 11
location month year ages total positive xcoord ycoord prev time_order_loc
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 neverwh… 2 2019 15_a… 22 -1 -20.8 29.6 -0.0455 14
2 neverwh… 3 2018 unde… 25 0 -19.8 30.2 0 3
# ℹ 1 more variable: prev_updated <dbl>
For now (in order to proceed with this demo), we drop the problematic observation.
Why is this not working?
# Filter erroneous values for prevalence, wrong way
data_use <- data_prev %>%
  dplyr::filter(prev_updated >= 0 | prev_updated <= 1)
It is not working because prev_updated >= 0 | prev_updated <= 1 is TRUE for every row: any number satisfies at least one of the two inequalities, so nothing is filtered out. Why is this working? Chaining two filters applies both conditions, i.e. a logical AND:
# Filter erroneous values for prevalence
data_use <- data_prev %>%
  dplyr::filter(prev_updated >= 0) %>%
  dplyr::filter(prev_updated <= 1)
data_use %>%
  dplyr::filter(prev_updated <= 0 | prev_updated >= 1)
# A tibble: 1 × 11
location month year ages total positive xcoord ycoord prev time_order_loc
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 neverwhere 3 2018 unde… 25 0 -19.8 30.2 0 3
# ℹ 1 more variable: prev_updated <dbl>
Schemas
To prevent nonsensical data appearing in your data, you should define a schema that comes along with your recorded data. A schema is a document that states rules for data types and values or ranges to be expected in a particular column of your data frame.
E.g. for prevalence, we know that this should be a real number between zero and one.
The R package validate can be used to create a schema for your data frame:
# Define a schema and confront the data with it
library(validate)

schema <- validate::validator(prev >= 0,
                              prev <= 1,
                              positive >= 0)

out <- validate::confront(data_cases, schema)
summary(out)
name items passes fails nNA error warning expression
1 V1 514 513 1 0 FALSE FALSE prev - 0 >= -1e-08
2 V2 514 513 1 0 FALSE FALSE prev - 1 <= 1e-08
3 V3 514 513 1 0 FALSE FALSE positive - 0 >= -1e-08
Using the schema for the columns prev and positive, we could have readily detected the three problematic entries. For more details, you can have a look at the vignette of the validate package.
Note: Next time when you receive data from your collaborators, you might want to ask them for the associated schema file (e.g. YAML format). Good luck!
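As a sketch of what exchanging such a file can look like with validate (using its export_yaml() helper and the .file argument of validator(); see the package documentation for details):

# Save the schema as a YAML rule file and read it back in
validate::export_yaml(schema, file = "data/schema.yaml")
schema_from_file <- validate::validator(.file = "data/schema.yaml")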
2. Look at summary statistics
Summary stats by location (across all time points)
# Summary statistics
data_use %>%
  dplyr::group_by(location) %>%
  dplyr::summarise(nobs = n(),
                   mean_prev = mean(prev_updated),
                   min_prev = min(prev_updated),
                   max_prev = max(prev_updated))
# A tibble: 5 × 5
location nobs mean_prev min_prev max_prev
<chr> <int> <dbl> <dbl> <dbl>
1 mordor 105 0.314 0.158 0.488
2 narnia 104 0.326 0.08 0.488
3 neverwhere 95 0.301 0 0.486
4 oz 104 0.255 0.0714 0.459
5 wonderland 105 0.382 0.194 0.535
Summary stats by location and year (across all months)
The table is getting longer; it might be too cumbersome to add checks by month and age group. Note: a point to query with the data management team: why did neverwhere have just 3 measurements in 2020?
# Summary statistics by location and year
data_use %>%
  dplyr::group_by(location, year) %>%
  dplyr::summarise(nobs = n(),
                   mean_prev = mean(prev_updated),
                   min_prev = min(prev_updated),
                   max_prev = max(prev_updated))
# A tibble: 15 × 6
# Groups: location [5]
location year nobs mean_prev min_prev max_prev
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 mordor 2018 36 0.318 0.206 0.473
2 mordor 2019 36 0.313 0.170 0.451
3 mordor 2020 33 0.312 0.158 0.488
4 narnia 2018 36 0.340 0.138 0.449
5 narnia 2019 36 0.361 0.216 0.488
6 narnia 2020 32 0.270 0.08 0.483
7 neverwhere 2018 36 0.304 0 0.45
8 neverwhere 2019 56 0.298 0.0370 0.486
9 neverwhere 2020 3 0.307 0.04 0.473
10 oz 2018 35 0.252 0.0714 0.459
11 oz 2019 36 0.254 0.0861 0.446
12 oz 2020 33 0.260 0.112 0.405
13 wonderland 2018 36 0.365 0.255 0.454
14 wonderland 2019 36 0.388 0.194 0.535
15 wonderland 2020 33 0.393 0.276 0.476
Exercise: compare the data_prev and data_use datasets.
- Create a table showing the number of data entries per age group and location for each of them!
- Which age group and location have observations removed? (One possible approach is sketched below.)
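One possible approach, as a sketch using dplyr::count() (other solutions work just as well):

# Number of entries per age group and location, before and after filtering
data_prev %>% dplyr::count(ages, location)
data_use %>% dplyr::count(ages, location)
# Comparing the two tables shows which age group and location lost observations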
Slightly more advanced: the use of lists (not within the scope of this course, but there is a point here).
# Split the data into a list of data frames, one per location
data_use_list <- data_use %>%
  dplyr::group_split(location)
We can then use the purrr library to map a summary function over the list:
# Summary statistics by location, map summary function
library(purrr)
data_use_age_summary <- purrr::map(.x = seq(length(data_use_list)),
                                   .f = function(x){
                                     data_use_list[[x]] %>%
                                       dplyr::group_by(location, year, ages) %>%
                                       dplyr::summarise(nobs = n(),
                                                        mean_prev = mean(prev_updated),
                                                        min_prev = min(prev_updated),
                                                        max_prev = max(prev_updated))
                                   })
Now let’s focus on the first list object (mordor).
We know pregnant mothers and children under 5 are most vulnerable.
The output (ages) isn’t ordered as we would want (chronologically).
# Focus on the first list element (mordor)
data_mordor <- data_use_age_summary[[1]]

data_mordor
# A tibble: 9 × 7
# Groups: location, year [3]
location year ages nobs mean_prev min_prev max_prev
<chr> <dbl> <chr> <int> <dbl> <dbl> <dbl>
1 mordor 2018 15_above 12 0.270 0.206 0.369
2 mordor 2018 5_to_14 12 0.335 0.219 0.427
3 mordor 2018 under_5 12 0.348 0.259 0.473
4 mordor 2019 15_above 12 0.266 0.170 0.377
5 mordor 2019 5_to_14 12 0.278 0.176 0.390
6 mordor 2019 under_5 12 0.394 0.315 0.451
7 mordor 2020 15_above 11 0.255 0.158 0.333
8 mordor 2020 5_to_14 11 0.352 0.258 0.488
9 mordor 2020 under_5 11 0.330 0.190 0.422
How to proceed?
# Summary statistics with age groups
<- c("under_5","5_to_14","15_above")
age_order
<- data_use
data_use_ordered
$age_group <- factor(data_use$ages, levels =age_order)
data_use_ordered
<- data_use_ordered%>%
data_mordor_reordered ::group_by(location, year,age_group)%>%
dplyr::summarise(nobs=n(),
dplyrmean_prev=mean(prev_updated),
min_prev=min(prev_updated),
max_prev=max(prev_updated))%>%
::filter(location=="mordor") dplyr
Let’s compare the two
# Compare for Mordor
data_mordor
# A tibble: 9 × 7
# Groups: location, year [3]
location year ages nobs mean_prev min_prev max_prev
<chr> <dbl> <chr> <int> <dbl> <dbl> <dbl>
1 mordor 2018 15_above 12 0.270 0.206 0.369
2 mordor 2018 5_to_14 12 0.335 0.219 0.427
3 mordor 2018 under_5 12 0.348 0.259 0.473
4 mordor 2019 15_above 12 0.266 0.170 0.377
5 mordor 2019 5_to_14 12 0.278 0.176 0.390
6 mordor 2019 under_5 12 0.394 0.315 0.451
7 mordor 2020 15_above 11 0.255 0.158 0.333
8 mordor 2020 5_to_14 11 0.352 0.258 0.488
9 mordor 2020 under_5 11 0.330 0.190 0.422
data_mordor_reordered
# A tibble: 9 × 7
# Groups: location, year [3]
location year age_group nobs mean_prev min_prev max_prev
<chr> <dbl> <fct> <int> <dbl> <dbl> <dbl>
1 mordor 2018 under_5 12 0.348 0.259 0.473
2 mordor 2018 5_to_14 12 0.335 0.219 0.427
3 mordor 2018 15_above 12 0.270 0.206 0.369
4 mordor 2019 under_5 12 0.394 0.315 0.451
5 mordor 2019 5_to_14 12 0.278 0.176 0.390
6 mordor 2019 15_above 12 0.266 0.170 0.377
7 mordor 2020 under_5 11 0.330 0.190 0.422
8 mordor 2020 5_to_14 11 0.352 0.258 0.488
9 mordor 2020 15_above 11 0.255 0.158 0.333
3. Use of graphs
We need to assess the evolution of prevalence for all regions by month
# Plotting evolution over time
evolution_plot <- ggplot2::ggplot(data = data_use_ordered,
                                  mapping = aes(x = month,
                                                y = prev_updated,
                                                group = location,
                                                colour = location)) +
  ggplot2::geom_line(lwd = 1.1) +
  ggplot2::facet_wrap(~year) +
  ggplot2::theme_bw() +
  ggplot2::xlab("Month of the Year") +
  ggplot2::ylab("Prevalence") +
  ggplot2::scale_x_discrete(limits = factor(1:12),
                            labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
  ggplot2::scale_y_continuous(breaks = seq(from = 0, to = 0.7, by = 0.1))

evolution_plot
Observation: The prevalence graph has vertical lines per month and year, which means we have several subgroups in the prevalence data. We plot facets for the levels of age_group.
# Plotting evolution over time, fix 1
evolution_plot_ages <- ggplot2::ggplot(data = data_use_ordered,
                                       mapping = aes(x = month,
                                                     y = prev_updated,
                                                     group = location,
                                                     colour = location)) +
  ggplot2::geom_line(lwd = 1.1) +
  ggplot2::facet_wrap(age_group ~ year) +
  ggplot2::theme_bw() +
  ggplot2::xlab("Month of the Year") +
  ggplot2::ylab("Prevalence") +
  ggplot2::scale_x_discrete(limits = factor(1:12),
                            labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
  ggplot2::scale_y_continuous(breaks = seq(from = 0, to = 0.7, by = 0.1))

evolution_plot_ages
Observation: Some improvement, but we still have vertical lines; maybe there are other grouping variables. Let’s look only at rows that have more than one entry per location, month, year, and age_group.
# Plotting evolution over time, fix 2
data_use_ordered %>%
  group_by(location, month, year, age_group) %>%
  tally() %>%
  filter(n > 1) %>%
  left_join(data_use_ordered)
# A tibble: 48 × 13
# Groups: location, month, year [8]
location month year age_group n ages total positive xcoord ycoord
<chr> <dbl> <dbl> <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 neverwhere 6 2019 under_5 2 under_5 24 4 -20.6 30.7
2 neverwhere 6 2019 under_5 2 under_5 26 1 -20.5 30.7
3 neverwhere 6 2019 5_to_14 2 5_to_14 27 5 -19.7 30.0
4 neverwhere 6 2019 5_to_14 2 5_to_14 27 8 -19.3 30.2
5 neverwhere 6 2019 15_above 2 15_above 70 31 -19.4 29.4
6 neverwhere 6 2019 15_above 2 15_above 74 27 -19.2 29.2
7 neverwhere 7 2019 under_5 2 under_5 25 5 -20.0 29.1
8 neverwhere 7 2019 under_5 2 under_5 26 4 -20.7 28.6
9 neverwhere 7 2019 5_to_14 2 5_to_14 27 7 -18.8 29.3
10 neverwhere 7 2019 5_to_14 2 5_to_14 23 6 -20.4 29.8
# ℹ 38 more rows
# ℹ 3 more variables: prev <dbl>, time_order_loc <dbl>, prev_updated <dbl>
Observation: OK, we see that within one location there are several prevalence data points; they differ by xcoord and ycoord. In order to plot by location, we could average across xcoord and ycoord within each location. Or maybe those are duplicated recordings, since xcoord and ycoord are very close?
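One way to probe the duplicate-recording hypothesis is to measure how far apart the coordinates lie within each group that has multiple entries (a sketch, reusing the grouping from above):

# Spread of coordinates within each location/month/year/age_group group
data_use_ordered %>%
  group_by(location, month, year, age_group) %>%
  filter(n() > 1) %>%
  summarize(xcoord_range = max(xcoord) - min(xcoord),
            ycoord_range = max(ycoord) - min(ycoord))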
# Plotting evolution over time, fix 3
data_use_ordered %>%
  group_by(location, month, year, age_group) %>%
  summarize(prev_updated_mean = mean(prev_updated),
            prev_updated_min = min(prev_updated),
            prev_updated_max = max(prev_updated)) %>%
  ggplot2::ggplot(mapping = aes(x = month,
                                y = prev_updated_mean,
                                group = location,
                                colour = location)) +
  ggplot2::geom_line(lwd = 1.1) +
  ggplot2::facet_wrap(age_group ~ year) +
  ggplot2::theme_bw() +
  ggplot2::xlab("Month of the Year") +
  ggplot2::ylab("Prevalence") +
  ggplot2::scale_x_discrete(limits = factor(1:12),
                            labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
  ggplot2::scale_y_continuous(breaks = seq(from = 0, to = 0.7, by = 0.1))
Observation: Prevalence is widely variable throughout the year across the locations; on average, wonderland is affected by high prevalence while oz has the lowest prevalence.
We need to check not just prevalence but also the count of cases and the total vulnerable.
# Check case count
data_use_ordered_long <- tidyr::pivot_longer(data = data_use_ordered,
                                             cols = c("positive", "total"),
                                             names_to = "Outcome",
                                             values_to = "counts")
mordor_stacked_bar_graph <- ggplot2::ggplot(data = data_use_ordered_long %>%
                                              dplyr::filter(location == "mordor"),
                                            mapping = aes(x = month,
                                                          y = counts,
                                                          fill = Outcome)) +
  ggplot2::scale_x_discrete(limits = factor(1:12),
                            labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
  ggplot2::geom_bar(position = "stack", stat = "identity") +
  ggplot2::facet_wrap(~year) +
  ggplot2::theme_bw() +
  ggplot2::xlab("Month of the Year") +
  ggplot2::ylab("Count")

mordor_stacked_bar_graph
Observation: The stacked bar graph adds the positive and total counts on top of each other; it is better to show them side by side, as the positive counts are a subset of the total counts. This is specified by the argument position="dodge" in the geom_bar geometry of ggplot2.
# Case count, bar graph dodged
mordor_dodged_bar_graph <- ggplot2::ggplot(data = data_use_ordered_long %>%
                                             dplyr::filter(location == "mordor"),
                                           mapping = aes(x = month,
                                                         y = counts,
                                                         fill = Outcome)) +
  ggplot2::scale_x_discrete(limits = factor(1:12),
                            labels = c("J","F","M","A","M","J","J","A","S","O","N","D")) +
  ggplot2::geom_bar(position = "dodge", stat = "identity") +
  ggplot2::facet_wrap(~year) +
  ggplot2::theme_bw() +
  ggplot2::xlab("Month of the Year") +
  ggplot2::ylab("Count")

mordor_dodged_bar_graph
The mosquito data set
Let’s take a look at the mosq_data dataset.
We check the sanity of this data set by displaying a table of recorded values per column:
mosq_data %>%
  map(function(x) table(x))
$session
x
1 2
52 52
$Village
x
naernia narnia
2 102
$Compound.ID
x
1 2 3 4
26 26 26 26
$Method
x
ALC HLC
1 103
$Location
x
Indoor Outdoor
52 52
$hour
x
01h-02h 02h-03h 03h-04h 04h-05h 05h-06h 06h-07h 07h-08h 19h-20h 20h-21h 21h-22h
8 8 8 8 8 8 8 8 8 8
22h-23h 23h-24h 24h-01h
8 8 8
$ag.Male
x
0 3 4 5 6 7 14 16 20 22 27 35
93 1 1 1 1 1 1 1 1 1 1 1
$Ag.unfed
x
0 1 2 3 4 5 6 7 8 10 20
57 13 7 8 4 4 2 4 2 1 2
$Ag.halffed
x
0 3 4 5 8 9
92 3 3 3 1 2
$Ag.fed
x
0 1 3 5
88 7 3 6
$Ag.grsgr
x
0 1 2 3 4 6 8 12 17 20 23 27 35 37
70 13 6 1 2 1 3 2 1 1 1 1 1 1
$tot.gamb
x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 23
40 10 10 9 8 6 3 2 1 1 5 2 1 2 2 2
$Culex.male
x
0
104
$Culex.female
x
0 1 2
94 9 1
$Mansonia.male
x
0 1
103 1
$Mansonia.female
x
0 1 2
90 11 3
$Aedes.male
x
0
104
$Aedes.female
x
0 1 2
98 5 1
Looks like we have some typos in the names for Method and Village.
- Create a schema that provides rules for the strings (i.e. words) to be expected in the columns Method and Village.
- Use the syntax from here.
schema <- validate::validator(Method %in% c("HLC"),
                              Village %in% c("narnia"))

out <- validate::confront(mosq_data, schema)
summary(out)
name items passes fails nNA error warning expression
1 V1 104 103 1 0 FALSE FALSE Method %vin% c("HLC")
2 V2 104 102 2 0 FALSE FALSE Village %vin% c("narnia")
The columns Village and Method seem to have some data entry errors. We need to correct for that.
mosq_data <- mosq_data %>%
  mutate(Method = ifelse(Method == "ALC", "HLC", Method),
         Village = ifelse(Village == "naernia", "narnia", Village))
It looks like several columns concern Anopheles gambiae population sizes. Let’s change the column names using rename from the tidyverse package.
mosq_data %>%
  rename("AnophelesGambiae.male" = "ag.Male",
         "AnophelesGambiae.unfed" = "Ag.unfed",
         "AnophelesGambiae.halffed" = "Ag.halffed",
         "AnophelesGambiae.fed" = "Ag.fed",
         "AnophelesGambiae.gravid" = "Ag.grsgr") -> mosq_data
It seems like tot.gamb should count the total number of Anopheles gambiae collected. Let’s check:
mosq_data %>%
  mutate(AnophelesGambiae_total = AnophelesGambiae.male + AnophelesGambiae.unfed +
           AnophelesGambiae.halffed + AnophelesGambiae.fed + AnophelesGambiae.gravid) -> mosq_data

mosq_data %>%
  filter(AnophelesGambiae_total != tot.gamb) %>%
  select(AnophelesGambiae_total, tot.gamb)
# A tibble: 11 × 2
AnophelesGambiae_total tot.gamb
<dbl> <dbl>
1 12 0
2 16 2
3 0 6
4 24 8
5 24 1
6 74 12
7 54 3
8 70 1
9 34 2
10 40 2
11 46 0
OK, so 11 out of 104 rows have this discrepancy. Let’s rather keep AnophelesGambiae_total, since it was calculated from the data.
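If we commit to the recalculated total, one option is to drop the reported column so it is not used by accident later (a sketch; whether to drop it is a judgment call, and the raw file still holds the original):

# Keep the recalculated total; drop the inconsistent reported one
mosq_data <- mosq_data %>% select(-tot.gamb)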
Since the status of the Anopheles is mutually exclusive in the HLC data, we can draw a stacked bar chart, with the bar color defined by the status. To produce such a graph efficiently in ggplot2
, we need to pivot the table.
Here in particular we want to switch from a wide format to a long format table in order to obtain a column describing the status of the Anopheles mosquitoes. We will use in particular the names_sep argument of the pivot_longer function to separate e.g. the column name AnophelesGambiae.male and use male as a level in a new column called status. The same goes for the other column names.
Setting the grouping variables to session, Village, Compound.ID, Method, Location, hour, and AnophelesGambiae_total will help to keep those variables in the long format table.
mosq_data %>%
  group_by(session, Village, Compound.ID, Method, Location, hour, AnophelesGambiae_total) %>%
  select(contains("AnophelesGambiae.")) %>%
  pivot_longer(cols = contains("AnophelesGambiae."),
               names_sep = "AnophelesGambiae.",
               names_to = c(NA, "status"),
               values_to = "AnophelesGambiae") -> mosq_data_gamb_wide
mosq_data_gamb_wide %>%
  ggplot() +
  geom_bar(aes(x = hour, y = AnophelesGambiae, fill = status), position = "stack", stat = "identity") +
  scale_x_discrete(guide = guide_axis(angle = 60))
Observation: We had several values for Compound.ID. The geom_bar geometry automatically adds them up in the graph. We can use facet_wrap to see those strata:
mosq_data_gamb_wide %>%
  ggplot() +
  geom_bar(aes(x = hour, y = AnophelesGambiae, fill = status), position = "stack", stat = "identity") +
  scale_x_discrete(guide = guide_axis(angle = 60)) +
  facet_wrap(~Compound.ID)
We can also use our variable AnophelesGambiae_total and plot it as a line graph on top of the bar graph:
mosq_data_gamb_wide %>%
  mutate(grouping = paste0(Compound.ID, Location, session)) %>%
  ggplot() +
  geom_bar(aes(x = hour, y = AnophelesGambiae, fill = status), position = "stack", stat = "identity") +
  geom_line(aes(x = hour, y = AnophelesGambiae_total, group = grouping)) +
  scale_x_discrete(guide = guide_axis(angle = 60)) +
  facet_wrap(~Compound.ID + session + Location)