Tutorial on the evaluation of vaccines with administrative data
A companion to the WHO reference guide
## Getting started
Before starting, you should install R and RStudio on your computer. R can be downloaded from https://cran.r-project.org/mirrors.html, and RStudio can be downloaded from https://rstudio.com/products/rstudio/. Both are free. Once both programs are installed, open RStudio, and either open the .Rmd file practical exercises.Rmd or create a new R Markdown document via File/New File/R Notebook.
Load the libraries we need for this exercise. The first time you do this, run the code that is commented out with the install.packages function. You should only have to do this once.
#install.packages(c('lubridate', 'RCurl','devtools','xtable','knitr','htmlTable','coda','rmdformats','httr'))
#library(devtools)
#devtools::install_github('https://github.com/weinbergerlab/InterventionEvaluatR')
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(RCurl)
library(knitr)
library(htmlTable)
library(InterventionEvaluatR)
library(coda)
library(rmdformats)
library(httr)
library(xtable)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(ggplot2)
Introduction to the tutorial
In this tutorial, you will learn about:
1. The types of administrative data
2. The minimum requirements of the data for performing a credible analysis
3. How to extract relevant information from these databases
4. How to format the data into time series
5. How to perform visual quality checks
6. How to fit models to the data to evaluate vaccine impact
7. How to interpret and describe the results
This tutorial is designed for individuals without extensive experience in R. If you want to learn about the R software that underlies these analyses, you can learn some basics in this other tutorial: https://weinbergerlab.shinyapps.io/SurveillanceIntroR/. In this tutorial, we will format and analyze publicly available data from Ecuador. These data, for 2001-2015, were downloaded from the Instituto Nacional de Estadisticas y Censos: https://www.ecuadorencifras.gob.ec/defunciones-generales-y-fetales-bases-de-datos/
The key variables we will work with are:
- cause: the main cause of death ('causa basica'), with the first 3 digits of the ICD10 code
- date: formatted as YYYY-MM-01
- age.years: age in years
This dataset has all deaths for children <59 months of age.
Pneumococcal conjugate vaccines were introduced in Ecuador as follows:
- PCV7 in Aug-2010
- PCV10 (2+1) in Feb-2011
- PCV10 (3+0) in Feb-2014
The data are stored as an R dataset.
Read in and explore data
githubURL <- ("https://raw.githubusercontent.com/DanWeinberger/who-guidance-materials/master/Practical%20exercises/ecuador%20raw/u5.deaths.2001.2015.rds")
download.file(githubURL,"ecuador_raw.rds", method="curl") #downloads file to working directory
d1 <- readRDS("ecuador_raw.rds") #imports file into R and saves as object 'd1'
Look at the first few lines
head(d1)
##          date  age.years cause
## 17 2001-01-01 0.58630137 Q89
## 26 2001-01-01 1.08767123 A09
## 37 2001-01-01 0.08493151 P23
## 61 2001-01-01 0.58630137 J18
## 71 2001-01-01 0.58630137 J96
## 80 2001-02-01 0.25205479 J18
Make sure R knows that the ‘date’ variable is a date. We tell it that the date has a 4-digit year (%Y), a 2-digit month (%m), and a 2-digit day (%d), separated by dashes.
d1$date<-as.Date(d1$date,"%Y-%m-%d")
Let’s also make sure that cause is stored as a character variable
d1$cause<-as.character(d1$cause)
Basic explorations: what is the distribution of ages? Of dates? (Make a histogram of each.)
hist(d1$age.years)
hist(d1$date, breaks=10)
This shows there are some erroneous dates: there shouldn’t be any before Jan 2001. Filter them out here.
d1<-d1[d1$date>=as.Date('2001-01-01'),]
hist(d1$date, breaks=10)
Which codes are most commonly used in this database?
Make a table of the 50 most common codes
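The chunk that produced the table below is not shown; a minimal sketch using base R that would tabulate the 50 most frequent cause-of-death codes is:
sort(table(d1$cause), decreasing=TRUE)[1:50] #count each cause code and keep the 50 most common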
##
## J18 P07 P36 P22 A09 Q24 P23 R07 R99 P20 P24 R50 Q89 E43 J20 P28
## 6844 6650 3035 2605 2516 2137 1501 1455 1388 1355 1339 1338 1238 1188 1125 1115
## A41 R09 P21 W74 V09 X59 R63 R05 J96 A04 Q03 P29 P39 V89 P77 Q21
## 1028 851 827 713 695 625 603 578 557 530 527 503 498 482 442 427
## W84 E46 G80 Q90 I46 I50 Q79 R95 P02 Q25 W80 P96 G03 P52 R11 Q20
## 414 403 402 399 395 394 389 385 374 371 360 357 350 336 332 321
## E86 J69
## 318 305
This shows that J18 (pneumonia, cause unspecified) is the most commonly-coded cause of death. P07 is a perinatal code (disorders of newborns related to short gestation and low birthweight, not otherwise specified). A09 is infectious gastroenteritis and colitis.
The ICD10 codes are organized into broad ‘chapters’ of related conditions, roughly based on organ system. The first letter of the ICD code indicates the chapter. Let’s look at the frequency of the ICD10 chapters to see bigger-picture patterns.
chapter.prim<-substr(d1$cause,1,1)
sort(table(chapter.prim),decreasing=T)
## chapter.prim
## P J R Q A W E G I V X C D
## 23620 9979 8316 7961 4468 2515 2455 1763 1722 1338 1240 959 697
## K Y B N O M L S H F j
## 679 438 365 365 115 20 12 10 9 4 2 1
This shows that the perinatal codes (P) are most common, followed by respiratory codes (J), and poorly-specified causes (included in the R chapter).
Create variables that we need for analysis
### Pneumonia
#Initialize variables
d1$j12_j18<-rep(0, nrow(d1))
d1$j12_j18[d1$cause %in% c('J12', 'J13', 'J14', 'J15', 'J16', 'J17', 'J18') ]<-1
table(d1$j12_j18)
##
## 0 1
## 62760 7037
Let’s aggregate now by date
d2 <- d1 %>%
group_by(date) %>%
summarize('j12_j18'=sum(j12_j18)) %>%
ungroup()
Plot your time series
p1 <- ggplot(d2, aes(x=date, y=j12_j18))+
geom_line() + #line plot
ylim(0,NA) + #extend yaxis to 0
theme_classic() #make it look nice
p1
## Warning: Removed 1 row(s) containing missing values (geom_path).
Create some age groups and stratify the data
- 0 if <2 mo
- 1 if 2-11 mo
- 2 if 12-23 mo
- 3 if 24-59 mo
Create the variable:
d1$agec <- NA
d1$agec[d1$age.years>=0 & d1$age.years<2/12] <-0
d1$agec[d1$age.years>=2/12 & d1$age.years<1] <-1
d1$agec[d1$age.years>=1 & d1$age.years<2] <-2
d1$agec[d1$age.years>=2 & d1$age.years<5] <-3
d1$agec <- factor(d1$agec, labels=c('<2m','2-11m','12-23m','24-59m'))
d1 <- d1[!is.na(d1$agec),]
d4 <- d1 %>%
group_by(agec,date) %>%
summarize('j12_j18'=sum(j12_j18)) %>%
ungroup()
## `summarise()` has grouped output by 'agec'. You can override using the `.groups`
## argument.
p2 <- ggplot(d4, aes(x=date, y=j12_j18)) +
geom_line() +
theme_classic() +
ylim(0,NA) +
facet_wrap(~agec, scales='free')
p2
Extract some control variables
- Flag any records that have a potential pneumococcal-related code anywhere in the record
# possible_pneumo_code: any pneumo code anywhere
d1$possible_pneumo_code <- 0
d1$possible_pneumo_code[d1$cause %in% c("A40","A49","B953","R652","H10","H65", "H66","G00","G01","G02","G03","G04")] <-1
A00-B99 primary cause of death:
d1$A00_B99_prim <- ifelse(c(substr(d1$cause, 1, 1) %in% c("A","B")), 1, 0)
d1$A00_B99_prim <- ifelse(c(substr(d1$cause, 1, 2) =="A0" | # Exclude A00-A09 (rotavax)
d1$possible_pneumo_code==1), 0, d1$A00_B99_prim)
Check your work:
table(d1$A00_B99_prim, d1$possible_pneumo_code) # Good.
##
## 0 1
## 0 66634 748
## 1 1671 0
Also extract some more specific sub-chapter codes
table(d1$cause[substr(d1$cause, 1, 2)=="A1"])##
## A15 A16 A17 A18 A19
## 1 43 12 2 6
d1$A15_A19_prim <- ifelse(c(substr(d1$cause, 1, 2) =="A1"), 1, 0)
d1$A15_A19_prim <- ifelse(d1$possible_pneumo_code==1, 0, d1$A15_A19_prim)
table(d1$cause[substr(d1$cause, 1, 2)=="A1"], d1$possible_pneumo_code[substr(d1$cause, 1, 2)=="A1"])##
## 0
## A15 1
## A16 43
## A17 12
## A18 2
## A19 6
Check your work
table(d1$A15_A19_prim, d1$possible_pneumo_code)
##
## 0 1
## 0 68241 748
## 1 64 0
And let’s add in a few more controls
d1$p_prim<- 1*(substr(d1$cause,1,1)=='P') #logical test for whether code starts with P
d1$p_prim[d1$possible_pneumo_code==1]<-0 #don't count if a pneumococcal code is present
d1$q_prim<- 1*(substr(d1$cause,1,1)=='Q') #logical test for whether code starts with Q
d1$q_prim[d1$possible_pneumo_code==1]<-0 #don't count if a pneumococcal code is present
d1$w_prim<- 1*(substr(d1$cause,1,1)=='W') #logical test for whether code starts with W
d1$w_prim[d1$possible_pneumo_code==1]<-0 #don't count if a pneumococcal code is present
d1$g_prim<- 1*(substr(d1$cause,1,1)=='G') #logical test for whether code starts with G
d1$g_prim[d1$possible_pneumo_code==1]<-0 #don't count if a pneumococcal code is present
Now aggregate the outcome and the controls
One important step here is to make sure that every time step is represented in the data, even if there are no cases. We do this using tidyr::complete
d2 <- d1 %>%
group_by(agec, date) %>%
summarize('j12_j18'=sum(j12_j18) ,
'A15_A19_prim'=sum(A15_A19_prim),
'A00_B99_prim'=sum(A00_B99_prim),
'p_prim'=sum(p_prim),
'q_prim'=sum(q_prim),
'w_prim'=sum(w_prim),
'g_prim'=sum(g_prim)
) %>%
tidyr::complete(date=seq.Date(min(date, na.rm=T), max(date, na.rm=T), 'month'),
fill=list(j12_j18=0, A15_A19_prim=0,
A00_B99_prim=0,
p_prim=0,
q_prim=0,
w_prim=0,
g_prim=0
)) %>%
ungroup()
## `summarise()` has grouped output by 'agec'. You can override using the `.groups`
## argument.
Set up for analysis
### Troubleshooting notes
If the analyses here are not working or are giving an error, 90% of the time it is because either your date variable is not in the correct format or the data are not correctly sorted. Use as.Date() to tell R that the date variable is a date. Check the variable types using str(ds). The data should be sorted by stratum (e.g., age group) and then by date.
### Load the data
We are now going to switch to a pre-formatted time series from Ecuador that has a larger number of control variables and has been more extensively cleaned. It was created using steps similar to the ones above.
Let’s load the data, which are included with the InterventionEvaluatR package
data(ecuador_mortality, package = "InterventionEvaluatR") #load the data
head(ecuador_mortality[,1:5]) #View first few rows and columns
##          age_group  monthdate J12_J18_prim acm_noj_prim A00_B99_prim
## 11776 ec 2-23m A 2005-01-01 39 146 4
## 11777 ec 2-23m A 2005-02-01 43 128 4
## 11778 ec 2-23m A 2005-03-01 43 115 7
## 11779 ec 2-23m A 2005-04-01 37 138 4
## 11780 ec 2-23m A 2005-05-01 53 159 6
## 11781 ec 2-23m A 2005-06-01 29 128 9
ds<-ecuador_mortality
Ensure your date variable is in an R date format. If your variable is in a character or factor format, you need to tell R the format:
- %m is a 2-digit month; %b is a month abbreviation (e.g., Jan, Feb)
- %d is a 2-digit day (01-31)
- %Y is a 4-digit year (e.g., 2011); %y is a 2-digit year (e.g., 11)
These codes are separated by a dash, slash, or space. Modify the tryFormats argument below if needed to match the format in your dataset.
ds$monthdate<-as.Date(ds$monthdate, tryFormats=c('%Y-%m-%d',
'%m-%d-%Y',
'%m/%d/%Y',
'%Y/%m/%d',
'%d/%m/%Y'
) )
Check the variable types
str(ds)
## 'data.frame': 432 obs. of 198 variables:
## $ age_group : Factor w/ 3 levels "ec 2-23m A","ec 2-59m A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ monthdate : Date, format: "2005-01-01" "2005-02-01" ...
## $ J12_J18_prim: int 39 43 43 37 53 29 26 30 22 20 ...
## $ acm_noj_prim: int 146 128 115 138 159 128 110 122 100 113 ...
## $ A00_B99_prim: int 4 4 7 4 6 9 3 3 5 2 ...
## $ A00_A09_prim: int 24 14 12 23 21 16 30 18 14 9 ...
## $ A15_A19_prim: int 0 0 1 2 0 0 0 1 0 0 ...
## $ A20_A28_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ A30_49_prim : int 4 1 3 0 4 7 1 1 3 1 ...
## $ A50_A64_prim: int 0 0 0 0 0 0 0 0 0 1 ...
## $ A65_A69_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ A70_A74_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ A75_A79_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ A80_A89_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ A09_A99_prim: int 0 0 2 1 1 0 1 0 0 0 ...
## $ B00_B09_prim: int 0 1 0 0 0 0 0 0 0 0 ...
## $ B15_B19_prim: int 0 0 0 0 0 1 0 0 0 0 ...
## $ B20_B24_prim: int 0 1 0 0 1 0 1 1 0 0 ...
## $ B25_B34_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B35_B49_prim: int 0 0 0 0 0 1 0 0 2 0 ...
## $ B50_B64_prim: int 0 1 1 1 0 0 0 0 0 0 ...
## $ B65_B83_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B85_B89_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B90_B94_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B95_B98_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ B99_B99_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ C00_D48_prim: int 7 8 1 3 3 5 3 3 2 3 ...
## $ C00_C97_prim: int 2 4 1 0 1 2 0 2 1 2 ...
## $ D00_D09_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ D10_D36_prim: int 0 0 0 0 0 0 0 0 1 0 ...
## $ D37_D48_prim: int 1 0 0 0 0 0 0 0 0 0 ...
## $ D50_D53_prim: int 1 1 0 0 0 0 2 0 0 1 ...
## $ D55_D59_prim: int 0 0 0 0 0 1 0 0 0 0 ...
## $ D60_D64_prim: int 1 3 0 1 1 0 0 1 0 0 ...
## $ D65_D69_prim: int 2 0 0 2 1 2 1 0 0 0 ...
## $ D70_D77_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ D80_D89_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ E00_E90_prim: int 17 17 15 17 13 8 17 10 11 11 ...
## $ E00_E07_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ E10_E14_prim: int 1 0 0 0 0 0 0 1 0 0 ...
## $ E15_E16_prim: int 0 0 1 0 0 0 0 0 0 0 ...
## $ E20_E35_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ E40_E46_prim: int 11 14 14 15 12 6 13 8 9 8 ...
## $ E50_E64_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ E65_E68_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ E70_E90_prim: int 5 3 0 2 1 2 4 1 2 3 ...
## $ F00_F99_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G00_G99_prim: int 3 3 5 0 2 3 1 6 2 3 ...
## $ G00_G09_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G10_G14_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G20_G26_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G30_G32_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G35_G37_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G40_G47_prim: int 0 2 3 0 0 0 0 1 1 0 ...
## $ G50_G59_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G60_G64_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G70_G73_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ G80_G83_prim: int 0 1 0 0 1 3 0 2 1 2 ...
## $ G90_G99_prim: int 3 0 2 0 1 0 1 3 0 1 ...
## $ H00_H59_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H60_H95_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H00_H06_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H10_H13_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H15_H22_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H25_H28_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H30_H36_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H40_H42_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H43_H45_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H46_H48_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H49_H52_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H53_H54_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H55_H59_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H60_H62_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H65_H75_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H80_H83_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ H90_H95_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ I00_I99_prim: int 6 9 4 5 3 3 2 8 6 10 ...
## $ I00_I02_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ I05_I09_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ I10_I15_prim: int 1 0 0 0 0 1 0 1 0 0 ...
## $ I20_I25_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ I26_I28_prim: int 0 1 0 0 0 0 0 0 0 2 ...
## $ I30_I52_prim: int 5 6 4 5 2 2 1 7 6 8 ...
## $ I60_I69_prim: int 0 1 0 0 1 0 1 0 0 0 ...
## $ I70_I79_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ I80_I89_prim: int 0 1 0 0 0 0 0 0 0 0 ...
## $ I95_I99_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ K00_K99_prim: int 1 2 0 3 2 0 0 2 0 6 ...
## $ K00_K14_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ K20_K31_prim: int 0 0 0 0 0 0 0 0 0 1 ...
## $ K35_K38_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ K40_K46_prim: int 0 0 0 0 0 0 0 0 0 1 ...
## $ K50_K52_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## $ K55_K63_prim: int 1 1 0 2 0 0 0 0 0 1 ...
## $ K65_K67_prim: int 0 0 0 0 0 0 0 0 0 1 ...
## $ K70_K77_prim: int 0 0 0 1 1 0 0 1 0 1 ...
## $ K80_K87_prim: int 0 1 0 0 0 0 0 0 0 0 ...
## $ K90_K93_prim: int 0 0 0 0 1 0 0 1 0 1 ...
## $ L00_L99_prim: int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
Set parameters for analysis
Here we need to set a few parameters. We use the evaluatr.init() function to specify the name of the dataset, the date at which the vaccine is introduced, and the date at which we want to begin evaluating the vaccine (typically 1-2 years after vaccine introduction). We also provide some information on the dataset, such as whether the data are monthly or quarterly (n_seasons), the variable names for the grouping variable, the date variable, the outcome variable, and the denominator variable (if any). You can also set the number of iterations for the MCMC. The default is to use a burn-in period of 5000 iterations and to sample 10,000 iterations after the burn-in. This is a decent place to start. After evaluating model convergence (see below), you might want to increase the burn-in period.
analysis <- evaluatr.init(
country = "Ecuador", data = ds,
post_period_start = "2010-08-01", #First 'post-intervention' month is August 2010
eval_period_start = "2011-08-01", #We ignore first 12 month of data to allow for vaccine ramp up
eval_period_end = "2015-12-01", #The evaluation period lasts 4 years
n_seasons = 12, #This is monthly data, so select 12
year_def = "cal_year", # we are in southern hemisphere, so aggregate results by calendar year (Jan-Dec)
group_name = "age_group", #Strata categry name
date_name = "monthdate", #Date variable name
outcome_name = "J12_J18_prim", #Outcome variable name
denom_name = "acm_noj_prim" , #Denominator variable name
log.covars=TRUE, #log-transform the covariates
set.burnN=5000,
set.sampleN=10000
)
set.seed(1)
Sort data by age group and month
ds<-ds[order(ds[, analysis$group_name], ds[,analysis$date_name]),] #Sort data by age group and month
Run analyses
The evaluatr.impact() function runs an interrupted time series analysis, an alternative time-trend analysis, synthetic controls, and STL+PCA. All of these results are saved in the object ‘impact_results’.
Run a simple control analysis, controlling for 1 control variable at a time
Before getting into more complicated analyses, we will first fit a simple Poisson regression model (with overdispersion) in which we adjust for seasonality and 1 control variable at a time. These models are fit to data from the pre-vaccine period and extrapolated to the post-vaccine period. Fitting 1 control variable at a time allows us to see how the use of different controls influences the results. This takes a few minutes, so we will use a pre-run set of results.
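To make the model structure concrete, here is an illustrative sketch of the kind of single-control model being described (this is not the package's internal code, and evaluatr.univariate() may handle the covariate and offset differently). It fits a seasonal quasi-Poisson model to the pre-vaccine data for one age group, with one log-transformed control:
#Illustrative sketch only: seasonal quasi-Poisson regression with a single log-transformed control
pre <- ds[ds$monthdate < as.Date('2010-08-01') & ds$age_group=='ec 2-23m A', ] #pre-vaccine data, one stratum
pre$month <- factor(format(pre$monthdate, '%m')) #month-of-year factor to capture seasonality
fit1 <- glm(J12_J18_prim ~ month + log(acm_noj_prim + 0.5), family=quasipoisson, data=pre)
summary(fit1) #exponentiate coefficients to interpret them as rate ratios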
glmer_results= evaluatr.univariate(analysis)
Then plot the results. The results are ordered by goodness of fit (based on AIC scores), with the best-fitting covariates on top. Each plot represents a different age group. Overall, we see a generally consistent pattern. The use of the subchapter R00-R09 as a control variable leads to estimates that are closer to 1 (no effect). This subchapter is “Symptoms and signs involving the circulatory and respiratory systems”. These are often considered non-specific ‘junk’ codes. There could be arguments for or against using this subchapter as a control. On the downside, it is possible that actual pneumonia deaths were incorrectly assigned a code in R00-R09, and the vaccine could therefore reduce the incidence of R00-R09 codes and bias the estimates towards no effect. On the upside, the use of these junk codes as a control could help to adjust for underlying improvements or changes in coding quality.
par(mar=c(4,5,1,1)) #fix margins
group.labels<-as.character(unique(analysis$input_data[,analysis$group_name]))
lapply(glmer_results,evaluatr.univariate.plot)
## [[1]]
##
## [[2]]
Run Synthetic control analysis
For teaching purposes, this code has been pre-run since it takes some time and computational resources.
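If you have previously run and saved these results, you could load the saved object instead of re-fitting (the path matches the saveRDS() call below); uncomment to use:
#impact_results <- readRDS('./Results/ec_sc.rds') #load pre-run synthetic control results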
impact_results = evaluatr.impact(analysis)
saveRDS(impact_results,'./Results/ec_sc.rds')
Were there any strata that were too sparse to analyze?
Compare estimates from different models
This shows the estimated rate ratio and 95% credible intervals from a synthetic controls analysis; a time-trend analysis where we used the specified denominator (all non-respiratory deaths) to adjust the number of pneumonia deaths in each month and a linear trend for time; a classic interrupted time series analysis (segmented regression); and the STL+PCA approach, which smooths and combines the control variables prior to including them in the model.
results.table<- cbind.data.frame(
impact_results$full$rr_mean_intervals,
impact_results$time$rr_mean_intervals,
impact_results$its$rr_mean_intervals,
impact_results$pca$rr_mean_intervals)
table<-xtable(results.table)
htmlTable(table)
| Stratum | Synthetic controls Estimate (95% CI) | Time trend Estimate (95% CI) | Classic ITS (95% CI) | STL+PCA Estimate (95% CI) |
|---|---|---|---|---|
| ec 2-23m A | 0.8 (0.62, 1.06) | 0.83 (0.66, 1.06) | 0.93 (0.75, 1.14) | 1.05 (0.81, 1.35) |
| ec 24-59m A | 0.67 (0.58, 0.78) | 0.84 (0.57, 1.24) | 0.71 (0.49, 1.02) | 0.73 (0.39, 1.36) |
Cases averted
How many cases were prevented from the time of vaccine introduction to the last time point in each stratum (+/- 95% CrI)? You can modify the number ‘last.point’ to pull out the cumulative number of cases prevented at any point in the time series. In this case we are printing results from the SC model.
last.point<-dim(impact_results$full$cumsum_prevented)[1]
cum.prevented<-impact_results$full$cumsum_prevented[last.point,,]
Format and print table
cum1<- round(t(cum.prevented))
cum2<- paste0(cum1[,'50%'], ' (', cum1[,'2.5%'],', ',cum1[,'97.5%'],')')
cum3<-cbind.data.frame(row.names(cum1), cum2)
names(cum3)<-c('Stratum','Cases Averted (95% CrI)')
htmlTable(cum3, align='l')
|   | Stratum | Cases Averted (95% CrI) |
|---|---|---|
| 1 | ec 2-23m A | 309 (-20, 718) |
| 2 | ec 24-59m A | 162 (96, 235) |
Number of variables selected in SC analysis
The synthetic controls analysis uses a Bayesian variable selection algorithm to weight the candidate covariates. In each MCMC iteration, it tests a different combination of variables. The model size indicates how many variables are selected in any given model. If <1 variable is selected on average, this indicates that no suitable control variables were identified. In this example, 1-2 variables were selected on average in the 2-23m age group, while no suitable controls were identified in the 24-59m age group (the average model size is <1 (0.43)).
model_size = data.frame(t(analysis$model_size))
htmlTable(model_size, align='c')
|   | ec.2.23m.A | ec.24.59m.A |
|---|---|---|
| 1 | 2.05 | 0.43 |
Inclusion Probabilities
incl_probs <- NULL
for (group in analysis$groups) {
incl_prob <- impact_results$full$groups[[group]]$inclusion_probs[-c(1:(analysis$n_seasons - 1)), ]
incl_prob <- incl_prob[order(-incl_prob$inclusion_probs), ]
incl_prob <- incl_prob[c(1:3), ]
incl_prob2 <- round(incl_prob[, 2],2)
incl_prob_names <- incl_prob[, 1]
incl_prob3 <- data.frame("Group" = group, "Greatest Inclusion Variable" = incl_prob_names[1], "Greatest Inclusion Probability" = incl_prob2[1], "Second Greatest Inclusion Variable" = incl_prob_names[2], "Second Greatest Inclusion Probability" = incl_prob2[2], "Third Greatest Inclusion Variable" = incl_prob_names[3], "Third Greatest Inclusion Probability" = incl_prob2[3], check.names = FALSE)
incl_probs <- rbind(incl_probs, incl_prob3)
}
rownames(incl_probs) <- NULL
Here we can see which variables received the most weight in the models for each stratum. Values range from 0 to 1, with values closer to 1 indicating that the variable was included in a larger proportion of the variable combinations that were tested. A value of 1 would indicate that the variable was always included in the model; a value of 0.48 would indicate the variable was included in 48% of the models that were tested.
htmlTable(incl_probs)
|   | Group | Greatest Inclusion Variable | Greatest Inclusion Probability | Second Greatest Inclusion Variable | Second Greatest Inclusion Probability | Third Greatest Inclusion Variable | Third Greatest Inclusion Probability |
|---|---|---|---|---|---|---|---|
| 1 | ec 2-23m A | P00_P96_prim | 0.52 | R00_R09_prim | 0.34 | A00_A09_prim | 0.33 |
| 2 | ec 24-59m A | acm_noj_prim | 0.1 | R00_R99_prim | 0.1 | V01_X59_prim | 0.09 |
### Sensitivity analysis
In the sensitivity analyses, we re-run the synthetic controls analysis but sequentially drop the top 1, 2, or 3 variables that were most influential (received the most weight) in the original analysis. This allows us to evaluate whether the results are sensitive to the inclusion of a specific variable.
Load the pre-run results
The purpose of this analysis is to evaluate the sensitivity of the synthetic controls results to the inclusion of specific control variables. The model is first fit with all variables, then the top-weighted variable is excluded, and the model is re-run. If the results change drastically after dropping 1 variable, this indicates that that variable is highly influential on the results. If the results don’t change, this indicates that the results are robust to the exclusion of that variable (this assumes that the variable that is excluded received a decent amount of weight prior to exclusion)
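The chunk that creates sensitivity_results is not shown here. As a sketch, it could either be generated with the package's sensitivity function (the function name below is an assumption based on the InterventionEvaluatR vignette; check the package documentation) or loaded from a previously saved copy (the file path is hypothetical):
#sensitivity_results <- evaluatr.sensitivity(analysis) #assumed function name; slow to run
#sensitivity_results <- readRDS('./Results/ec_sensitivity.rds') #hypothetical path to pre-run results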
if (exists("sensitivity_results")) {
htmlTable(sensitivity_results$sensitivity_table_intervals)
}
| Stratum | Estimate (95% CI) | Top Control 1 | Inclusion Probability of Control 1 | Control 1 Estimate (95% CI) | Top Control 2 | Inclusion Probability of Control 2 | Control 2 Estimate (95% CI) | Top Control 3 | Inclusion Probability of Control 3 | Control 3 Estimate (95% CI) |
|---|---|---|---|---|---|---|---|---|---|---|
| ec 2-23m A | 0.8 (0.62, 1.06) | P00_P96_prim | 0.52 | 0.8 (0.6, 1.07) | R00_R09_prim | 0.34 | 0.75 (0.59, 0.93) | A00_A09_prim | 0.33 | 0.68 (0.58, 0.85) |
| ec 24-59m A | 0.67 (0.58, 0.78) | acm_noj_prim | 0.1 | 0.67 (0.58, 0.77) | R00_R99_prim | 0.1 | 0.66 (0.58, 0.76) | V01_X59_prim | 0.09 | 0.67 (0.59, 0.77) |
Generate and save the plots
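The plots object used below is not created in the code shown above. A minimal sketch, assuming the plotting helper is evaluatr.plots() as in the InterventionEvaluatR vignette:
plots <- evaluatr.plots(analysis) #generate the standard set of plots for each stratum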
Plot the results for 1 age group
First look at the results from the synthetic controls model for 1 age group.
This first plot shows a monthly time series, with observed, fitted, and counterfactual values. The observed number of deaths is shown as the black line. The fitted values for the pre-vaccine period are shown as the red dotted line, and the counterfactual estimate with its 95% credible interval is shown as the white dotted line and gray shaded area. If the black line is below the gray shaded area, this would indicate that observed cases are lower than expected based on changes in the control diseases in the post-vaccine period. If the controls appropriately adjust for underlying trends, then this would reflect an effect of the vaccine.
plots$groups[["ec 2-23m A"]]$pred_full It is sometimes easier to look at the results if we aggregate the observed and expected values up to the annual time scale. Here the observed values are shown as black dots. When the black dots go below the gray shaded area, this indicates that the observed cases are lower than expected based on changes in the control diseases in the post-vaccine period. If the controls appropriately adjust for underlying trends, then this would reflect an effect of the vaccine. The vertical line indicates the date of vaccine introduction. The year when the vaccine is introduced is included to the right of the line, even if the vaccine was introduced part-way through the year. For instance, regardless of whether the the vaccine was introduced in January 2009 or November 2009, the observed dot for 2009 would be to the right of the line.
plots$groups[["ec 2-23m A"]]$pred_full_agg Finally, we can look at the cumulative cases prevented. In this example, there have been 445 cases prevented (95%CrI: 58, 931) from the time of vaccine introduction to the last day month of the study period. This is calculated by takin the difference between the observed and fitted number of cases in each month, and summing them. If atleast 1 control disease is identified from the synthetic controls model, then the result here is drawn from that model, otherwise, it is drawn from the STL+PCA model.
plots$groups[["ec 2-23m A"]]$cumsum_prevented Printing plots for all models and age groups
We might instead want to print everything for all age groups and models. We can use the following code to do that.
Plot Observed vs expected yearly time series
par(mfrow=c(4,1))
for (group in names(plots$groups)) {
print(plots$groups[[group]]$pred_full_agg )
print(plots$groups[[group]]$pred_best_agg )
print(plots$groups[[group]]$pred_time_agg )
print(plots$groups[[group]]$pred_pca_agg )
}
NULL
NULL