Tutorial on the evaluation of vaccines with administrative data data

A companion to the WHO reference guide

##Getting started Before starting, you should install R and Rstudio on your computer. R can be downloaded from here: https://cran.r-project.org/mirrors.html. and RStudio can be downloaded from https://rstudio.com/products/rstudio/. Both are free. Once both programs are installed, open RStudio, and either open the .Rmd file practical exercises.Rmd or create a new RMarkdown document File/New File/R Notebook.

Load the libraries we need for this exercise. The first time you do this, run the code that is commented off with the install.packages function. You should only have to do this once.

#install.packages(c('lubridate', 'RCurl','devtools','xtable','knitr','htmlTable','coda','rmdformats','httr'))
#library(devtools) 
#devtools::install_github('https://github.com/weinbergerlab/InterventionEvaluatR')
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(RCurl)
library(knitr)
library(htmlTable)
library(InterventionEvaluatR)
library(coda)
library(rmdformats)
library(httr)
library(xtable)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(ggplot2)

Introduction to the tutorial

In this tutorial, you will learn about: 1. The types of administrative data 2. The minimum requirement of the data for performing a credible analysis 3. How to extract relevant information from these databases 4. How to format the data into time series 5. How to perform visual quality checks 6. How to fit models to the data to evaluate vaccine impact 7. How to interpret and describe the results

This tutorial is designed for individuals without extensive experience in R, but if you want to learn about the R software, which underlies these anayses, you can learn some basics in this other tutorial: https://weinbergerlab.shinyapps.io/SurveillanceIntroR/. In this tutorial, we will format and analyze publicly-available data from Ecuador. these data for 2001-2015 were downloaded from the Instituto Nacional de Estadisticas y Censos https://www.ecuadorencifras.gob.ec/defunciones-generales-y-fetales-bases-de-datos/

The key variables we will work with are: -Cause: The main cause of death (‘causa basica’), with the first 3-digits of the ICD10 code -date: formatted as YYYY-MM-01 -age.years age in years

This dataset has all deaths for children <59 months of age.

PCV7 in Aug-2010 PCV10 (2+1) in Feb-2011 PCV10 (3+0) in Feb-2014

The data are stored as an R dataset.

Read in and explore data

CHANGE THIS TO DOWNLOAD FROM GITHUB

githubURL <- ("https://raw.githubusercontent.com/DanWeinberger/who-guidance-materials/master/Practical%20exercises/ecuador%20raw/u5.deaths.2001.2015.rds")

download.file(githubURL,"ecuador_raw.rds", method="curl") #downloads file to working directory

d1 <- readRDS("ecuador_raw.rds") #imports file into R and saves as object 'd1'

Look at the first few lines

head(d1)
##          date  age.years cause
## 17 2001-01-01 0.58630137   Q89
## 26 2001-01-01 1.08767123   A09
## 37 2001-01-01 0.08493151   P23
## 61 2001-01-01 0.58630137   J18
## 71 2001-01-01 0.58630137   J96
## 80 2001-02-01 0.25205479   J18

Make sure R knows that the ‘date’ variable is a date. We tell it tha the date has a 4 digit year (%Y), a 2 digit month (%m), and 2 digit day (%d), separated by dashes.

d1$date<-as.Date(d1$date,"%Y-%m-%d")

Let’s also make sure that cause is stored as a character variable

d1$cause<-as.character(d1$cause)

Basic explorations. What is the distibution of ages? of Dates? (make a histogram for each)

hist(d1$age.years)

hist(d1$date, breaks=10)

This shows there are some erroneous dates. There shouldn’t be any here before Jan 2001. Filter it here

d1<-d1[d1$date>=as.Date('2001-01-01'),]
hist(d1$date, breaks=10)

Which codes are the most commonly used in this database?

Make a table of the 50 most common codes

## 
##  J18  P07  P36  P22  A09  Q24  P23  R07  R99  P20  P24  R50  Q89  E43  J20  P28 
## 6844 6650 3035 2605 2516 2137 1501 1455 1388 1355 1339 1338 1238 1188 1125 1115 
##  A41  R09  P21  W74  V09  X59  R63  R05  J96  A04  Q03  P29  P39  V89  P77  Q21 
## 1028  851  827  713  695  625  603  578  557  530  527  503  498  482  442  427 
##  W84  E46  G80  Q90  I46  I50  Q79  R95  P02  Q25  W80  P96  G03  P52  R11  Q20 
##  414  403  402  399  395  394  389  385  374  371  360  357  350  336  332  321 
##  E86  J69 
##  318  305

This shows that J18 (pneumonia, cause unspecified) is the most commonly-coded cause of death. P07 is a perinatal code (disorders of neborns related to short gestation and low birthweight, not otherwise specified). A09 is infectious gastroenteritis and colitis.

The ICD10 codes are organized into broad ‘chapters’ of related conditions, roughly based on organ system. The first letter of the ICD code indicates the chapter Let’s look at the frequency of the ICD10 chapters to see bigger-picture patterns.

chapter.prim<-substr(d1$cause,1,1)
sort(table(chapter.prim),decreasing=T)
## chapter.prim
##     P     J     R     Q     A     W     E     G     I     V     X     C     D 
## 23620  9979  8316  7961  4468  2515  2455  1763  1722  1338  1240   959   697 
##     K     Y     B     N     O     M     L     S     H     F           j 
##   679   438   365   365   115    20    12    10     9     4     2     1

This shows that the perinatal codes (P) are most common, followed by respiratory codes (J), and poorly-specified causes (included in R chapter)

Create variables that we need for analysis

###Pneumonia

#Initialize variables
d1$j12_j18<-rep(0, nrow(d1))
d1$j12_j18[d1$cause %in% c('J12', 'J13', 'J14', 'J15', 'J16', 'J17', 'J18') ]<-1
table(d1$j12_j18)
## 
##     0     1 
## 62760  7037

Let’s aggregate now by date

d2 <- d1 %>%
  group_by(date) %>%
  summarize('j12_j18'=sum(j12_j18)) %>%
  ungroup()

Plot your time series

p1 <- ggplot(d2, aes(x=date, y=j12_j18))+
  geom_line() +  #line plot
  ylim(0,NA) + #extend yaxis to 0
  theme_classic() #make it look nice

p1
## Warning: Removed 1 row(s) containing missing values (geom_path).

Create some age groups and stratify the data

0 if <2 mo 1 if 2-11 mo 2 if 12-23 mo 3 if 24-59 mo

Create the variable:

d1$agec <- NA
d1$agec[d1$age.years>=0 & d1$age.years<2/12] <-0
d1$agec[d1$age.years>=2/12 & d1$age.years<1] <-1
d1$agec[d1$age.years>=1 & d1$age.years<2] <-2
d1$agec[d1$age.years>=2 & d1$age.years<5] <-3

d1$agec <- factor(d1$agec,  labels=c('<2m','2-11m','12-23m','24-59m'))

d1 <- d1[!is.na(d1$agec),]
d4 <- d1 %>%
  group_by(agec,date) %>%
  summarize('j12_j18'=sum(j12_j18)) %>%
  ungroup()
## `summarise()` has grouped output by 'agec'. You can override using the `.groups`
## argument.
p2 <- ggplot(d4, aes(x=date, y=j12_j18)) +
  geom_line() +
  theme_classic() +
  ylim(0,NA) +
  facet_wrap(~agec, scales='free')

p2

Extract some control variables

  1. Flag any records that has a potential pneumococcal-related code anywhere in the record
# possible_pneumo_code: any pneumo code anywhere
d1$possible_pneumo_code <- 0
d1$possible_pneumo_code[d1$cause %in% c("A40","A49","B953","R652","H10","H65",                                                                   "H66","G00","G01","G02","G03","G04")] <-1

A00-B99 primary cause of death:

d1$A00_B99_prim <- ifelse(c(substr(d1$cause, 1, 1) %in% c("A","B")), 1, 0)
d1$A00_B99_prim <- ifelse(c(substr(d1$cause, 1, 2) =="A0" | # Exclude A00-A09 (rotavax)
                              d1$possible_pneumo_code==1), 0, d1$A00_B99_prim) 

Check your work:

table(d1$A00_B99_prim, d1$possible_pneumo_code) # Good.
##    
##         0     1
##   0 66634   748
##   1  1671     0

Also extract some more specific sub-chapter codes

table(d1$cause[substr(d1$cause, 1, 2)=="A1"])
## 
## A15 A16 A17 A18 A19 
##   1  43  12   2   6
d1$A15_A19_prim <- ifelse(c(substr(d1$cause, 1, 2) =="A1"), 1, 0)

d1$A15_A19_prim <- ifelse(d1$possible_pneumo_code==1, 0, d1$A15_A19_prim) 

table(d1$cause[substr(d1$cause, 1, 2)=="A1"], d1$possible_pneumo_code[substr(d1$cause, 1, 2)=="A1"])
##      
##        0
##   A15  1
##   A16 43
##   A17 12
##   A18  2
##   A19  6

Check your work

table(d1$A15_A19_prim, d1$possible_pneumo_code)
##    
##         0     1
##   0 68241   748
##   1    64     0

And let’s add in a few more controls

d1$p_prim<- 1*(substr(d1$cause,1,1)=='P') #logical test for whether code starts with P
d1$p_prim[d1$possible_pneumo_code==1]<-0 #don't count of pneumococcal code is present

d1$q_prim<- 1*(substr(d1$cause,1,1)=='Q') #logical test for whether code starts with P
d1$q_prim[d1$possible_pneumo_code==1]<-0 #don't count of pneumococcal code is present

d1$w_prim<- 1*(substr(d1$cause,1,1)=='W') #logical test for whether code starts with w
d1$w_prim[d1$possible_pneumo_code==1]<-0 #don't count of pneumococcal code is present

d1$g_prim<- 1*(substr(d1$cause,1,1)=='G') #logical test for whether code starts with P
d1$g_prim[d1$possible_pneumo_code==1]<-0 #don't count of pneumococcal code is present

Now aggregate the outcome and the controls

One important step here is to make sure that every time step is represented in the data, even if there are no cases. We do this using tidyr::complete

d2 <- d1 %>%
  group_by(agec, date) %>%
  summarize('j12_j18'=sum(j12_j18) ,
            'A15_A19_prim'=sum(A15_A19_prim),
            'A00_B99_prim'=sum(A00_B99_prim),
            'p_prim'=sum(p_prim),
            'q_prim'=sum(q_prim),
            'w_prim'=sum(w_prim),
            'g_prim'=sum(g_prim)
            ) %>%
    tidyr::complete(date=seq.Date(min(date, na.rm=T), max(date, na.rm=T), 'month'), 
                    fill=list(j12_j18=0,                                                                                              A15_A19_prim=0,
                              A00_B99_prim=0,
                              p_prim=0,
                              q_prim=0,
                              w_prim=0,
                              g_prim=0
                              )) %>%
  ungroup()
## `summarise()` has grouped output by 'agec'. You can override using the `.groups`
## argument.

Set up for analysis

###Troubleshooting notes If the analyses here are not working or giving an error, 90% of the time it is because either your date variable is not in the correct format or because the data are not correctly sorted. Use as.Date() to tell R that the date variable is a date. Check the variable types using str(ds) And the data should be sorted by stratum (e.g., age group) and then by date.

###Load the data We are going to switch and start using a pre-formatted time series from Ecuador that has a larger number of control variables and has been more extensively cleaned. It was created using similar steps to the ones used above.

Let’s load the data, which are included with the InterventionEvaluatR package

    data(ecuador_mortality, package = "InterventionEvaluatR") #load the data

    head(ecuador_mortality[,1:5]) #View first few rows and columns
##        age_group  monthdate J12_J18_prim acm_noj_prim A00_B99_prim
## 11776 ec 2-23m A 2005-01-01           39          146            4
## 11777 ec 2-23m A 2005-02-01           43          128            4
## 11778 ec 2-23m A 2005-03-01           43          115            7
## 11779 ec 2-23m A 2005-04-01           37          138            4
## 11780 ec 2-23m A 2005-05-01           53          159            6
## 11781 ec 2-23m A 2005-06-01           29          128            9
    ds<-ecuador_mortality

Ensure your date variable is in an R date format. If your variable is in a character or factor format, you need to tell R the format. – %m is a 2 digit month; %b is a month abbreviation (ie Jan, Feb) – %d is a 2 digit day (0-31), – %Y is a 4 digit year (e.g. 2011), %y is a 2 digit year (e.g. 11).
These codes are separated by a dash or slash or space. Modify the tryFormats script below if needed to match the format in your dataset

ds$monthdate<-as.Date(ds$monthdate, tryFormats=c('%Y-%m-%d',
                                                    '%m-%d-%Y',
                                                    '%m/%d/%Y',
                                                    '%Y/%m/%d',
                                                    '%d/%m/%Y'
                                                    ) )

Check the variable types

str(ds)
## 'data.frame':    432 obs. of  198 variables:
##  $ age_group   : Factor w/ 3 levels "ec 2-23m A","ec 2-59m A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ monthdate   : Date, format: "2005-01-01" "2005-02-01" ...
##  $ J12_J18_prim: int  39 43 43 37 53 29 26 30 22 20 ...
##  $ acm_noj_prim: int  146 128 115 138 159 128 110 122 100 113 ...
##  $ A00_B99_prim: int  4 4 7 4 6 9 3 3 5 2 ...
##  $ A00_A09_prim: int  24 14 12 23 21 16 30 18 14 9 ...
##  $ A15_A19_prim: int  0 0 1 2 0 0 0 1 0 0 ...
##  $ A20_A28_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ A30_49_prim : int  4 1 3 0 4 7 1 1 3 1 ...
##  $ A50_A64_prim: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ A65_A69_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ A70_A74_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ A75_A79_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ A80_A89_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ A09_A99_prim: int  0 0 2 1 1 0 1 0 0 0 ...
##  $ B00_B09_prim: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ B15_B19_prim: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ B20_B24_prim: int  0 1 0 0 1 0 1 1 0 0 ...
##  $ B25_B34_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B35_B49_prim: int  0 0 0 0 0 1 0 0 2 0 ...
##  $ B50_B64_prim: int  0 1 1 1 0 0 0 0 0 0 ...
##  $ B65_B83_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B85_B89_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B90_B94_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B95_B98_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ B99_B99_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ C00_D48_prim: int  7 8 1 3 3 5 3 3 2 3 ...
##  $ C00_C97_prim: int  2 4 1 0 1 2 0 2 1 2 ...
##  $ D00_D09_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ D10_D36_prim: int  0 0 0 0 0 0 0 0 1 0 ...
##  $ D37_D48_prim: int  1 0 0 0 0 0 0 0 0 0 ...
##  $ D50_D53_prim: int  1 1 0 0 0 0 2 0 0 1 ...
##  $ D55_D59_prim: int  0 0 0 0 0 1 0 0 0 0 ...
##  $ D60_D64_prim: int  1 3 0 1 1 0 0 1 0 0 ...
##  $ D65_D69_prim: int  2 0 0 2 1 2 1 0 0 0 ...
##  $ D70_D77_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ D80_D89_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ E00_E90_prim: int  17 17 15 17 13 8 17 10 11 11 ...
##  $ E00_E07_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ E10_E14_prim: int  1 0 0 0 0 0 0 1 0 0 ...
##  $ E15_E16_prim: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ E20_E35_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ E40_E46_prim: int  11 14 14 15 12 6 13 8 9 8 ...
##  $ E50_E64_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ E65_E68_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ E70_E90_prim: int  5 3 0 2 1 2 4 1 2 3 ...
##  $ F00_F99_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G00_G99_prim: int  3 3 5 0 2 3 1 6 2 3 ...
##  $ G00_G09_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G10_G14_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G20_G26_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G30_G32_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G35_G37_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G40_G47_prim: int  0 2 3 0 0 0 0 1 1 0 ...
##  $ G50_G59_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G60_G64_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G70_G73_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ G80_G83_prim: int  0 1 0 0 1 3 0 2 1 2 ...
##  $ G90_G99_prim: int  3 0 2 0 1 0 1 3 0 1 ...
##  $ H00_H59_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H60_H95_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H00_H06_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H10_H13_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H15_H22_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H25_H28_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H30_H36_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H40_H42_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H43_H45_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H46_H48_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H49_H52_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H53_H54_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H55_H59_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H60_H62_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H65_H75_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H80_H83_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ H90_H95_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ I00_I99_prim: int  6 9 4 5 3 3 2 8 6 10 ...
##  $ I00_I02_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ I05_I09_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ I10_I15_prim: int  1 0 0 0 0 1 0 1 0 0 ...
##  $ I20_I25_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ I26_I28_prim: int  0 1 0 0 0 0 0 0 0 2 ...
##  $ I30_I52_prim: int  5 6 4 5 2 2 1 7 6 8 ...
##  $ I60_I69_prim: int  0 1 0 0 1 0 1 0 0 0 ...
##  $ I70_I79_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ I80_I89_prim: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ I95_I99_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ K00_K99_prim: int  1 2 0 3 2 0 0 2 0 6 ...
##  $ K00_K14_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ K20_K31_prim: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ K35_K38_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ K40_K46_prim: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ K50_K52_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ K55_K63_prim: int  1 1 0 2 0 0 0 0 0 1 ...
##  $ K65_K67_prim: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ K70_K77_prim: int  0 0 0 1 1 0 0 1 0 1 ...
##  $ K80_K87_prim: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ K90_K93_prim: int  0 0 0 0 1 0 0 1 0 1 ...
##  $ L00_L99_prim: int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

Set parameters for analysis

Here we need to set a few parameters. We use the evaluatr.init() function to specify the name of the dataset, the date at which the vaccine is introduced, the date at which we want to begin evaluating the vaccine (typically 1-2 year after vaccine introduction). We also provide some information on the dataset, sch as whether the data are monthly or quarterly (n_seasons), the variable names for the grouping variable, the date variable, the outcome variable, and the denominator variable (if any). You can also set the number of interations for the MCMC. the default is to use a burn-in period of 5000 iterations and to sample 10,000 iterations afterthe burn in. This is a decent place to start. After evaluating model convergence (see below), you might want to increase the burn-in period.

analysis <- evaluatr.init(
  country = "Ecuador", data = ds,
  post_period_start = "2010-08-01", #First 'post-intervention' month is August 2010
  eval_period_start = "2011-08-01", #We ignore first 12 month of data to allow for vaccine ramp up
  eval_period_end = "2015-12-01", #The evaluation period lasts 4 years
  n_seasons = 12, #This is monthly data, so select 12
  year_def = "cal_year", # we are in southern hemisphere, so aggregate results by calendar year (Jan-Dec)
  group_name = "age_group",  #Strata categry name
  date_name = "monthdate", #Date variable name
  outcome_name = "J12_J18_prim", #Outcome variable name
  denom_name = "acm_noj_prim" , #Denominator variable name
  log.covars=TRUE, #log-transform the covariates
  set.burnN=5000,
  set.sampleN=10000
)
set.seed(1)

Sort data by age group and month

    ds<-ds[order(ds[, analysis$group_name], ds[,analysis$date_name]),] #Sort data by age group and month

Run analyses

This function runs Interrupted time series analysis, an alternative time trend analysis, synthetic controls, and STL+PCA. All of these results are saved in the object ‘impact_results’.

Run a simple control analysis, controlling for 1 control variable at a time

Before getting into more complicated analyses, we will first try to fit a simple Poisson regression model (with overdispersion) where we adjust for seasonality and 1 control variable at a time. These models are fit to data from the pre-vaccine periods and extrapolated to the post-vaccine period. Fitting 1 control variable at a time allows us to see how the use of different controls influences the results. This take a few minutes, so we will use a pre-run set of results.

 glmer_results= evaluatr.univariate(analysis)

Then plot the results. The results are ordered by goodness of fit (based on AIC scores), with best fitting covariates on top. Each plot represents a different age group. Overall, we see a generally consistent pattern. The use of the subchapter R00-09 as a control variable leads to estimates that are closer to 1 (no effect). This subchapter is “Symptoms and signs involving the circulatory and respiratory systems”. These are often considered non-specific ‘junk’ codes. There could be arguments for or against using this subchapter as a control. On the downside, it is possible that actual pneumonia deaths incorrectly were assigned a code of R00-99, and the vaccine could therefore reduce the incidence of R00-09 codes and bias the estimates towards no effect. On the upside, the use of these junk codes as a control could help to adjust for underlying improvements or changes in coding quality.

par(mar=c(4,5,1,1)) #fix margins
group.labels<-as.character(unique(analysis$input_data[,analysis$group_name]))
lapply(glmer_results,evaluatr.univariate.plot)
## [[1]]

## 
## [[2]]

Run Synthetic control analysis

For teaching purposes, this code has been pre-run since it takes some time and computational resources.

impact_results = evaluatr.impact(analysis)
saveRDS(impact_results,'./Results/ec_sc.rds')

Were there any strata that were too sparse to analyze?

Compare estimates from different models

This shows the estimated rate ratio and 95% credible intervals from a synthetic controls analysis; a time-trend analysis where we used the specified denominator (all non-respiratory deaths) to adjust the number of pneumonia deaths in each month and a linear trend for time; a classic interrupted time series analysis (segmented regression); and the STL+PCA approach, which smooths and combines the control variables prior to including them in the model.

results.table<- cbind.data.frame(
  impact_results$full$rr_mean_intervals, 
  impact_results$time$rr_mean_intervals, 
  impact_results$its$rr_mean_intervals, 
  impact_results$pca$rr_mean_intervals)

  table<-xtable(results.table)
  
  htmlTable(table)
Synthetic controls Estimate (95% CI) Time trend Estimate (95% CI) Classic ITS (95% CI) STL+PCA Estimate (95% CI)
ec 2-23m A 0.8 (0.62, 1.06) 0.83 (0.66, 1.06) 0.93 (0.75, 1.14) 1.05 (0.81, 1.35)
ec 24-59m A 0.67 (0.58, 0.78) 0.84 (0.57, 1.24) 0.71 (0.49, 1.02) 0.73 (0.39, 1.36)

Cases averted

How many cases were prevented from the time of vaccine introduction to the last time point in each stratum (+/- 95% CrI)? You can modify the number ‘last.point’ to pull out the cumulative number of cases at any point in the time series. In this case we are printing results fromthe SC model

last.point<-dim(impact_results$full$cumsum_prevented)[1]
cum.prevented<-impact_results$full$cumsum_prevented[last.point,,]

Format and print table

cum1<- round(t(cum.prevented))
cum2<- paste0(cum1[,'50%'], ' (', cum1[,'2.5%'],', ',cum1[,'97.5%'],')')
cum3<-cbind.data.frame(row.names(cum1), cum2)
names(cum3)<-c('Stratum','Cases Averted (95% CrI)')
  htmlTable(cum3, align='l')
Stratum Cases Averted (95% CrI)
1 ec 2-23m A 309 (-20, 718)
2 ec 24-59m A 162 (96, 235)

Number of variables selected in SC analysis

The Synthetic controls analysis uses a Bayesian variable selection algorithm to weight the candidate covariates. In each MCMC iteration, it tests a different combination of variables. The model size indicates how many variables are selected in any given model. If <1 variable is selected on average, this indicates that no suitable control variables were identified. In this example 1-2 variables were selected on average in the 2-23m and 2-59 m age categories, while no controls were identified in the 24-59m age group (the average model size is <1 (0.44)).

model_size = data.frame(t(analysis$model_size))
htmlTable(model_size, align='c')
ec.2.23m.A ec.24.59m.A
1 2.05 0.43

Inclusion Probabilities

incl_probs <- NULL
for (group in analysis$groups) {
  incl_prob <- impact_results$full$groups[[group]]$inclusion_probs[-c(1:(analysis$n_seasons - 1)), ]
  incl_prob <- incl_prob[order(-incl_prob$inclusion_probs), ]
  incl_prob <- incl_prob[c(1:3), ]
  incl_prob2 <- round(incl_prob[, 2],2)
  incl_prob_names <- incl_prob[, 1]
  incl_prob3 <- data.frame("Group" = group, "Greatest Inclusion Variable" = incl_prob_names[1], "Greatest Inclusion Probability" = incl_prob2[1], "Second Greatest Inclusion Variable" = incl_prob_names[2], "Second Greatest Inclusion Probability" = incl_prob2[2], "Third Greatest Inclusion Variable" = incl_prob_names[3], "Third Greatest Inclusion Probability" = incl_prob2[3], check.names = FALSE)
  incl_probs <- rbind(incl_probs, incl_prob3)
}
rownames(incl_probs) <- NULL

Here we can see which variables received the most weight in the models for each strata. Values range from 0-1, with values closer to 1 indicating that the variable was included in a larger proportion of the variable combinations that were tested A value of 1 would indicate that the variable was always included in the model, a value of 0.48 would indicate the variable was included in 48% of the models that were tested.

htmlTable(incl_probs)
Group Greatest Inclusion Variable Greatest Inclusion Probability Second Greatest Inclusion Variable Second Greatest Inclusion Probability Third Greatest Inclusion Variable Third Greatest Inclusion Probability
1 ec 2-23m A P00_P96_prim 0.52 R00_R09_prim 0.34 A00_A09_prim 0.33
2 ec 24-59m A acm_noj_prim 0.1 R00_R99_prim 0.1 V01_X59_prim 0.09

###Sensitivity analysis In the sensitivity analyses, we re-run the synthetic controls analysis, but sequentially drop the top 1,2, or 3 variables that were most influential (received the most weight) in the original analysis. This allows us to evaluate whether the results are sensitive to the inclusion of a specific variable

Load the pre-run results

The purpose of this analysis is to evaluate the sensitivity of the synthetic controls results to the inclusion of specific control variables. The model is first fit with all variables, then the top-weighted variable is excluded, and the model is re-run. If the results change drastically after dropping 1 variable, this indicates that that variable is highly influential on the results. If the results don’t change, this indicates that the results are robust to the exclusion of that variable (this assumes that the variable that is excluded received a decent amount of weight prior to exclusion)

if (exists("sensitivity_results")) {
  htmlTable(sensitivity_results$sensitivity_table_intervals)
}
Estimate (95% CI) Top Control 1 Inclusion Probability of Control 1 Control 1 Estimate (95% CI) Top Control 2 Inclusion Probability of Control 2 Control 2 Estimate (95% CI) Top Control 3 Inclusion Probability of Control 3 Control 3 Estimate (95% CI)
ec 2-23m A 0.8 (0.62, 1.06) P00_P96_prim 0.52 0.8 (0.6, 1.07) R00_R09_prim 0.34 0.75 (0.59, 0.93) A00_A09_prim 0.33 0.68 (0.58, 0.85)
ec 24-59m A 0.67 (0.58, 0.78) acm_noj_prim 0.1 0.67 (0.58, 0.77) R00_R99_prim 0.1 0.66 (0.58, 0.76) V01_X59_prim 0.09 0.67 (0.59, 0.77)

Generate and save the plots

Plot the results for 1 age group

First look at the results from the synthetic controls model for 1 age group.

This first plot shows a monthly time series, with observed, fitted, and counterfacual values. The observed number of deaths is shown in the black line. The fitted values for the pre-vaccine period are shown in the red dotted line, and the counterfactual estimate with its 95% credible interval is shown as the white dotted line and gray shaded area. if the black line is below the gray shaded area, this would indicate that obsrved cases are lower than expected based on changes in the control diseases in the post-vaccine period. If the controls appropriately adjust for underlying trends, then this would reflect an effect of the vaccine.

      plots$groups[["ec 2-23m A"]]$pred_full 

It is sometimes easier to look at the results if we aggregate the observed and expected values up to the annual time scale. Here the observed values are shown as black dots. When the black dots go below the gray shaded area, this indicates that the observed cases are lower than expected based on changes in the control diseases in the post-vaccine period. If the controls appropriately adjust for underlying trends, then this would reflect an effect of the vaccine. The vertical line indicates the date of vaccine introduction. The year when the vaccine is introduced is included to the right of the line, even if the vaccine was introduced part-way through the year. For instance, regardless of whether the the vaccine was introduced in January 2009 or November 2009, the observed dot for 2009 would be to the right of the line.

      plots$groups[["ec 2-23m A"]]$pred_full_agg 

Finally, we can look at the cumulative cases prevented. In this example, there have been 445 cases prevented (95%CrI: 58, 931) from the time of vaccine introduction to the last day month of the study period. This is calculated by takin the difference between the observed and fitted number of cases in each month, and summing them. If atleast 1 control disease is identified from the synthetic controls model, then the result here is drawn from that model, otherwise, it is drawn from the STL+PCA model.

      plots$groups[["ec 2-23m A"]]$cumsum_prevented 

Printing plots for all models and age groups

We instead might want to just print everything for all age groups and models. We can use the following code to do that

Plot Observed vs expected yearly time series

      par(mfrow=c(4,1))

for (group in names(plots$groups)) {
      print(plots$groups[[group]]$pred_full_agg )
      print(plots$groups[[group]]$pred_best_agg )
      print(plots$groups[[group]]$pred_time_agg )
      print(plots$groups[[group]]$pred_pca_agg )
}

NULL NULL

Save results