The reconstruction of the individual transmissions events that led to an outbreak gives valuable information on the causes of spread of an infectious disease. In recent years, methods have been developed to infer transmission trees from onset dates and genetic sequences. Nevertheless, sequences can be uninformative for pathogens with slow mutation rates, or sequencing can be scarce. We developed the package o2geosocial to integrate more variables from routinely collected surveillance data to reconstruct transmission trees when genetic sequences are not informative. The model introduced in o2geosocial takes the reported age-group, onset date, location and genotype of infected cases to infer probabilistic transmission trees. In this vignette, we describe the structure and the different functions of the package. We also provide a tutorial on a simulated measles outbreak showing how to run a model, evaluate the output and visualise the results of the inference. In the second part of the tutorial, we customise the likelihood functions to introduce a different mobility model.
The risk of outbreak in a given region is higher if there is low immunity against the virus in the population. Regional immunity against infectious diseases is built up by past infections and, if a vaccine is available, vaccination campaigns. Social and spatial heterogeneity in disease incidence or vaccine coverage lead to under-immunised areas, also called pockets of susceptibles. Importation of cases into these areas can cause large transmission clusters and long-lasting outbreaks. The most vulnerable areas of a country could be identified using historical data on local vaccine coverage and incidence, but these data are often scarce, or unreliable. Another solution is to infer probabilistic transmission trees and clusters to identify in which regions importations repeatedly caused large clusters of local transmission.
The Wallinga-Teunis method was developed to infer probabilistic transmission trees from onset dates, serial intervals and latent periods in a maximum likelihood framework. As genetic sequencing of pathogens during an outbreak became more common, new tools such as the R package outbreaker2 showed that combining the timing of infection and the genetic sequences could improve the accuracy of inferred transmission trees. Nevertheless, sequencing pathogens remains costly, and the efficacy of the reconstruction methods depends on the proportion of sequenced cases, the quality of the sequences, and on the characteristics of the pathogen. For instance, the measles virus evolves very slowly, so sequences from unrelated cases can be very similar, which makes methods combining onset dates and genetic sequences ineffective.
Building upon the framework presented in outbreaker2, we developed the R package o2geosocial to estimate the cluster size distribution from the onset date, age, location and genotype of the cases. Those variables are routinely collected by surveillance systems and often well reported. Using age-stratified contact matrices and mobility models, we combined the different variables into a likelihood of connection between cases. The package o2geosocial is ideal to study outbreaks where sequences are uninformative, either because only a small proportion of cases were sequenced or because the virus evolves too slowly. In this vignette, we first introduce the overall structure of o2geosocial, the components of the individual likelihood and the parameters; then we present a tutorial on how to develop a model to reconstruct the cluster size distribution with o2geosocial.
The general implementation of o2geosocial follows the
structure of outbreaker2
and incorporates C++ functions into a R framework using Rcpp. The
main function of the package, called outbreaker()
, uses
Monte Carlo Markov Chains (MCMC) to sample from the posterior
distribution. For each case, it infers the infection date, the infector,
and the number of missing generations between the case and their
infector. It takes five lists as inputs: i) ‘moves’, ii) ‘likelihoods’,
iii) ‘priors’, iv) ‘data’, and v) ‘config’. These five lists can be
generated and modified using the functions custom_moves()
,
custom_likelihoods()
, custom_priors()
,
create_config()
and outbreaker_data()
.
Examples of how these functions are used to customize the model will be
presented in the Tutorial section.
The package o2geosocial was developed for the study of outbreaks where full information on the onset date, location and age group of the cases is available. It builds upon outbreaker2 by adding the effect of the location and the age-stratified contact data on the reconstruction of the transmission clusters. However, unlike outbreaker2, o2geosocial does not take genetic sequences as input. Genetic information (or genotype) is only used to exclude connections between cases, i.e. two cases cannot be from the same cluster if they belong to different genetic groups. Therefore, o2geosocial is adapted to reconstructing transmission clusters from large datasets (up to several thousand cases) where genetic sequencing is not informative, either because the mutation rate of the virus is slow, or because sequencing is scarce. On the other hand, outbreaker2 is more adapted to reconstructing transmission chains if genetic sequences are informative and available for a high proportion of the cases.
In o2geosocial the genetic component of the likelihood is a binary value depending on the genotype of the cases. There can be only one genotype reported per transmission tree, hence the number of trees will be at least equal to the number of unique genotypes reported in the dataset. The genotype of a tree T is the genotype reported for at least one of the cases belonging to T. For each genotyped case igen and at every iteration, only cases from trees with the same genotype as igen, or without reported genotype should be listed as potential infectors. Therefore, ungenotyped cases belonging to a tree with a different genotype will not be potential infectors of igen at this iteration. Ungenotyped cases can be placed on any tree.
Routinely collected surveillance data may include thousands of cases
from unrelated outbreaks. Therefore, it was crucial to develop the
algorithm to be computationally efficient. We added a pre-clustering
step to the function outbreaker()
to reduce the pool of
potential infectors for each case, which can greatly reduce the
computing time of the MCMC (Figure 1). In the pre-clustering step, the
potential infectors for each case are listed. Cases with overlapping
potential infectors, and their potential infectors, are grouped
together. Cases from different groups cannot infect one another. The
group each case belongs to is listed in the variable ‘is_cluster’, which
is an element of the list returned by the function
outbreaker_data()
. We used three criteria to group together
cases that may be connected: For each case i, only cases reported in a radius
of γ km, less than δ days before i, and from similar or unreported
genotype can be classified as potential infectors of i. The parameter γ and δ are defined as inputs in the
function create_config()
.
Cases classified in the same group after the pre-clustering stage may
still be unrelated (e.g. if several importations were reported
simultaneously in a nearby region). Because of this, we defined a
likelihood threshold λ which
allows connections to be discarded as unlikely. If the likelihood of
connection between cases i and
j is less than λ, we consider it to be more likely
that i was an import and
unrelated to j. In
o2geosocial, the variable λ can either be an absolute (the
log-likelihood threshold will be log(λ))
or relative value (a quantile of the likelihood of all connections in
all samples), and is defined by the variables ‘outlier_threshold’ and
‘outlier_relative’ in create_config()
. If the initial
number of importations is too small, unlikely connections between cases
can alter the inferred infection dates of cases and bias the generated
transmission trees. Therefore, we first run a short MCMC and compute
n, the minimum number of
connections with a likelihood lower than λ in the samples; then we add n imports to the starting tree and
run a long MCMC. Finally, we remove the connections with likelihood
lower than λ in the final
samples and return the infector, infection date and probability of being
an import for each case (Figure 1).
The functions custom_likelihoods()
and
custom_priors()
can be used to edit each component of the
likelihood and priors.
The genetic component of the likelihood that a case i of genotype gi was infected by a case j belonging to the tree τj is defined as a binary value: $$G(g_{i},g_{\tau_{j}})= \left\{\begin{array}{l} 1\text{ if }g_{i}\text{ unknown}\\ 1\text{ if }g_{\tau_{j}}\text{ unknown }\\ 1\text{ if }g_{i}\text{ and }g_{\tau_{j}}\text{ both known and }g_i=g_{\tau_{j}} \\ 0\text{ otherwise} \end{array} \right. $$
Therefore, ungenotyped cases may not be potential infector if they belong to a tree that contains another genotype than gi. The algorithm must consider the genotype of each tree containing potential infectors.
As in the packages outbreaker and outbreaker2, we allow for missing cases in transmission chains. The number of generations between linked cases i and j κji is equal to one if j infected i. We define ρ as the conditional report ratio of the trees. The conditional report ratio is not the same as the overall report ratio of an outbreak because entirely unreported clusters, or missing cases before the ancestor of a tree will not change the value of ρ. Only unreported cases within transmission chains will impact the conditional report ratio. By default, the probability of observing κji missing generation between i and j from the conditional report ratio p(κji|ρ) follows an exponential distribution.
The conditional report ratio is estimated during the MCMC runs using
a beta distribution prior. The two parameters of the beta prior can be
changed using the variable prior_pi
in
create_config()
(default to Beta(10, 1)).
The time component of the likelihood is similar to what is used in
the default version of outbreaker2.
The probability of ti being the
infection date of the case i
reported at time Ti depends on
the distribution of the incubation period f. The incubation period should be
defined using the variable f_dens
in the function
outbreaker_data()
.
The probability that i was
infected by j given their
respective inferred dates of infection ti and tj is defined by
the generation time of the disease wκji(ti − tj)
(variable w_dens
in outbreaker_data()
), and
the number of generations κji
between i and j. The operator wκji
was defined as wκji = Πκjiw,
where Π is the convolution
operator.
In o2geosocial, we introduce a spatial component to the likelihood of connection between cases. By default, we implemented a gravity model to illustrate the probability of connection between two regions k and l. Gravity models depend on the population sizes mk and ml, and the distance between regions dkl. Given spatial parameters a and b, the probability that a case in the region k was infected by a case reported in l is s(k, l):
$$s(k,l)= \frac{p_{kl}}{\Sigma_hp_{hl}}=\frac{F(d_{kl}, b)* m_k^a*m_l^c}{\Sigma_h(F(d_{hl}, b)*m_h^a *m_l^c)}=\frac{F(d_{kl}, b)* m_k^a}{\Sigma_h(F(d_{hl}, b)*m_h^a)}$$
If we use an exponential gravity model, F(dkl, a) = e−b * dkl;
for a power-law gravity model: $F(d_{kl},a)=(\frac{1}{d_{kl}})^b$. The
exponential gravity model has been shown to perform well to represent
short-distance mobility patterns. As o2geosocial aims at
reconstructing transmission in a community or a region, the default
model in the function outbreaker()
is an exponential
gravity model. The type of gravity model can be changed by setting the
parameter spatial_method
to “power-law”:
create_config(spatial_method = "power_law")
. Other mobility
models can be used by developing modules. We give an example of module
in the tutorial where we replace the exponential gravity model by a
Stouffer’s rank model.
The parameters a and b are estimated during the MCMC run. This requires re-computing the probability matrix twice per iteration and can be time-consuming. Therefore, if either a or b is not fixed, we allow for a maximum of 1 missing generation between cases (max(κji) = 2) and only compute s1(k, l) and s2(k, l) for regions that could potentially be connected. By default, the prior distribution of a and b are uniform.
Social contact matrices provided by large scale quantitative
investigations such as the POLYMOD
study can be used to calculate the probability of contact
between cases of different age groups in different countries. These
social contact matrices are imported using the R package socialmixr.
In o2geosocial, given the age group of each case α(1, .., N) and
the age-stratified social contact matrix, we introduced aκji(αi, αj),
the probability that a case aged αj infected a
case aged αi. This
corresponds to the proportion of contacts to αi that came
from individuals of age αj. The contact
matrix can be modified using the variable a_dens
in
outbreaker_data()
.
The overall likelihood that a case i was infected by the case j is equal to Li(ti, j, tj, θ) = log(f(ti − Ti)) + Lji(ti, tj, θ) where Lji(ti, tj, θ) is the likelihood of connection between i and j and is defined as:
Lji(ti, tj, θ) = log(ρ(κji|ρ) * wκji(ti − tj) * aκji(αi, αj) * G(gi, gτj) * sκji(ri, rj|a, b))
At every iteration of the MCMC, a set of movements is used to propose
an update of the transmission trees. This update is then accepted or
rejected depending on the posterior density of the new trees. In
o2geosocial, eight default movements are tested at each
iteration. Three of them were already part of outbreaker2
and were not modified (cpp_move_t_inf()
to change the
infection date of cases; cpp_move_pi()
to change the
conditional report ratio; cpp_move_kappa()
to change the
number of generations between connected cases); two were edited to scan
of the transmissions trees to prevent from having different genotypes in
the same tree (cpp_move_alpha()
.
cpp_move_swap_cases()
); The last three are new movements
and were not part of outbreaker2:
cpp_move_a()
and cpp_move_b()
propose
new values of the spatial parameters a and b and compute the matrix of
probability of connection between regions.
cpp_move_ancestor()
: An ancestor is defined as the
first case of a transmission tree. As only non-ancestors are moved in
cpp_move_alpha()
, we added this function to ensure good
mixing of the ancestors of the transmission trees. For each ancestor
i, an index case is drawn from
the poll of potential infectors, while another link is randomly picked
and deleted. We then compare the new posterior value, and accept or
reject the change.
There are two simulated datasets attached to o2geosocial:
toy_outbreak_short
and toy_outbreak_long
. Both
are lists describing simulated outbreaks and include 3 elements:
i)cases
: a data table with the ID, location, onset date,
genotype, age group, import status, cluster, generation and infector of
each case; ii) dt_regions
: a data table with the ID,
population, longitude and latitude of each region;
iii)age_contact
: a numeric matrix of the proportion of
contact between age groups. Both simulations were ran using
distributions of the serial interval and the latent period typically
associated with measles outbreaks The dataset
toy_outbreak_long
contains 1940 cases simulated in the
United States between 2003 and 2016, it can be used to reproduce the
results described in https://github.com/alxsrobert/datapaperMO.
In this tutorial, we will analyse toy_outbreak_short
. We
will run a first model where the import status is inferred, and a second
model that takes the import status from the reference dataset and only
estimates the transmission trees. We will then evaluate the agreement
between the inferred and the reference transmission clusters, and
observe the added value of knowing the import status of the cases.
Finally, we will give insight into the interpretation of the results and
the geographical heterogeneity of the reconstructed transmission
dynamics.
We used the package data.table
for handling data through the tutorial as it is optimised to deal with
large datasets. The methods defined in o2geosocial
would
work similarly if we had used the data.frame
syntax and
format.
The results presented in this tutorial were generated using only 5,000 iterations to ensure a short run-time for the vignette. We recommend running longer chains to reach the convergence and sample from the posterior distribution, and using the package coda to evaluate the convergence of the MCMC chains.
The dataset contains 75 simulated cases from different census tracts
of Ohio in 2014 (variable cens_tract
). The variable
cluster
describes the transmission tree of each case, and
“generation” is equal to the number of generations between the first
case of the tree (generation = 1
) and the case. We import
o2geosocial and extract the dataset cases
from
toy_outbreak_short
.
library(o2geosocial)
## We used the data.table syntax throughout this example
library(data.table)
data("toy_outbreak_short")
print(toy_outbreak_short$cases, nrows = 10)
## ID State Date Genotype Cens_tract age_group import cluster
## <char> <char> <Date> <char> <char> <num> <lgcl> <num>
## 1: 112 Ohio 2014-01-01 B3 39005970100 6 TRUE 16
## 2: 75 Ohio 2014-01-06 D8 39139002400 11 TRUE 14
## 3: 116 Ohio 2014-01-12 B3 39101000400 11 TRUE 17
## 4: 113 Ohio 2014-01-13 B3 39005970100 6 FALSE 16
## 5: 145 Ohio 2014-01-13 D8 39117965300 8 TRUE 26
## ---
## 71: 55 Ohio 2014-05-27 B3 39147962500 9 FALSE 6
## 72: 129 Ohio 2014-05-30 G3 39139001500 3 TRUE 20
## 73: 142 Ohio 2014-05-31 B3 39101000600 4 TRUE 25
## 74: 143 Ohio 2014-06-10 B3 39101000200 4 FALSE 25
## 75: 144 Ohio 2014-06-12 B3 39101000501 13 FALSE 25
## generation infector_ID
## <num> <char>
## 1: 1 <NA>
## 2: 1 <NA>
## 3: 1 <NA>
## 4: 2 112
## 5: 1 <NA>
## ---
## 71: 3 54
## 72: 1 <NA>
## 73: 1 <NA>
## 74: 2 142
## 75: 2 142
First we plot the cluster size distribution of the reference data (Figure 2). 95% of the clusters contain less than 5 cases, 47.6% of the clusters are isolated (also called singletons). One larger cluster includes 31 cases (Figure 2).
# Reference cluster size distribution
hist(table(dt_cases$cluster), breaks = 0:max(table(dt_cases$cluster)),
ylab = "Number of clusters", xlab = "Cluster size", main = "", las=1)
We set up the distributions the model will use to reconstruct the
transmission trees. We define f_dens
as the duration of the
latent period, and w_dens
as the number of days between the
infection dates of two connected cases, also called the serial interval.
These distributions have previously been described for measles
outbreaks.
# Distribution of the latent period
f_dens <- dgamma(x = 1:100, scale = 0.43, shape = 26.83)
# Distribution of the generation time
w_dens <- dnorm(x = 1:100, mean = 11.7, sd = 2.0)
The age specific social contact patterns can be imported from the
element age_contact
of the list
toy_outbreak_short
. Alternatively, one can use the R
package socialmixr
to import a social contact matrix from the POLYMOD survey.
# From the list toy_outbreak_short
a_dens <- toy_outbreak_short$age_contact
# Alternatively, from POLYMOD:
polymod_matrix <-
t(socialmixr::contact_matrix(socialmixr::polymod,
countries="United Kingdom",
age.limits=seq(0, 70, by=5),
missing.contact.age="remove",
estimated.contact.age="mean",
missing.participant.age="remove")$matrix)
polymod_matrix <-data.table::as.data.table(polymod_matrix)
# Compute the proportion of connection to each age group
a_dens <- t(t(polymod_matrix)/colSums(polymod_matrix))
Finally, the distance matrix between regions is set up from the data
table dt_regions
, element of
toy_outbreak_short
. We use the column
population
to set up the population vector
pop_vect
. We compute the distance between each region into
the distance matrix dist_mat
using the package geosphere.
# Extract all regions in the territory
dt_regions <- toy_outbreak_short[["dt_regions"]]
# Extract the population vector
pop_vect <- dt_regions$population
# Create the matrices of coordinates for each region (one "from"; one "to")
mat_dist_from <- matrix(c(rep(dt_regions$long, nrow(dt_regions)),
rep(dt_regions$lat, nrow(dt_regions))), ncol = 2)
mat_dist_to <- matrix(c(rep(dt_regions$long, each = nrow(dt_regions)),
rep(dt_regions$lat, each = nrow(dt_regions))),
ncol = 2)
# Compute all the distances between the two matrices
all_dist <- geosphere::distGeo(mat_dist_from, mat_dist_to)
# Compile into a distance matrix
dist_mat <- matrix(all_dist/1000, nrow = nrow(dt_regions))
# Rename the matrix columns and rows, and the population vector
names(pop_vect) <- rownames(dist_mat) <- colnames(dist_mat) <-
dt_regions$region
Now that all the distributions and matrices are set up, we create the
lists data
, config
, moves
,
likelihoods
and priors
to run the main
function of the package. In this example, we use the default parameters
to build moves
, likelihoods
and
priors
with the same elements as described in section
“Implementation”. The list data
contains the distributions
f_dens
and w_dens
, the population vector and
the distance matrix, along with the onset dates, age group, location and
genotype of the cases.
We implement two different models: in out1
the import
status of the cases is inferred by the model, whereas in
out2
the import status is set before the MCMC. The first
short run in out1
is run with 1, 500 iterations to find the import status,
and the main run lasts for 5, 000
iterations in both models. As the import status of the cases is inferred
in out1
, we have to set a threshold to quantify what is an
unlikely likelihood of transmission between cases. We use a relative
outlier threshold at 0.9, which means that the threshold will be the
9th decile
of the negative log-likelihoods Li(ti, j, tj, θ)
in every sample.
# Default moves, likelihoods and priors
moves <- custom_moves()
likelihoods <- custom_likelihoods()
priors <- custom_priors()
# Data and config, model 1
data1 <- outbreaker_data(dates = dt_cases$Date, #Onset dates
age_group = dt_cases$age_group, #Age group
region = dt_cases$Cens_tract, #Location
genotype = dt_cases$Genotype, #Genotype
w_dens = w_dens, #Serial interval
f_dens = f_dens, #Latent period
a_dens = a_dens, #Age stratified contact matrix
population = pop_vect, #Population
distance = dist_mat #Distance matrix
)
config1 <- create_config(data = data1,
n_iter = 5000, #Iteration number: main run
n_iter_import = 1500, #Iteration number: short run
burnin = 500, #burnin period: first run
outlier_relative = T, #Absolute / relative threshold
outlier_threshold = 0.9 #Value of the threshold
)
# Run model 1
out1 <- outbreaker(data = data1, config = config1, moves = moves,
priors = priors, likelihoods = likelihoods)
# Set data and config for model 2
data2 <- outbreaker_data(dates = dt_cases$Date,
age_group = dt_cases$age_group,
region = dt_cases$Cens_tract,
genotype = dt_cases$Genotype, w_dens = w_dens,
f_dens = f_dens, a_dens = a_dens,
population = pop_vect, distance = dist_mat,
import = dt_cases$import #Import status of the cases
)
config2 <- create_config(data = data2,
find_import = FALSE, # No inference of import status
n_iter = 5000,
sample_every = 50, # 1 in 50 iterations is kept
burnin = 500)
# Run model 2
out2 <- outbreaker(data = data2, config = config2, moves = moves,
priors = priors, likelihoods = likelihoods)
The matrices out1
and out2
now contain the
posterior, likelihood, and prior of the trees generated at every
iteration, along with the values of the spatial parameters
a
and b
, the conditional report ratio
pi
, and the index, estimated infection date and number of
generations for each case.
The function summary
prints a summary of the features of
the output matrix generated by outbreaker()
. The function
generates a list with the number of steps, the distributions of the
posterior, likelihood and priors, the parameter distributions, the most
likely infector and the probability of being an import for each case,
and the cluster size distribution. For example, we can use it to
describe the distribution of the inferred parameters. We remove the
burn-in period, which corresponds to the number of iterations before the
MCMC converged.
burnin <- config1$burnin
# Summary parameters a and b
#Model 1
print(summary(out1, burnin = burnin)$a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2143 0.5634 0.8041 0.8166 0.9961 1.4849
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07689 0.09365 0.09951 0.09966 0.10536 0.13297
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2066 0.8161 1.0515 0.9894 1.2512 1.4806
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08977 0.11948 0.12786 0.12759 0.13952 0.16237
In order to evaluate the fit of the model, we plot the median
inferred cluster size distribution in out1
and
out2
and the credible intervals, and compare it with the
reference data. First, we group together clusters of similar sizes. The
breaks of each group is written in the vector
group_cluster
. In this example, we defined the size
categories as 1; 2; 3 − 4;
5 − 9; 10 − 15; 15 − 40 and 40+. The inferred cluster size distributions
are shown in the element cluster
of the list generated by
summary(out1)
, and can be aggregated using the input
variable group_cluster
. The generated barplot shows that
the number of isolated cases in out1
is lower than in the
data. Therefore, when the import status of cases was inferred, the model
underestimated the number of clusters and tended to link together
unrelated cases (Figure 3). On the other hand, the cluster size
distribution in out2
is very similar to the data.
Nevertheless, the cluster size distribution when the import status of
the cases is inferred depends on the likelihood threshold set in
outlier_threshold
and outlier_relative
. Using
a looser or stronger definitions of λ would impact the cluster size
distribution inferred in out1
.
# We create groups of cluster size: initialise the breaks for each group
group_cluster <- c(1, 2, 3, 5, 10, 15, 40, 100) - 1
# Reference data: h$counts
h <- hist(table(dt_cases$cluster), breaks = group_cluster, plot = FALSE)
# Grouped cluster size distribution in each run
clust_infer1 <- summary(out1, burnin = burnin,
group_cluster = group_cluster)$cluster
clust_infer2 <- summary(out2, group_cluster = group_cluster,
burnin = burnin)$cluster
# Merge inferred and reference cluster size distributions into one matrix
clust_size_matrix <- rbind(clust_infer1["Median",], clust_infer2["Median",],
h$counts)
# Plot
b <- barplot(clust_size_matrix, names.arg = colnames(clust_infer1), las=1,
ylab = "Number of clusters", xlab = "Cluster size", main = "",
beside = T, ylim = c(0, max(c(clust_infer1, clust_infer2))))
# Add the 50% CI
arrows(b[1,], clust_infer1["1st Qu.",], b[1,], clust_infer1["3rd Qu.",],
angle = 90, code = 3, length = 0.1)
arrows(b[2,], clust_infer2["1st Qu.",], b[2,], clust_infer2["3rd Qu.",],
angle = 90, code = 3, length = 0.1)
# Add legend
legend("topright", fill = grey.colors(3), bty = "n",
legend = c("Inferred import status",
"Known import status", "Simulated dataset"))
Although the aggregated cluster size distributions are close to the
reference data, we have to investigate the reconstructed transmission
trees to ensure the index assigned to each case is in agreement with the
reference dataset. We write two functions: in index_infer
we compute the proportion of iterations where the inferred index is the
actual index for each case (perfect match); In index_clust
we compute the proportion of iterations where the inferred index is from
the same reference cluster as the actual index (close match).
#' Title: Compute the proportion of iterations in the outbreaker() output
#` where the inferred index matches the actual index in dt_cases
#'
#' @param dt_cases: reference dataset
#' @param out: Matrix output of outbreaker()
#' @param burnin: Numeric, length of the burnin phase
#'
#' @return Numeric vector showing the proportion of iterations pointing to
#' the correct index case
index_infer <- function(dt_cases, out, burnin){
out_index <- out[out$step > burnin, grep("alpha", colnames(out))]
ID_index <- matrix(dt_cases[unlist(out_index), ID], ncol = nrow(dt_cases))
match_infer_data <- t(ID_index) == dt_cases$infector_ID
match_infer_data[is.na(t(ID_index)) & is.na(dt_cases$infector_ID)] <- TRUE
prop_correct <- rowSums(match_infer_data, na.rm = T)/ncol(match_infer_data)
return(prop_correct)
}
# Same as index_infer, except it returns the proportion of inferred indexes
# who are in the same reference cluster as the case
index_clust <- function(dt_cases, out, burnin){
out_index <- out[out$step > burnin, grep("alpha", colnames(out))]
clust_index <- matrix(dt_cases[unlist(out_index), cluster],
ncol = nrow(dt_cases))
match_infer_data <- t(clust_index) == dt_cases$cluster
match_infer_data <- match_infer_data[!is.na(dt_cases$infector_ID),]
prop_correct <- rowSums(match_infer_data, na.rm = T)/ncol(match_infer_data)
return(prop_correct)
}
# Run index_infer for each model
index_infer1 <- index_infer(dt_cases = dt_cases, out = out1, burnin = burnin)
index_infer2 <- index_infer(dt_cases = dt_cases, out = out2, burnin = burnin)
# Run index_clust for each model
index_clust1 <- index_clust(dt_cases = dt_cases, out = out1, burnin = burnin)
index_clust2 <- index_clust(dt_cases = dt_cases, out = out2, burnin = burnin)
The plots show that inferring the import status of cases decreased the proportion of perfect and close match for most cases (Figure 4). This highlights the importance of using reliable contact tracing investigations to investigate the import status of cases.
# Plot the sorted proportion in each model
oldpar <- par(no.readonly =TRUE)
par(bty = "n", mfrow = c(1, 2), mar = c(5,4,2,0), oma = c(0, 0, 0, 0))
# Panel A
plot(sort(index_infer1), type = "l", ylab = "Proportion", xlab = "Case",
main = "A", las=1, col = grey.colors(3)[1], lwd = 3)
lines(sort(index_infer2), col = grey.colors(3)[2], lwd = 3)
# Panel B
plot(sort(index_clust1), type = "l", xlab = "Case", ylab = "",
main = "B", las=1, col = grey.colors(3)[1], lwd = 3)
lines(sort(index_clust2), col = grey.colors(3)[2], lwd = 3)
legend("bottomright", col = grey.colors(3)[1:2], lwd = 3, bty = "n",
legend = c("Inferred import status",
"Known import status"))
We now generate two maps to investigate the geographical distribution
of the importations, and the average number of secondary cases per
region in out1
and out2
. The maps are
generated using the package ggplot2
First, we retrieve the boundary files of the census tract in Ohio. They will be used to generate the background of the maps, and are available using the library tigris. The boundary files are also available on the website census.gov. We import them in a format compatible with the package sf and create one background map for each model.
library(ggplot2)
# Read the shapefile and create one map for each model
map1 <- tigris::tracts(state = "Ohio", class = "sf", progress_bar = FALSE)
map1$INTPTLON <- as.numeric(map1$INTPTLON)
map1$INTPTLAT <- as.numeric(map1$INTPTLAT)
map2 <- map1
map1$model <- "Model 1"
map2$model <- "Model 2"
We are interested in two outputs of the models: i) the number of imports per region to show the regions where importations of cases are most likely, and ii) the geographical distribution of the number of secondary cases per case, which give insight into the areas most susceptible to the spread of the disease.
We compute the proportion of iterations where each case was an import
in out1 and out2. The element tree
of
summary(out1)
gives the most likely infector, the
proportion of iterations where the index is the most likely infector
(i.e. the support of the connection) and the median number of
generations between the two cases, the most likely infection date and
the chances of being an import for each case. We therefore add the
columns prop_infer1
and prop_infer2
to
dt_cases
. As the import status is not inferred in
out2
, prop_infer2
is a binary value, and is
the same as dt_cases$import
.
# Add the proportion of iteration in model 1 where each case is an import
dt_cases[, prop_infer1 := summary(out1, burnin = burnin)$tree$import]
# Add the proportion of iteration in model 2 where each case is an import
dt_cases[, prop_infer2 := summary(out2, burnin = burnin)$tree$import]
We then aggregate the number of imports per region in each model, and
name these vectors prop_reg1
and prop_reg2
. We
then add the number of imports in each region to the matrices describing
the maps.
# Number of import per region in model 1
prop_reg1 <- dt_cases[, .(prop_per_reg = sum(prop_infer1)),
by = Cens_tract]$prop_per_reg
# Number of import per region in model 2
prop_reg2 <- dt_cases[, .(prop_per_reg = sum(prop_infer2)),
by = Cens_tract]$prop_per_reg
names(prop_reg1) <- names(prop_reg2) <- unique(dt_cases$Cens_tract)
# Add the number of imports in each region to the maps
map1$prop_reg <- prop_reg1[as.character(map1$GEOID)]
map2$prop_reg <- prop_reg2[as.character(map2$GEOID)]
We now plot the number of imports per region in each model (Figure
5). Regions where no cases were reported are shown in grey. The right
panel (Model out2
) shows the geographical distribution of
importations in the data.
We observe discrepancies between the two panels: although some regions have the correct number of imports inferred in most iterations, there are uncertainties for some imports in the left panel. Furthermore, the right panel shows there should be one import in 4 of the regions in the central area, which seems to be underestimated in the left panel. These maps highlight the uncertainty added by the inference of the import status of each case.
# Merge maps
maps <- rbind(map1, map2)
limits_long <- c(-84, -82)
limits_lat <- c(40, 41.5)
maps <- maps[maps$INTPTLON > limits_long[1] & maps$INTPTLON < limits_long[2],]
maps <- maps[maps$INTPTLAT > limits_lat[1] & maps$INTPTLAT < limits_lat[2],]
# Plot: number of imports per region, two panels
ggplot(maps) + geom_sf(aes(fill = prop_reg)) + facet_grid(~model) +
scale_fill_gradient2(na.value = "lightgrey", midpoint = 0.8,
breaks = c(0, 0.5, 1, 1.5), name = "Nb imports",
low = "white", mid = "lightblue", high = "darkblue") +
coord_sf(xlim = c(-83.8, -82.2), ylim = c(40.2, 41.3)) +
theme_classic(base_size = 9)
We now compute the average number of secondary cases per case in each
region. We define the function
n_sec_per_reg¬, which computes the number of secondary cases per case and aggregates it per region at each iteration of the model. We then extract the median number of secondary cases per case in each region. The dispersion of the number of secondary cases per region can be generated by taking another quantile than the median. For example,
n_sec1
<- apply(n_sec_tot1[,-1], 1, function(X) quantile(X, 0.25))` would
show the first quartile in each region.
#' Title: Compute the number of secondary cases per case in each region
#'
#' @param dt_cases: reference dataset
#' @param out: Matrix output of outbreaker()
#' @param burnin: Numeric, length of the burnin phase
#'
#' @return A numeric matrix: the first column is the census tract ID, the
#' other columns show the number of secondary cases per case. Each row
#' corresponds to a different iteration.
n_sec_per_reg <- function(dt_cases, out, burnin){
## Number of secondary cases per case
n_sec <- apply(out[out$step > burnin, grep("alpha", colnames(out))], 1,
function(X){
X <- factor(X, 1:length(X))
return(table(X))})
## Aggregate by region
tot_n_sec_reg <- aggregate(n_sec, list(dt_cases$Cens_tract), sum)
## Divide by the number of cases in each region
tot_n_sec_reg <- cbind(tot_n_sec_reg[, 1],
tot_n_sec_reg[, -1] / table(dt_cases$Cens_tract))
return(tot_n_sec_reg)
}
n_sec_tot1 <- n_sec_per_reg(dt_cases = dt_cases, out = out1, burnin = burnin)
n_sec_tot2 <- n_sec_per_reg(dt_cases = dt_cases, out = out2, burnin = burnin)
n_sec1 <- apply(n_sec_tot1[,-1], 1, median)
n_sec2 <- apply(n_sec_tot2[,-1], 1, median)
names(n_sec1) <- names(n_sec2) <- unique(dt_cases$Cens_tract)
We then add the number of secondary cases per region to the matrices describing the maps.
map1$n_sec <- as.numeric(n_sec1[as.character(map1$GEOID)])
map2$n_sec <- as.numeric(n_sec2[as.character(map2$GEOID)])
We can now plot the geographical distribution of the median number of
secondary cases in each region (Figure 6). The maps generated by the two
models are very similar. Both of them show the spatial heterogeneity of
the number of secondary infections. Some regions seem to consistently
cause more secondary cases. There are minor discrepancies between the
maps, but they show the same pattern. If we change the vectors
n_sec1
and n_sec2
to plot different deciles,
we can get an idea of the dispersion of the number of secondary cases in
the different iterations of the models.
# Merge maps
maps <- rbind(map1, map2)
limits_long <- c(-84, -82)
limits_lat <- c(40, 41.5)
maps <- maps[maps$INTPTLON > limits_long[1] & maps$INTPTLON < limits_long[2],]
maps <- maps[maps$INTPTLAT > limits_lat[1] & maps$INTPTLAT < limits_lat[2],]
# Plot the geographical distribution of the number of secondary cases
ggplot(maps) + geom_sf(aes(fill = n_sec)) + facet_grid(~model) +
scale_fill_gradient2(na.value = "lightgrey", mid = "lightblue",
low = "white", midpoint = 1, high = "darkblue",
breaks = seq(0, 5, 0.5),name = "Sec cases") +
coord_sf(xlim = c(-83.8, -82.2), ylim = c(40.2, 41.3)) +
theme_classic(base_size = 9)
In the previous section, we ran and evaluated two different models to reconstruct transmission clusters from simulated surveillance data, and highlighted the spatial heterogeneity of measles transmission in the region. These models were run using the default likelihood, prior and movement functions. Now we develop a third model, where the spatial connection between regions is based on the Stouffer’s rank method.
The Stouffer’s rank model does not take absolute distance into account to compute the probability of connection between regions. In this model, the connectivity between the regions k and l only depends on the summed population of all the towns closer to l than k. If we define this collection of towns Ωk, l = {i : 0 ≤ d(i, l) ≤ d(k, l)}, the distance of Stouffer is then $p_{kl} = m_l^c * \left(\frac{m_k}{\sum_{i \in \Omega_{k,l}} m_i}\right)^a$. From this, we deduce the probability that a case from region l was infected by a case from region k as $$s(k,l)= \frac{p_{kl}}{\Sigma_hp_{hl}}=\frac{\left(\frac{m_k}{\sum_{i \in \Omega_{k,l}} m_i}\right)^a}{\Sigma_h\left(\frac{m_h}{\sum_{i \in \Omega_{h,l}} m_i}\right)^a}$$
This model is similar to the power-law gravity model with two main differences:
Each cell of the distance matrix should be equal to ∑i ∈ Ωk, lmi.
There is only one spatial parameter a to estimate.
First, we create the initial distance matrix in the Stouffer’s rank
model dist_mat_stouffer
:
# For every column of the distance matrix, use the cumulative sum of the
# population vector ordered by the distance. Remove the values where
# the distance between the regions is above gamma
dist_mat_stouffer <- apply(dist_mat, 2, function(X){
pop_X <- cumsum(pop_vect[order(X)])
omega_X <- pop_X[names(X)]
# omega_X is set to -1 if the distance between two regions is above gamma
omega_X[X > config1$gamma] <- -1
return(omega_X)
})
# The new value of gamma is equal to the maximum of dist_mat_stouffer + 1
gamma <- max(dist_mat_stouffer) + 1
# The values previously set to -1 are now set to the new value of gamma
dist_mat_stouffer[dist_mat_stouffer == -1] <- max(dist_mat_stouffer) * 2
We write the function cpp_stouffer_move_a
to replace the
default movement function cpp_move_a
. As the formula used
to compute the Stouffer’s rank values is similar to the power law
gravity models, we do not need to re-write cpp_log_like
,
the default function to compute the probability matrix. Other distance
models may require a rewriting of cpp_log_like
, which
should then be called in the new movement function. In the Stouffer’s
rank model, there is no parameter b, both spatial parameters are equal
to a. We use the package Rcpp to
source the new movement function.
// [[Rcpp::depends(o2geosocial)]]
#include <Rcpp.h>
#include <Rmath.h>
#include <o2geosocial.h>
// [[Rcpp::export()]]
Rcpp::List cpp_stouffer_move_a(Rcpp::List param, Rcpp::List data, Rcpp::List config,
Rcpp::RObject custom_ll, Rcpp::RObject custom_prior) {
// Import parameters
Rcpp::List new_param = clone(param);
double gamma = config["gamma"];
int max_kappa = config["max_kappa"];
Rcpp::String spatial = config["spatial_method"];
Rcpp::IntegerVector region = data["region"];
Rcpp::NumericMatrix distance = data["distance"];
Rcpp::NumericMatrix can_be_ances_reg = data["can_be_ances_reg"];
Rcpp::NumericVector population = data["population"];
Rcpp::NumericVector limits = config["prior_a"];
// Size of the probability matrix
Rcpp::List new_log_s_dens = new_param["log_s_dens"];
Rcpp::NumericMatrix probs = new_log_s_dens[0];
int nb_cases = pow(probs.size(), 0.5);
// Draw new value of a
Rcpp::NumericVector new_a = new_param["a"];
double sd_a = static_cast<double>(config["sd_a"]);
double old_logpost = 0.0, new_logpost = 0.0, p_accept = 0.0;
// proposal (normal distribution with SD: config$sd_a)
new_a[0] += R::rnorm(0.0, sd_a); // new proposed value
if (new_a[0] < limits[0] || new_a[0] > limits[1]) {
return param;
}
// Generate new probability matrix
new_param["log_s_dens"] = o2geosocial::cpp_log_like(population, distance,
can_be_ances_reg,
new_a[0], new_a[0],
max_kappa, gamma,
spatial, nb_cases);
// Compare old and new likelihood values
old_logpost = o2geosocial::cpp_ll_space(data, config, param,
R_NilValue, custom_ll);
new_logpost = o2geosocial::cpp_ll_space(data, config, new_param,
R_NilValue, custom_ll);
// Add prior values
old_logpost += o2geosocial::cpp_prior_a(param, config, custom_prior);
new_logpost += o2geosocial::cpp_prior_a(new_param, config, custom_prior);
// Accept or reject proposal
p_accept = exp(new_logpost - old_logpost);
if (p_accept < unif_rand()) {
return param;
}
return new_param;
}
We modify the element a of
the list moves
, and replace it with
cpp_stouffer_move_a
. As b
is not estimated in
this model, we create the null function f_null
, and modify
the list priors
.
moves3 <- custom_moves(a = cpp_stouffer_move_a)
f_null <- function(param) {
return(0.0)
}
priors3 <- custom_priors(b = f_null)
Finally, we set up the lists data
and
config
: the distance matrix for this run is
dist_mat_stouffer
; we do not move the parameter
b
, and we change the value of gamma
. We then
run out_stouffer
.
# Set data and config lists
data3 <- outbreaker_data(dates = dt_cases$Date, #Onset dates
age_group = dt_cases$age_group, #Age group
region = dt_cases$Cens_tract, #Location
genotype = dt_cases$Genotype, #Genotype
w_dens = w_dens, #Serial interval
f_dens = f_dens, #Latent period
a_dens = a_dens, #Age stratified contact matrix
population = pop_vect, #Population
distance = dist_mat_stouffer #Distance matrix
)
config3 <- create_config(data = data3,
gamma = gamma,
move_b = FALSE, # b is not estimated
init_b = 0,
spatial_method = "power-law",
n_iter = 5000, #Iteration number: main run
n_iter_import = 1500, #Iteration number: short run
burnin = 500, #burnin period: first run
outlier_relative = T, #Absolute / relative threshold
outlier_threshold = 0.9 #Value of the threshold
)
# Run the model using the Stouffer's rank method
out_stouffer <- outbreaker(data = data3, config = config3, moves = moves3,
priors = priors3, likelihoods = likelihoods)
As we did in the previous section, we plot the inferred cluster size distribution and compare it to the reference data (Figure 7). We observe discrepancies between the inferred distribution and the data: the model over-estimates the number of larger clusters (above 5 cases).
# Grouped cluster size distribution in the Stouffer's rank model
clust_infer_stouf <- summary(out_stouffer, burnin = burnin,
group_cluster = group_cluster)$cluster
# Merge inferred and reference cluster size distributions
clust_size_matrix <- rbind(clust_infer_stouf["Median",], h$counts)
# Plot the two distributions
b <- barplot(clust_size_matrix, names.arg = colnames(clust_infer_stouf),
beside = T)
# Add CIs
arrows(b[1,], clust_infer_stouf["1st Qu.",], b[1,],
clust_infer_stouf["3rd Qu.",], angle = 90, code = 3, length = 0.1)
legend("topright", fill = grey.colors(2), bty = "n",
legend = c("Inferred import status, Stouffer's rank method",
"Simulated dataset"))
Finally, we plot the proportion of perfect and close match for each case (Figure 8). We observe that the fit obtained with the Stouffer’s rank method is consistently worse than the first two models. The Stouffer’s rank method did not improve the agreement between the inferred trees and the reference data.
The reference data used in the study were simulated using an exponential gravity model, which explains why introducing the Stouffer’s rank method did not improve the inference. This is not representative of the performance of each mobility model on real-world data.
index_infer_stouf <- index_infer(dt_cases = dt_cases, out = out_stouffer,
burnin = config1$burnin)
index_clust_stouf <- index_clust(dt_cases = dt_cases, out = out_stouffer,
burnin = config1$burnin)
# Plot the sorted proportion in each model
par(bty = "n", mfrow = c(1, 2), mar = c(5,4,2,0), oma = c(0, 0, 0, 0))
# Panel A
plot(sort(index_infer_stouf), type = "l", xlab = "Case", ylab = "",
main = "A", las=1, col = grey.colors(2)[1], lwd = 3)
# Panel B
plot(sort(index_clust_stouf), type = "l", ylab = "Proportion", xlab = "Case",
main = "B", las=1, col = grey.colors(2)[1], lwd = 3)
The R package o2geosocial is a new tool for data analysis
building upon the framework developed in outbreaker2.
It is highly flexible and only requires routinely collected surveillance
data. It can be used to identify large transmission clusters and
highlight the dynamics of transmission in different regions. An
application of the algorithm on a small geographical scale was presented
in the tutorial, it can also be used to study datasets of cases
scattered across larger areas. The methods presented in the tutorial can
be applied to the second dataset included in the package
toy_outbreak_long
, which includes more than 1900 cases
simulated in the United States. The computation time increases with the
number of regions and the number of cases.
We showed how the model could be edited to implement a range of mobility models. Describing human mobility during infectious diseases outbreaks can be challenging, and the performances of the models depend on the setting studied. The package can be extended to take into account the variety of existing mobility models. We encourage the development of extensions and the use of o2geosocial to study different pathogens and settings where sequence data cannot be used to reconstruct transmission clusters. We hope that wider use of o2geosocial can help maximise what can be reconstructed from routinely collected data and epidemiological investigations to improve our understanding of outbreak dynamics.