We’ll demonstrate how to use the package with data from the
Framingham Heart Study. The following information is from the official
Framingham study documentation (https://biolincc.nhlbi.nih.gov/teaching/):
“The Framingham Heart Study is a long term prospective study of the
etiology of cardiovascular disease among a population of free living
subjects in the community of Framingham, Massachusetts. The Framingham
Heart Study was a landmark study in epidemiology in that it was the
first prospective study of cardiovascular disease and identified the
concept of risk factors and their joint effects. The study began in 1948
and 5,209 subjects were initially enrolled in the study. Participants
have been examined biennially since the inception of the study and all
subjects are continuously followed through regular surveillance for
cardiovascular outcomes. Clinic examination data has included
cardiovascular disease risk factors and markers of disease such as blood
pressure, blood chemistry, lung function, smoking history, health
behaviors, ECG tracings, Echocardiography, and medication use. Through
regular surveillance of area hospitals, participant contact, and death
certificates, the Framingham Heart Study reviews and adjudicates events
for the occurrence of Angina Pectoris, Myocardial Infarction, Heart
Failure, and Cerebrovascular disease.
NOTE: This is a “teaching” dataset. Specific methods were employed to
ensure an anonymous dataset that protects patient confidentiality;
therefore, this dataset is inappropriate for publication purposes.” The
use of these data for the purposes of this package were approved on
11Mar2019 (request #7161) by NIH/NHLBI.
Binary outcome example
Research question: what is the effect of having diabetes at the
beginning of the study on the 24-year risk of cardiovascular disease or
death due to any cause?
Here, we will estimate the risk difference, risk ratio, odds ratio,
and number needed to treat, adjusting for patient’s age, sex, body mass
index (BMI), smoking status (current smoker or not), and prevalence of
hypertension (if they are hypertensive or not at baseline). Logistic
regression is used as the underlying parametric model for
g-computation.
## Specify the regression formula
cvdd.formula <- cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP
## For reproducibility, we should always set the seed since the g-computation uses
## random resampling of the data to calculate confidence intervals and random
## sampling of the distribution when predicting outcomes.
set.seed(1298)
## Call the gComp function
binary.res <- gComp(data = cvdd,
formula = cvdd.formula,
outcome.type = "binary",
R = 1000)
binary.res
#> Formula:
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP
#>
#> Parameter estimates:
#> DIABETES1_v._DIABETES0 Estimate (95% CI)
#> Risk Difference 0.287 (0.196, 0.395)
#> Risk Ratio 1.700 (1.476, 1.966)
#> Odds Ratio 4.550 (2.771, 9.085)
#> Number needed to treat/harm 3.484
The result obtained from the gComp
function is an object
of class gComp which is a list containing the summary
results, results.df
, n
, R
,
boot.result
, contrast
, family
,
formula
, predicted.outcome
, and
glm.result
(see ?gComp
or
help(gComp)
for a more detailed explanation of each item in
the list).
class(binary.res)
#> [1] "gComp" "list"
## The names of the different items in the list:
names(binary.res)
#> [1] "summary" "results.df"
#> [3] "n" "R"
#> [5] "boot.result" "contrast"
#> [7] "family" "formula"
#> [9] "predicted.outcome" "glm.result"
## Sample size of the original data:
binary.res$n
#> [1] 4240
## Contrast being compared in the analysis:
binary.res$contrast
#> [1] "DIABETES1 v. DIABETES0"
There is also a summary method for objects with class
gComp that contains the formula, family and link
function, contrast being made, parameter estimates with 95% CIs, and a
summary of the underlying glm used for predictions.
summary(binary.res)
#> Formula:
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP
#>
#> Family: binomial
#> Link function: logit
#>
#> Contrast: DIABETES1 v. DIABETES0
#>
#> Parameter estimates:
#> DIABETES1_v._DIABETES0 Estimate (95% CI)
#> Risk Difference 0.287 (0.196, 0.395)
#> Risk Ratio 1.700 (1.476, 1.966)
#> Odds Ratio 4.550 (2.771, 9.085)
#> Number needed to treat/harm 3.484
#>
#> Underlying glm:
#> Call: stats::glm(formula = formula, family = family, data = working.df,
#> na.action = stats::na.omit)
#>
#> Coefficients:
#> (Intercept) DIABETES1 AGE SEX1
#> -6.25839 1.51501 0.10246 -0.79405
#> BMI CURSMOKE1 PREVHYP1
#> 0.02512 0.58550 0.77804
#>
#> Degrees of Freedom: 4220 Total (i.e. Null); 4214 Residual
#> (19 observations deleted due to missingness)
#> Null Deviance: 5735
#> Residual Deviance: 4697 AIC: 4711
Checking model fit
The 95% CIs obtained from the riskCommunicator package represent
population-standardized marginal effects obtained with g-computation. To
ensure that the parameter estimates from each bootstrap iteration are
normally distributed, we can also look at the histogram and Q-Q plots of
bootstrapped estimates by calling:
The histograms show the different effect estimates obtained by each
bootstrap resampling of the data and should be normally distributed if
the model is correctly specified. Q-Q plots help to verify that the
bootstrap values are normally distributed by comparing the actual
distribution of bootstrap values against a theoretical normal
distribution of values centered at mean = 0. If the estimates are
normally distributed, the plotted estimates (black circles) should
overlay the diagonal red dashed line.
In the manuscript, we compare the results of gComp to standard
regression models. Here, we show how to do that so we can re-create
Table 2.
First we obtain the odds ratio using logistic regression.
#>
#> Call: glm(formula = cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE +
#> PREVHYP, family = binomial(link = "logit"), data = cvdd)
#>
#> Coefficients:
#> (Intercept) DIABETES1 AGE SEX1
#> -6.25839 1.51501 0.10246 -0.79405
#> BMI CURSMOKE1 PREVHYP1
#> 0.02512 0.58550 0.77804
#>
#> Degrees of Freedom: 4220 Total (i.e. Null); 4214 Residual
#> (19 observations deleted due to missingness)
#> Null Deviance: 5735
#> Residual Deviance: 4697 AIC: 4711
Note the use of confint.default
in the above call. The
typical call confint
does not return Wald-based CIs, so
we’ve forced it with confint.default.
You can read more
about that here: https://stats.stackexchange.com/questions/5304/why-is-there-a-difference-between-manually-calculating-a-logistic-regression-95
Next, we calculate the risk ratio using a Poisson approximation of
log-binomial regression with robust variance (sandwich standard
errors).
#>
#> Call: glm(formula = cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE +
#> PREVHYP, family = "poisson", data = cvdd %>% mutate(cvd_dth = ifelse(cvd_dth ==
#> "0", 0, ifelse(cvd_dth == "1", 1, NA))))
#>
#> Coefficients:
#> (Intercept) DIABETES1 AGE SEX1
#> -3.86603 0.39673 0.04930 -0.38348
#> BMI CURSMOKE1 PREVHYP1
#> 0.01403 0.26329 0.35452
#>
#> Degrees of Freedom: 4220 Total (i.e. Null); 4214 Residual
#> (19 observations deleted due to missingness)
#> Null Deviance: 3079
#> Residual Deviance: 2539 AIC: 6073
We can try to calculate the risk difference using a log-linear
regression, but the model won’t converge.
std.reg.rd = glm(formula = cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP,
data = cvdd %>%
## To use linear regression, we need to change DIABETES from a
## factor to a numeric (0,1) variable
mutate(cvd_dth = ifelse(cvd_dth == "0", 0,
ifelse(cvd_dth == "1", 1, NA))),
family=gaussian(link = 'log'))
We can re-create Table 2 from the manuscript now!
# combine standard regression results
std.regression.res = df.std.reg.or %>%
mutate(Parameter = "Odds Ratio",
Std_regression = paste0(round(OR, 2),
" (",
round(LL, 2),
", ",
round(UL, 2),
")")) %>%
bind_rows(df.std.reg.rr %>%
mutate(Parameter = "Risk Ratio",
Std_regression = paste0(round(Estimate, 2),
" (",
round(LL, 2),
", ",
round(UL, 2),
")"))) %>%
select(Parameter, Std_regression) %>%
rename(`Standard regression` = Std_regression)
rownames(std.regression.res) = NULL
(table2 = binary.res$results.df %>%
mutate(riskCommunicator = paste0(format(round(Estimate, 2), 2),
" (",
format(round(`2.5% CL`, 2), 2),
", ",
format(round(`97.5% CL`, 2), 2),
")"),
riskCommunicator = ifelse(Parameter == "Number needed to treat/harm",
round(Estimate, 2), riskCommunicator)) %>%
select(Parameter, riskCommunicator) %>%
left_join(std.regression.res, by = "Parameter"))
Risk Difference |
0.29 (0.20, 0.39) |
NA |
Risk Ratio |
1.70 (1.48, 1.97) |
1.49 (1.33, 1.66) |
Odds Ratio |
4.55 (2.77, 9.09) |
4.55 (2.66, 7.78) |
Number needed to treat/harm |
3.48 |
NA |
Rate outcome example
Research question: what is the effect of having diabetes at the
beginning of the study on the rate of cardiovascular disease or death
due to any cause?
Here, we will estimate the rate difference and rate ratio, adjusting
for patient’s age, sex, body mass index (BMI), smoking status (current
smoker or not), and prevalence of hypertension (if they are hypertensive
or not at baseline). We have included timeout as the
offset
and a rate.multiplier
of
365.25*100
so that the estimates are returned with units of
100 person-years. Poisson regression is used as the underlying
parametric model for g-computation. (Note: for overdispersed count/rate
outcomes, the negative binomial distribution can be specified by setting
outcome.type
to “count_nb” or
“rate_nb”.)
## Modify the dataset to change the variable cvd_dth from a factor
## to a numeric variable since the outcome for Poisson
## regression must be numeric.
cvdd.t <- cvdd %>%
dplyr::mutate(cvd_dth = as.numeric(as.character(cvd_dth)),
timeout = as.numeric(timeout))
set.seed(6534)
rate.res <- gComp(data = cvdd.t,
Y = "cvd_dth",
X = "DIABETES",
Z = c("AGE", "SEX", "BMI", "CURSMOKE", "PREVHYP"),
outcome.type = "rate",
rate.multiplier = 365.25*100,
offset = "timeout",
R = 1000)
rate.res
#> Formula:
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP + offset(log(timeout_adj))
#>
#> Parameter estimates:
#> DIABETES1_v._DIABETES0 Estimate (95% CI)
#> Incidence Rate Difference 2.189 (1.436, 3.063)
#> Incidence Rate Ratio 1.913 (1.603, 2.288)
Rate outcome with subgroups example
Research question: what is the effect of having diabetes at the
beginning of the study on the rate of cardiovascular disease or death
due to any cause, stratified by sex?
Here, we will estimate the same effects above, but in subgroups
defined by sex.
rate.res.subgroup <- gComp(data = cvdd.t,
Y = "cvd_dth",
X = "DIABETES",
Z = c("AGE", "SEX", "BMI", "CURSMOKE", "PREVHYP"),
subgroup = "SEX",
outcome.type = "rate",
rate.multiplier = 365.25*100,
offset = "timeout",
R = 1000)
rate.res.subgroup
#> Formula:
#> cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP + DIABETES:SEX + offset(log(timeout_adj))
#>
#> Parameter estimates:
#> DIABETES1_v._DIABETES0_SEX0 Estimate (95% CI)
#> Incidence Rate Difference 2.488 (1.340, 4.058)
#> Incidence Rate Ratio 1.794 (1.423, 2.301)
#> DIABETES1_v._DIABETES0_SEX1 Estimate (95% CI)
#> Incidence Rate Difference 1.918 (0.995, 3.322)
#> Incidence Rate Ratio 2.044 (1.541, 2.860)
Let’s make a figure comparing the results we got above using
riskCommunicator with those we get using standard regression models for
this rate example.
First, we need to get the covariate-conditional estimates using
standard Poisson regression.
# Standard Poisson regression (spr) for the rate question, across both sexes
spr.rate = glm(
formula = cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP +
offset(log(timeout+0.001)),
data = cvdd.t,
family = "poisson"
)
# get the parameter estimate and CI from the model object
df.spr.rate = as.data.frame(
exp(cbind(Estimate = coef(spr.rate), confint.default(spr.rate, level = 0.95)))
) %>%
rename(`2.5% CL` = `2.5 %`, `97.5% CL` = `97.5 %`) %>%
filter(rownames(.) == "DIABETES1") %>%
mutate(Subgroup = "All")
# Standard Poisson regression (spr) for the rate question, by subgroup (SEX)
spr.rate.subgroup = glm(
formula = cvd_dth ~ DIABETES + AGE + SEX + BMI + CURSMOKE + PREVHYP +
DIABETES*SEX + offset(log(timeout+0.001)),
data = cvdd.t,
family = "poisson"
)
# Get the variance-covariance matrix so we can calculate standard errors
se.subgroup = vcov(spr.rate.subgroup)
# Get estimates and CIs
df.spr.rate.subgroup = data.frame(
Subgroup = c("Male", "Female"),
raw.est = c(coef(spr.rate.subgroup)[2],
coef(spr.rate.subgroup)[2] + coef(spr.rate.subgroup)[8]),
SE = c(sqrt(se.subgroup[2,2]),
sqrt(se.subgroup[2,2] + se.subgroup[8,8] + 2*se.subgroup[2,8]))) %>%
mutate(Estimate = exp(raw.est),
`2.5% CL` = exp(raw.est - 1.96 * SE),
`97.5% CL` = exp(raw.est + 1.96 * SE))
# Combine the results from the subgroup and full model into a single data.frame
combined.std.reg = df.spr.rate %>%
bind_rows(df.spr.rate.subgroup %>%
select(Subgroup, Estimate:`97.5% CL`)) %>%
mutate(Parameter = "Incidence Rate Ratio",
model = "Standard Poisson Regression")
Now, we can plot the same figure shown in the manuscript.
# Combine the riskCommunicator results with the standard Poisson regression results
df.combined = rate.res$results.df %>%
mutate(Subgroup = "All") %>%
bind_rows(df) %>%
mutate(model = "riskCommunicator") %>%
select(-Outcome, -Comparison) %>%
bind_rows(combined.std.reg) %>%
mutate(hline = ifelse(Parameter == "Incidence Rate Ratio", 1, 0))
rate.diff = ggplot(df.combined %>%
filter(Parameter == "Incidence Rate Difference"),
aes(x = Subgroup, y = Estimate, color = model)) +
geom_point(size = 2, position = position_dodge(width = .5)) +
geom_errorbar(aes(ymin = `2.5% CL`, ymax = `97.5% CL`),
width = 0.2,
position = position_dodge(width = .5)) +
scale_color_manual(values = c("#481567FF", "#3CBB75FF")) +
theme_bw() +
labs(x = "",
y = str_wrap('Incidence rate difference of
cardiovascular disease or death
(cases/100 person-years)', width = 32),
color = "") +
geom_hline(aes(yintercept = hline),
color = "red",
linetype = "dashed",
alpha = 0.3) +
theme(legend.position = "none")
rate.ratio = ggplot(df.combined %>%
filter(Parameter == "Incidence Rate Ratio"),
aes(x = Subgroup, y = Estimate, color = model)) +
geom_point(size = 2, position = position_dodge(width = .5)) +
geom_errorbar(aes(ymin = `2.5% CL`, ymax = `97.5% CL`),
width = 0.2,
position = position_dodge(width = .5)) +
scale_y_continuous(trans = "log2") +
scale_color_manual(values = c("#481567FF", "#3CBB75FF")) +
theme_bw() +
labs(x = "",
y = str_wrap('Incidence rate ratio of
cardiovascular disease or death
(shown on natural log scale)', width = 32),
color = "") +
geom_hline(aes(yintercept = hline),
color = "red",
linetype = "dashed",
alpha = 0.3) +
theme(legend.position = "bottom")
ggarrange(rate.ratio, rate.diff, ncol = 2, common.legend = T, legend = "bottom",
labels = c("A", "B"), widths = c(1, 1))
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets
#> [6] methods base
#>
#> other attached packages:
#> [1] ggpubr_0.6.0 sandwich_3.1-1
#> [3] formatR_1.14 printr_0.3
#> [5] lubridate_1.9.3 forcats_1.0.0
#> [7] stringr_1.5.1 dplyr_1.1.4
#> [9] purrr_1.0.2 readr_2.1.5
#> [11] tidyr_1.3.1 tibble_3.2.1
#> [13] ggplot2_3.5.1 tidyverse_2.0.0
#> [15] riskCommunicator_1.0.1 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.49 bslib_0.8.0
#> [4] rstatix_0.7.2 lattice_0.22-6 tzdb_0.4.0
#> [7] vctrs_0.6.5 tools_4.4.2 generics_0.1.3
#> [10] fansi_1.0.6 pkgconfig_2.0.3 lifecycle_1.0.4
#> [13] farver_2.1.2 compiler_4.4.2 munsell_0.5.1
#> [16] codetools_0.2-20 carData_3.0-5 htmltools_0.5.8.1
#> [19] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
#> [22] yaml_2.3.10 Formula_1.2-5 pillar_1.9.0
#> [25] car_3.1-3 jquerylib_0.1.4 MASS_7.3-61
#> [28] cachem_1.1.0 boot_1.3-31 abind_1.4-8
#> [31] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
#> [34] labeling_0.4.3 maketools_1.3.1 cowplot_1.1.3
#> [37] fastmap_1.2.0 grid_4.4.2 colorspace_2.1-1
#> [40] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4
#> [43] broom_1.0.7 withr_3.0.2 scales_1.3.0
#> [46] backports_1.5.0 timechange_0.3.0 gridExtra_2.3
#> [49] ggsignif_0.6.4 zoo_1.8-12 hms_1.1.3
#> [52] evaluate_1.0.1 knitr_1.49 rlang_1.1.4
#> [55] glue_1.8.0 jsonlite_1.8.9 R6_2.5.1