NHANES aims to produce national estimates on a range of health, nutrition, and other factors that accurately represent the non-institutionalized civilian U.S. population. The data is gathered using a complex, four-stage sampling design. When analyzing this data, it’s crucial to use specialized survey analysis techniques that incorporate specific weights. These weights account for the intricate sampling strategy employed during data collection.
You can find the NHANES sample weights as part of the downloadable data. For a detailed understanding of these weights and their correct usage, the CDC provides extensive documentation: CDC NHANES Tutorials.
It’s worth noting that NHANES methodologies and sampling strategies have evolved over time. Reasons for these changes range from methodological enhancements and external events, such as the COVID-19 pandemic, to a recent shift towards a more longitudinal perspective. Such changes can significantly influence an analysis, so analysts should approach the data with diligence and care.
There are numerous online resources that offer worked examples for further understanding, including the documentation for the survey package (R-Survey).

This vignette is not intended to serve as an exhaustive guide. Instead, we aim to showcase some fundamental functions and point you towards the more in-depth online resources needed for specialized analyses. Here we cover the basics of setting up the survey design and performing basic tasks. For those interested in diving deeper, we encourage you to develop and share more comprehensive documents.
Using the nhanesA and survey packages, we will explore the average blood pressure of individuals over 40 years of age, segmented by their reported ethnicity, during the 2017-2018 cycle. To achieve this, we’ll merge the demographic (DEMO_J) and blood pressure (BPX_J) datasets. Our examination will cover average blood pressure distributions across sexes and ethnicities, and we’ll conduct a regression analysis to understand the relationship between blood pressure and age. This is a basic demonstration; a comprehensive analysis would need to consider various risk factors and align with specific epidemiological questions.
# load the demographic and blood pressure tables for the 2017-2018 cycle
library(nhanesA)
library(survey)
demoj = nhanes("DEMO_J")
dim(demoj)
## [1] 9254 46
## merge DEMO_J and BPX_J using SEQN.
bpxj = nhanes("BPX_J")
data = merge(demoj, bpxj, by="SEQN")
dim(data)
## [1] 8704 66
To generate accurate estimates, it’s essential to establish a survey design object that integrates the weights into our analysis. This design object should be created before any subsetting or manipulation of the data. By doing this, we ensure that intricate survey design elements, like stratification and clustering, are fully captured and applied across the dataset.
It is also crucial to choose the appropriate weighting scheme for your analysis. NHANES offers various weights, such as interview, examination, fasting, and MEC laboratory weights, to name a few. The selection hinges on the specific dataset components and analyses you’re focusing on. Moreover, when analyzing data across multiple cycles, it will be necessary to modify these weights to ensure accurate representation and inference.
For a comprehensive understanding, the CDC provides in-depth guidance on its website: CDC NHANES Weighting Tutorial.
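To make this concrete, the design object used later in this vignette (nhanesDesign) can be constructed with svydesign(). The exact call is not shown in this document, so treat the following as a sketch: it uses the masked PSU (SDMVPSU) and stratum (SDMVSTRA) variables from DEMO_J and the 2-year MEC examination weights (WTMEC2YR), which are the usual choice for examination measurements such as blood pressure.
# Sketch: survey design object for the merged 2017-2018 data.
# SDMVPSU = masked PSU, SDMVSTRA = masked stratum, WTMEC2YR = 2-year MEC exam weight.
# When combining several 2-year cycles, these weights are typically rescaled
# (e.g., divided by the number of cycles) following CDC guidance.
nhanesDesign <- svydesign(id = ~SDMVPSU,
                          strata = ~SDMVSTRA,
                          weights = ~WTMEC2YR,
                          nest = TRUE,
                          data = data)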
Next, we subset the data to focus on individuals over 40 years of age. At this stage, it’s imperative to use the tools in the survey package for any data manipulations. This guarantees that weight adjustments are correctly applied, ensuring the validity of subsequent analyses. For comparison purposes, we also create an additional subset without the survey design framework, which allows for easy examination of unadjusted values, as sketched below.
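The subsets used below (dfsub for the design-based analyses and datasub for the unweighted comparisons) can be created as follows. The design subset matches the call shown in the svyglm output later in this vignette; the plain data-frame subset is our assumption of how the unweighted values were obtained.
# Subset the design object; design information and weights are preserved
dfsub <- subset(nhanesDesign, data$RIDAGEYR >= 40)
# Plain data-frame subset, used only for unweighted comparisons
datasub <- subset(data, data$RIDAGEYR >= 40)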
Next, we’ll explore the variable RIDRETH1 from the DEMO_J table, which represents reported ethnicity. We’ll focus on diastolic blood pressure for our analysis. For simplicity in our presentation, we’ll utilize only the first measurement, represented by the variable BPXDI1 in the BPX_J table.
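As a quick check (not part of the original analysis), the ethnicity categories and the number of available first diastolic readings can be inspected directly on the unweighted subset:
# counts per reported ethnicity category, including missing values
table(datasub$RIDRETH1, useNA = "ifany")
# number of non-missing first diastolic readings
sum(!is.na(datasub$BPXDI1))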
First, we compute the overall unweighted mean, then split by ethnicity and calculate the mean blood pressure for each group in the unweighted data.
# overall unweighted mean of the first diastolic reading
mean(datasub$BPXDI1, na.rm = TRUE)
## [1] 73.04455
## split the data by ethnicity and calculate means of the unweighted data
unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE)
unweighted_means
## Mexican American Other Hispanic Non-Hispanic White
## 72.41000 72.97611 70.84130
## Non-Hispanic Black Other Race - Including Multi-Racial
## 75.71466 74.41311
Now, we perform the same analysis using the survey weights.
# overall weighted mean using the survey design
svymean(~BPXDI1, dfsub, na.rm = TRUE)
## mean SE
## BPXDI1 73.237 0.5597
# By ethnicity
adjmnsbyEth = svyby(~BPXDI1, ~RIDRETH1, dfsub, svymean, na.rm=TRUE)
weighted_means <- as.numeric(adjmnsbyEth$BPXDI1)
combined_data <- cbind(unweighted_means, weighted_means)
colnames(combined_data) <- c("Unweighted", "Weighted")
combined_data
## Unweighted Weighted
## Mexican American 72.41000 74.03194
## Other Hispanic 72.97611 74.73756
## Non-Hispanic White 70.84130 72.45422
## Non-Hispanic Black 75.71466 75.71874
## Other Race - Including Multi-Racial 74.41311 74.60215
We can do this with gender and any other categorical variable:
# By Gender
mns = sapply(split(datasub$BPXDI1, datasub$RIAGENDR), mean, na.rm=TRUE)
adjmnsbyGen = svyby(~BPXDI1, ~RIAGENDR, dfsub, svymean, na.rm=TRUE)
combined_data <- cbind(mns, adjmnsbyGen$BPXDI1)
colnames(combined_data) <- c("Unweighted", "Weighted")
combined_data
## Unweighted Weighted
## Male 74.67330 75.31134
## Female 71.43385 71.31453
In addition to computing the mean, various other summary statistics, such as quantiles and variance, can be readily calculated using the survey design object. Importantly, each of these summary statistics will come with a corresponding standard error that accounts for the weights, providing insights into the precision of the estimates.
# unweighted quantiles
quantile(datasub$BPXDI1, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
## 25% 50% 75%
## 66 74 82
# For the survey weighted data
svyquantile(~BPXDI1, dfsub, quantiles = c(0.25,0.5,0.75), na.rm=TRUE)
## $BPXDI1
## quantile ci.2.5 ci.97.5 se
## 0.25 66 66 68 0.4691643
## 0.5 74 74 76 0.4691643
## 0.75 80 80 82 0.4691643
##
## attr(,"hasci")
## [1] TRUE
## attr(,"class")
## [1] "newsvyquantile"
# weighted median by gender
svyby(~BPXDI1, ~RIAGENDR, dfsub, svyquantile, quantiles = 0.5, ci = TRUE, na.rm = TRUE)
## RIAGENDR BPXDI1 se.BPXDI1
## Male 76 0.9383286
## Female 72 0.4691643
Note: The columns ci.2.5 and ci.97.5 in the output are the lower and upper bounds of a design-based 95% confidence interval for each estimated quantile. They are not the familiar symmetric "estimate ± 1.96 × SE" intervals; instead, the interval is computed directly from the weighted distribution, and the reported standard error is derived from the width of that interval.
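If a different interval level is desired, svyquantile() accepts an alpha argument; the call below (with alpha = 0.1 for a 90% interval) is purely illustrative and not part of the original analysis.
# 90% design-based interval for the weighted median (alpha = 0.1); illustrative only
svyquantile(~BPXDI1, dfsub, quantiles = 0.5, alpha = 0.1, na.rm = TRUE)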
# design-based variance of diastolic blood pressure
svyvar(~BPXDI1, dfsub, na.rm = TRUE)
## variance SE
## BPXDI1 173.68 12.279
# variance by ethnicity
svyby(~BPXDI1, ~RIDRETH1, dfsub, svyvar, na.rm = TRUE)
## RIDRETH1 BPXDI1 se
## Mexican American 200.8781 45.43074
## Other Hispanic 160.0157 23.79938
## Non-Hispanic White 168.3504 18.30355
## Non-Hispanic Black 202.3565 37.12593
## Other Race - Including Multi-Racial 156.9968 17.26037
# variance by gender
svyby(~BPXDI1, ~RIAGENDR, dfsub, svyvar, na.rm = TRUE)
## RIAGENDR BPXDI1 se
## Male 141.2704 10.52796
## Female 196.1199 20.65706
The svyglm function allows us to perform regression analyses while accounting for the survey’s design. In this section, we will demonstrate how to model diastolic blood pressure levels as a function of age and ethnicity.
svyglm
Using svyglm() is akin to using the glm() function from base R. The key difference is that svyglm() is specifically tailored to work with survey design objects, taking the complex sampling design and weights into account.
To model diastolic blood pressure (BPXDI1 in our dataset) as a function of age (RIDAGEYR) and ethnicity (RIDRETH1), we can construct our model as outlined below. The resulting output provides regression coefficients, standard errors, and significance tests that have been adjusted for the complex survey design. When interpreting these results, it’s important to remember that they resemble outputs from a generalized linear model (glm); however, the estimates provided by svyglm() have been weighted to produce nationally representative conclusions.
weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian())
summary(weighted_model)
## Call:
##
## svyglm(formula = BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub,
## family = gaussian())
##
## Survey design:
## subset(nhanesDesign, data$RIDAGEYR >= 40)
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 90.96516 1.24720 72.935
## RIDAGEYR -0.31522 0.01853 -17.012
## RIDRETH1Other Hispanic 1.43894 1.25518 1.146
## RIDRETH1Non-Hispanic White 0.49810 0.73132 0.681
## RIDRETH1Non-Hispanic Black 2.92332 1.09594 2.667
## RIDRETH1Other Race - Including Multi-Racial 1.38532 0.63780 2.172
## Pr(>|t|)
## (Intercept) 5.73e-15 ***
## RIDAGEYR 1.04e-08 ***
## RIDRETH1Other Hispanic 0.2783
## RIDRETH1Non-Hispanic White 0.5113
## RIDRETH1Non-Hispanic Black 0.0236 *
## RIDRETH1Other Race - Including Multi-Racial 0.0550 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 162.6653)
##
## Number of Fisher Scoring iterations: 2
Visualizing the differences between weighted and unweighted analyses can offer a clear, intuitive understanding of the impact of weights. We can use bar plots to compare mean diastolic blood pressures across different ethnicities. This plot could be extended to other metrics as well.
library(ggplot2)
## recalculating means (same as above)
unweighted_means <- sapply(split(datasub$BPXDI1, datasub$RIDRETH1), mean, na.rm=TRUE)
weighted_means <- as.numeric(adjmnsbyEth$BPXDI1)
plot_data <- data.frame(
Ethnicity = factor(rep(names(unweighted_means), 2),
levels = names(unweighted_means)),
Means = c(unweighted_means, weighted_means),
Type = factor(c(rep("Unweighted", length(unweighted_means)),
rep("Weighted", length(weighted_means))),
levels = c("Unweighted", "Weighted"))
)
# Creating the plot
ggplot(plot_data, aes(x = Ethnicity, y = Means, fill = Type)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
labs(title = "Comparison of Diastolic Blood Pressure by Ethnicity",
y = "Mean Diastolic Blood Pressure") +
scale_x_discrete(labels = c('Mex-Amer','Other Hisp','White', 'Black', 'Other')) +
scale_fill_manual(values = c("blue", "red")) +
theme_minimal() +
theme(legend.title = element_blank())
Visualizing regression coefficients can provide a clear understanding of how different variables impact the outcome, especially when using weighted and unweighted data.
unweighted_model <- glm(BPXDI1 ~ RIDAGEYR + RIDRETH1, data = datasub, family = gaussian())
weighted_model <- svyglm(BPXDI1 ~ RIDAGEYR + RIDRETH1, design = dfsub, family = gaussian())
# For the unweighted model
unweighted_summary <- summary(unweighted_model)
unweighted_coefs <- unweighted_summary$coefficients[, "Estimate"]
unweighted_se <- unweighted_summary$coefficients[, "Std. Error"]
# For the weighted model
weighted_summary <- summary(weighted_model)
weighted_coefs <- as.numeric(weighted_summary$coefficients[, "Estimate"])
weighted_se <- as.numeric(weighted_summary$coefficients[, "Std. Error"])
comparison <- data.frame(
Variable = names(unweighted_coefs),
Unweighted = unweighted_coefs,
Weighted = as.numeric(weighted_coefs)
)
print(comparison)
## Variable Unweighted Weighted
## 1 (Intercept) 92.2848149 90.9651590
## 2 RIDAGEYR -0.3460552 -0.3152158
## 3 RIDRETH1Other Hispanic 1.2325637 1.4389441
## 4 RIDRETH1Non-Hispanic White 0.7553912 0.4981020
## 5 RIDRETH1Non-Hispanic Black 4.2996735 2.9233165
## 6 RIDRETH1Other Race - Including Multi-Racial 2.0667520 1.3853204
unweighted_df <- data.frame(Variable = names(unweighted_coefs),
Estimate = unweighted_coefs,
SE = unweighted_se,
Type = "Unweighted")
weighted_df <- data.frame(Variable = names(unweighted_coefs),
Estimate = weighted_coefs,
SE = weighted_se,
Type = "Weighted")
plot_data <- rbind(unweighted_df, weighted_df)
ggplot(subset(plot_data, Variable!='(Intercept)'), aes(x = Estimate, y = reorder(Variable, Estimate), color = Type)) +
geom_point(position = position_dodge(0.5), size = 2.5) +
geom_errorbarh(aes(xmin = Estimate - SE, xmax = Estimate + SE),
height = 0.2, position = position_dodge(0.5)) +
geom_vline(xintercept = 0, linetype = "dashed", color = "grey50") +
labs(title = "Comparison of Regression Coefficients",
x = "Coefficient Value", y = "Predictors") +
theme_minimal() +
scale_color_manual(values = c("Unweighted" = "blue", "Weighted" = "red")) +
theme(legend.title = element_blank())