Youth Risk Behavior Surveillance System (YRBSS)
The high school edition of the Behavioral Risk Factor Surveillance System (BRFSS).
One table with one row per sampled youth respondent.
A complex sample survey designed to generalize to all public and private school students in grades 9-12 in the United States.
Released biennially since 1993.
Administered by the Centers for Disease Control and Prevention.
Please skim before you begin:
This human-composed haiku or a bouquet of artificial intelligence-generated limericks
# maladolescence
# epidemiology
# sex, drugs, rock and roll
Download, Import, Preparation
Load the SAScii library to interpret a SAS input program, and also re-arrange the SAS input program:
library(SAScii)
sas_url <-
    "https://www.cdc.gov/healthyyouth/data/yrbs/files/2019/2019XXH-SAS-Input-Program.sas"
sas_text <- tolower( readLines( sas_url ) )
# find the (out of numerical order)
# `site` location variable's position
# within the SAS input program
site_location <- which( sas_text == '@1 site $3.' )
# find the start field's position
# within the SAS input program
input_location <- which( sas_text == "input" )
# create a vector from 1 to the length of the text file
sas_length <- seq( length( sas_text ) )
# remove the site_location
sas_length <- sas_length[ -site_location ]
# re-insert the site variable's location
# immediately after the starting position
sas_reorder <-
    c(
        sas_length[ seq( input_location ) ] ,
        site_location ,
        sas_length[ seq( input_location + 1 , length( sas_length ) ) ]
    )
# re-order the sas text file
sas_text <- sas_text[ sas_reorder ]
sas_tf <- tempfile()
writeLines( sas_text , sas_tf )
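The re-ordering steps above can be checked on a toy input program. This is a minimal sketch in base R, assuming a made-up six-line SAS program in which every line other than the `site` line and the `input` keyword is hypothetical. It applies the same index arithmetic as the code above.

```r
# hypothetical, already-lowercased SAS input program
toy_text <-
    c(
        "data toy;" ,
        "input" ,
        "@4 q1 $1." ,
        "@5 q2 $1." ,
        "@1 site $3." ,
        ";"
    )

# locate the out-of-order `site` line and the `input` keyword
toy_site <- which( toy_text == '@1 site $3.' )
toy_input <- which( toy_text == "input" )

# every line number except the `site` line
toy_length <- seq( length( toy_text ) )[ -toy_site ]

# place the `site` line immediately after `input`
toy_reorder <-
    c(
        toy_length[ seq( toy_input ) ] ,
        toy_site ,
        toy_length[ seq( toy_input + 1 , length( toy_length ) ) ]
    )

# the `site` line now follows `input` directly
toy_text[ toy_reorder ]
```

The arithmetic relies on the `site` line appearing after the `input` keyword in the original file, as it does in this toy program.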
Download and import the national file:
dat_tf <- tempfile()
dat_url <-
    "https://www.cdc.gov/healthyyouth/data/yrbs/files/2019/XXH2019_YRBS_Data.dat"
download.file( dat_url , dat_tf , mode = 'wb' )
yrbss_df <- read.SAScii( dat_tf , sas_tf )
names( yrbss_df ) <- tolower( names( yrbss_df ) )
yrbss_df[ , 'one' ] <- 1
Save locally
Save the object at any point:
# yrbss_fn <- file.path( path.expand( "~" ) , "YRBSS" , "this_file.rds" )
# saveRDS( yrbss_df , file = yrbss_fn , compress = FALSE )
Load the same object:
# yrbss_df <- readRDS( yrbss_fn )
Survey Design Definition
Construct a complex sample survey design:
library(survey)
yrbss_design <-
    svydesign(
        ~ psu ,
        strata = ~ stratum ,
        data = yrbss_df ,
        weights = ~ weight ,
        nest = TRUE
    )
Variable Recoding
Add new columns to the data set:
yrbss_design <-
    update(
        yrbss_design ,
        q2 = q2 ,
        never_rarely_wore_seat_belt = as.numeric( qn8 == 1 ) ,
        ever_used_marijuana = as.numeric( qn45 == 1 ) ,
        tried_to_quit_tobacco_past_year = as.numeric( q39 == 2 ) ,
        used_tobacco_past_year = as.numeric( q39 > 1 )
    )
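A short base R sketch of how comparison-based recodes like those above treat missing values. The response codes here are made up for illustration. The point is only that `as.numeric( x == 1 )` yields 1, 0, or NA, so a skipped question stays missing until `na.rm = TRUE` drops it.

```r
# hypothetical responses: 1 = yes, 2 = no, NA = skipped
toy_qn <- c( 1 , 2 , NA , 1 )

# the same recode pattern used for the seat belt and marijuana flags
toy_flag <- as.numeric( toy_qn == 1 )
toy_flag

# share of yes responses among records with a non-missing answer
mean( toy_flag , na.rm = TRUE )
```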
Analysis Examples with the survey library
Unweighted Counts
Count the unweighted number of records in the survey sample, overall and by groups:
sum( weights( yrbss_design , "sampling" ) != 0 )
svyby( ~ one , ~ ever_used_marijuana , yrbss_design , unwtd.count )
Weighted Counts
Count the weighted size of the generalizable population, overall and by groups:
svytotal( ~ one , yrbss_design )
svyby( ~ one , ~ ever_used_marijuana , yrbss_design , svytotal )
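The difference between the two kinds of counts can be sketched without the survey package. The four records and their weights below are hypothetical. An unweighted count is the number of records, while a weighted count is the sum of the sampling weights.

```r
# hypothetical records with made-up sampling weights
toy <-
    data.frame(
        weight = c( 100 , 250 , 50 , 600 ) ,
        ever_used = c( 1 , 0 , 1 , 0 )
    )

# unweighted counts: records in the sample, overall and by group
nrow( toy )
table( toy$ever_used )

# weighted counts: sum of weights, overall and by group
sum( toy$weight )
tapply( toy$weight , toy$ever_used , sum )
```

This reproduces only the point estimates. The standard errors reported by `svytotal` depend on the strata and clusters in the design object.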
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
svymean( ~ bmipct , yrbss_design , na.rm = TRUE )
svyby( ~ bmipct , ~ ever_used_marijuana , yrbss_design , svymean , na.rm = TRUE )
Calculate the distribution of a categorical variable, overall and by groups:
svymean( ~ q2 , yrbss_design , na.rm = TRUE )
svyby( ~ q2 , ~ ever_used_marijuana , yrbss_design , svymean , na.rm = TRUE )
Calculate the sum of a linear variable, overall and by groups:
svytotal( ~ bmipct , yrbss_design , na.rm = TRUE )
svyby( ~ bmipct , ~ ever_used_marijuana , yrbss_design , svytotal , na.rm = TRUE )
Calculate the weighted sum of a categorical variable, overall and by groups:
svytotal( ~ q2 , yrbss_design , na.rm = TRUE )
svyby( ~ q2 , ~ ever_used_marijuana , yrbss_design , svytotal , na.rm = TRUE )
Calculate the median (50th percentile) of a linear variable, overall and by groups:
svyquantile( ~ bmipct , yrbss_design , 0.5 , na.rm = TRUE )
svyby(
~ bmipct ,
~ ever_used_marijuana ,
yrbss_design ,
svyquantile , 0.5 ,
ci = TRUE , na.rm = TRUE
)
Estimate a ratio:
svyratio(
numerator = ~ tried_to_quit_tobacco_past_year ,
denominator = ~ used_tobacco_past_year ,
yrbss_design ,
na.rm = TRUE
)
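The point estimate of a ratio is the weighted total of the numerator divided by the weighted total of the denominator. A minimal base R sketch with hypothetical weights and flags:

```r
# hypothetical sampling weights and 0/1 flags for four respondents
toy_w <- c( 10 , 20 , 30 , 40 )
toy_used <- c( 1 , 1 , 0 , 1 )
toy_tried <- c( 1 , 0 , 0 , 1 )

# weighted total of the numerator over weighted total of the denominator
toy_ratio <- sum( toy_w * toy_tried ) / sum( toy_w * toy_used )
toy_ratio
```

The standard error of the ratio is not reproduced here, since it requires the full survey design.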
Subsetting
Restrict the survey design to youths who ever drank alcohol:
sub_yrbss_design <- subset( yrbss_design , qn40 > 1 )
Calculate the mean (average) of this subset:
svymean( ~ bmipct , sub_yrbss_design , na.rm = TRUE )
Measures of Uncertainty
Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:
this_result <- svymean( ~ bmipct , yrbss_design , na.rm = TRUE )
coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )
grouped_result <-
    svyby(
        ~ bmipct ,
        ~ ever_used_marijuana ,
        yrbss_design ,
        svymean ,
        na.rm = TRUE
    )
coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
Calculate the degrees of freedom of any survey design object:
degf( yrbss_design )
Calculate the complex sample survey-adjusted variance of any statistic:
svyvar( ~ bmipct , yrbss_design , na.rm = TRUE )
Include the complex sample design effect in the result for a specific statistic:
# SRS without replacement
svymean( ~ bmipct , yrbss_design , na.rm = TRUE , deff = TRUE )
# SRS with replacement
svymean( ~ bmipct , yrbss_design , na.rm = TRUE , deff = "replace" )
Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See ?svyciprop for alternatives:
svyciprop( ~ never_rarely_wore_seat_belt , yrbss_design ,
method = "likelihood" , na.rm = TRUE )
Regression Models and Tests of Association
Perform a design-based t-test:
svyttest( bmipct ~ never_rarely_wore_seat_belt , yrbss_design )
Perform a chi-squared test of association for survey data:
svychisq(
~ never_rarely_wore_seat_belt + q2 ,
yrbss_design )
Perform a survey-weighted generalized linear model:
glm_result <-
    svyglm(
        bmipct ~ never_rarely_wore_seat_belt + q2 ,
        yrbss_design
    )
summary( glm_result )
Replication Example
This example matches statistics, standard errors, and confidence intervals from the “never/rarely wore seat belt” row of PDF page 29 of this CDC analysis software document:
unwtd_count_result <-
    unwtd.count( ~ never_rarely_wore_seat_belt , yrbss_design )
stopifnot( coef( unwtd_count_result ) == 11149 )
wtd_n_result <-
    svytotal(
        ~ one ,
        subset(
            yrbss_design ,
            !is.na( never_rarely_wore_seat_belt )
        )
    )
stopifnot( round( coef( wtd_n_result ) , 0 ) == 12132 )
share_result <-
    svymean(
        ~ never_rarely_wore_seat_belt ,
        yrbss_design ,
        na.rm = TRUE
    )
stopifnot( round( coef( share_result ) , 4 ) == .0654 )
stopifnot( round( SE( share_result ) , 4 ) == .0065 )
ci_result <-
    svyciprop(
        ~ never_rarely_wore_seat_belt ,
        yrbss_design ,
        na.rm = TRUE ,
        method = "beta"
    )
stopifnot( round( confint( ci_result )[1] , 4 ) == 0.0529 )
stopifnot( round( confint( ci_result )[2] , 2 ) == 0.08 )
Analysis Examples with srvyr
The R srvyr library calculates summary statistics from survey data, such as the mean, total, or quantile, using dplyr-like syntax. srvyr allows for the use of many verbs, such as summarize, group_by, and mutate, along with the convenience of pipe-able functions, the tidyverse style of non-standard evaluation, and more consistent return types than the survey package. This vignette details the available features. As a starting point for YRBSS users, this code replicates previously-presented examples:
library(srvyr)
yrbss_srvyr_design <- as_survey( yrbss_design )
Calculate the mean (average) of a linear variable, overall and by groups:
yrbss_srvyr_design %>%
    summarize( mean = survey_mean( bmipct , na.rm = TRUE ) )
yrbss_srvyr_design %>%
    group_by( ever_used_marijuana ) %>%
    summarize( mean = survey_mean( bmipct , na.rm = TRUE ) )