Trend Analysis of Complex Survey Data
The purpose of this analysis is to make statistically valid statements such as, “there was a significant linear decrease in the prevalence of high school aged americans who have ever smoked a cigarette across the period 1999-2011” with complex sample survey data.
This step-by-step walkthrough exactly reproduces the statistics presented in the Center for Disease Control & Prevention’s (CDC) linear trend analysis. This analysis may complement qualitative evaluation on prevalence changes observed from surveillance data by providing quantitative evidence, such as when a joinpoint (also called breakpoint or changepoint) occurred; however, this analysis does not explain why or how changes in trends occur.
Download, Import, Preparation
Download and import the multi-year stacked file:
library(SAScii)
library(readr)
<-
sas_url "https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2019/2019-SADC-SAS-Input-Program.sas"
<-
dat_url "https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2019/sadc_2019_national.dat"
<-
sas_positions parse.SAScii( sas_url )
'varname' ] <-
sas_positions[ , tolower( sas_positions[ , 'varname' ] )
<-
variables_to_keep c( "sex" , "grade" , "race4" , "q30" , "year" , "psu" , "stratum" , "weight" )
'column_types' ] <-
sas_positions[ , ifelse( !( sas_positions[ , 'varname' ] %in% variables_to_keep ) , "_" ,
ifelse( sas_positions[ , 'char' ] , "c" , "d" ) )
<-
yrbss_tbl read_fwf(
dat_url ,fwf_widths(
abs( sas_positions[ , 'width' ] ) ,
col_names = sas_positions[ , 'varname' ]
) ,col_types = paste0( sas_positions[ , 'column_types' ] , collapse = "" ) ,
na = c( "" , "." )
)
<- data.frame( yrbss_tbl ) yrbss_df
Restrict the dataset to only years shown in the original analysis and re-name the main variable:
<- subset( yrbss_df , year %in% seq( 1991 , 2011 , 2 ) )
yrbss_df
'ever_smoked' ] <-
yrbss_df[ , as.numeric( yrbss_df[ , 'q30' ] == 1 )
'q30' ] <- NULL yrbss_df[ ,
Recode each categorical variable to factor class:
'sex' ] <- relevel( factor( yrbss_df[ , 'sex' ] ) , ref = "2" )
yrbss_df[ ,
for ( i in c( 'race4' , 'grade' ) ){
<- relevel( factor( yrbss_df[ , i ] ) , ref = "1" )
yrbss_df[ , i ] }
Append Polynomials to Each Year
“The polynomials we have used as predictors to this point are natural polynomials, generated from the linear predictor by centering and then powering the linear predictor.”
For more detail on this subject, see page 216 of Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences By Jacob Cohen, Patricia Cohen, Stephen G. West, Leona S. Aiken
<- length( seq( 1991 , 2011 , 2 ) )
distinct_years_available
# store the linear polynomials
<- contr.poly( distinct_years_available )[ , ".L" ]
c11l
# store the quadratic polynomials
<- contr.poly( distinct_years_available )[ , ".Q" ]
c11q
# store the cubic polynomials
<- contr.poly( distinct_years_available )[ , ".C" ] c11c
For each record in the data set, tack on the linear, quadratic, and cubic contrast value, these contrast values will serve as replacement for the linear year
variable in any regression:
# year^1 term (linear)
"t11l" ] <- c11l[ match( yrbss_df[ , "year" ] , seq( 1991 , 2011 , 2 ) ) ]
yrbss_df[ ,
# year^2 term (quadratic)
"t11q" ] <- c11q[ match( yrbss_df[ , "year" ] , seq( 1991 , 2011 , 2 ) ) ]
yrbss_df[ ,
# year^3 term (cubic)
"t11c" ] <- c11c[ match( yrbss_df[ , "year" ] , seq( 1991 , 2011 , 2 ) ) ] yrbss_df[ ,
Unadjusted Analysis Examples
Construct a complex sample survey design and match the published unadjusted prevalence rates:
options( survey.lonely.psu = "adjust" )
library(survey)
<-
des svydesign(
id = ~psu ,
strata = ~interaction( stratum , year ) ,
data = yrbss_df ,
weights = ~weight ,
nest = TRUE
)
<-
prevalence_over_time svyby(
~ ever_smoked ,
~ year ,
des ,
svymean , na.rm = TRUE
)
# confirm prevalence rates match published estimates
# of high school students that ever smoked
stopifnot(
all.equal(
round( coef( prevalence_over_time ) , 3 ) ,
c( .701 , .695 , .713 , .702 , .704 , .639 , .584 , .543 , .503 , .463 , .447 ) ,
check.attributes = FALSE
) )
Calculate Joinpoints Needed
Using the orthogonal coefficients (linear, quadratic, cubic terms) that we previously added to our yrbss_df
object before constructing the multi-year stacked survey design, determine how many joinpoints will be needed for a trend analysis.
Epidemiological models typically control for possible confounding variables such as age, sex, and race/ethnicity, so those have been included alongside the linear, cubic, and quadratic year terms.
Calculate the “ever smoked” regression, adjusted by sex, grade, race/ethnicity, and linear year contrast:
<-
linyear svyglm(
~ sex + race4 + grade + t11l ,
ever_smoked design = des ,
family = quasibinomial
)
summary( linyear )
The linear year-contrast variable t11l
is significant. Therefore, there is probably going to be some sort of trend. A linear trend by itself does not need joinpoints. Not one, just zero joinpoints. If the linear term were the only significant term (out of linear, quadratic, cubic), then we would not need to calculate a joinpoint. In other words, we would not need to figure out where to best break our time trend into two, three, or even four segments.
Since the linear trend is significant, we know there is at least one change across the entire 1991 to 2011 period.
Interpretation note about segments of time: The linear term t11l
was significant, so we probably have a significant linear trend somewhere to report. Now we need to figure out when that significant linear trend started and when it ended. It might be semantically true that there was a significant linear decrease in high school aged smoking over the entire period of our data 1991-2011; however, it’s inexact to end this analysis after only detecting a linear trend. The purpose of the following few steps is to cordon off different time points from one another. As you’ll see later, there actually was not any detectable decrease from 1991-1999. The entirety of the decline in smoking occurred over the period from 1999-2011. So these next (methodologically tricky) steps serve to provide you and your audience with a more careful statement of statistical significance. It’s not technically wrong to conclude that smoking declined over the period of 1991-2011, it’s just verbose.
Think of it as the difference between “humans first walked on the moon in the sixties” and “humans first walked on the moon in 1969” - both statements are correct, but the latter exhibits greater scientific precision.
Calculate the “ever smoked” binomial regression, adjusted by sex, grade, race/ethnicity, and both linear and quadratic year contrasts. Notice the addition of t11q
:
<-
quadyear svyglm(
~ sex + race4 + grade + t11l + t11q ,
ever_smoked design = des ,
family = quasibinomial
)
summary( quadyear )
The linear year-contrast variable is significant but the quadratic year-contrast variable is also significant. Therefore, we should use joinpoint software (the segmented
package) for this analysis. A significant quadratic trend needs one joinpoint.
Since both linear and quadratic terms are significant, we can also move ahead and test whether the cubic term is also significant.
Calculate the “ever smoked” binomial regression, adjusted by sex, grade, race/ethnicity, and linear, quadratic, and cubic year contrasts. Notice the addition of t11c
:
<-
cubyear svyglm(
~ sex + race4 + grade + t11l + t11q + t11c ,
ever_smoked design = des ,
family = quasibinomial
)
summary( cubyear )
The cubic year-contrast term is also significant in this model. Therefore, we might potentially evaluate this trend using two joinpoints. In other words, a significant result for all linear, quadratic, and cubic year contrasts at this point means we might be able to evaluate three distinct trends (separated by our two joinpoints) across the broader 1991 - 2011 time period of analysis.
Although we might now have the statistical ability to analyze three distinct time periods (separated by two joinpoints) across our data, the utility of this depends on the circumstances. Cubic and higher polynomials account for not only the direction of change but also the pace of that change, allowing statistical statements that might not be of interest to an audience: While it might be an exercise in precision to conclude that smoking rates dropped quickest across 1999-2003 and less quickly across 2003-2011, that scientific pair of findings may not be as compelling as the simpler (quadratic but not cubic) statement that smoking rates have dropped across the period of 1999-2011.
Calculate Predicted Marginals
Calculate the survey-year-independent predictor effects and store these results:
<-
marginals svyglm(
formula = ever_smoked ~ sex + race4 + grade ,
design = des ,
family = quasibinomial
)
Run these marginals through the svypredmeans
function. For archaeology fans out there, this function emulates the PREDMARG
statement in the ancient language of SUDAAN:
<- svypredmeans( marginals , ~factor( year ) ) ) ( means_for_joinpoint
Clean up these results a bit in preparation for a joinpoint analysis:
# coerce the results to a data.frame object
<- as.data.frame( means_for_joinpoint )
means_for_joinpoint
# extract the row names as the survey year
"year" ] <- as.numeric( rownames( means_for_joinpoint ) )
means_for_joinpoint[ ,
# must be sorted, just in case it's not already
<- means_for_joinpoint[ order( means_for_joinpoint[ , "year" ] ) , ] means_for_joinpoint
Identify Joinpoint(s) or Breakpoint(s)
Let’s take a look at how confident we are in the value at each adjusted timepoint. Carrying out a trend analysis requires creating new weights to fit a piecewise linear regression. First, create that weight variable:
"wgt" ] <- with( means_for_joinpoint, ( mean / SE ) ^ 2 ) means_for_joinpoint[ ,
Second, fit a piecewise linear regression, estimating the ‘starting’ linear model with the usual lm
function using the log values and the weights:
<- lm( log( mean ) ~ year , weights = wgt , data = means_for_joinpoint ) o
Now that the regression has been structured correctly, estimate the year that our complex survey trend should be broken into two or more segments:
library(segmented)
# find only one joinpoint
<- segmented( o , ~year )
os
summary( os )
Look for the Estimated Break-Point(s)
in that result - that’s the critical number from this joinpoint analysis. The segmented
package uses an iterative procedure (described in the article below); between-year solutions are returned and should be rounded to the nearest time point in the analysis. The joinpoint software implements two estimating algorithms: the grid-search and the Hudson algorithm. For more detail about these methods, see Muggeo V. (2003) Estimating regression models with unknown break-points. Statistics in Medicine, 22: 3055-3071..
Obtain the annual percent change estimates for each time point:
slope( os , APC = TRUE )
The confidence intervals for the annual percent change (APC) may be different from the ones returned by NCI’s Joinpoint Software; for further details, check out Muggeo V. (2010) A Comment on `Estimating average annual per cent change in trend analysis’ by Clegg et al., Statistics in Medicine; 28, 3670-3682. Statistics in Medicine, 29, 1958-1960.
This analysis returned similar results to the NCI’s Joinpoint Regression Program by estimating a joinpoint at year=1999
- and, more precisely, that the start of that decreasing trend in smoking prevalence happened at an APC of -3.92 percent. That is, slope2
from the output above.
Remember that the cubic-year model above had significant terms as well. Therefore, it would be statistically defensible to calculate two joinpoints rather than only one. However, for this analyses, breaking the 1999-2011 trend into two separate downward trends might not be of interest to the audience. Looking at the slope2
and slope3
estimates and confidence intervals, we might be able to conclude that “ever smoking” decreased across 1999-2003 and also decreased (albeit less rapidly) across 2003-2011. However, communicating two consecutive downward trends might not be of much interest to a lay audience. Forgoing a second possible joinpoint makes sense when the direction of change is more compelling than the pace of change:
# find two joinpoints rather than only one
<- segmented( o , ~year , npsi = 2 )
os2
summary( os2 )
slope( os2 , APC = TRUE )
Interpret and Conclude
After identifying the joinpoint for smoking prevalence, we can create two regression models (one for each time segment - if we had two joinpoints, we would need three regression models). The first model covers the years leading up to (and including) the joinpoint (i.e., 1991 to 1999). The second model includes the years from the joinpoint forward (i.e., 1999 to 2011). So start with 1991, 1993, 1995, 1997, 1999, the five year-points before (and including) 1999:
# calculate a five-timepoint linear contrast vector
<- contr.poly( 5 )[ , 1 ]
c5l
# tack the five-timepoint linear contrast vectors onto the current survey design object
<- update( des , t5l = c5l[ match( year , seq( 1991 , 1999 , 2 ) ) ] )
des
<-
pre_91_99 svyglm(
~ sex + race4 + grade + t5l ,
ever_smoked design = subset( des , year <= 1999 ) ,
family = quasibinomial
)
summary( pre_91_99 )
# confirm 1991-1999 trend coefficient matches published estimates
stopifnot( round( pre_91_99$coefficients['t5l'] , 5 ) == .03704 )
This reproduces the calculations behind the sentence on pdf page 6 of the original document: In this example, T5L_L had a p-value=0.52261 and beta=0.03704. Therefore, there was “no significant change in the prevalence of ever smoking a cigarette during 1991-1999.”
Then move on to 1999, 2001, 2003, 2005, 2007, 2009, and 2011, the seven year-points after (and including) 1999:
# calculate a seven-timepoint linear contrast vector
<- contr.poly( 7 )[ , 1 ]
c7l
# tack the seven-timepoint linear contrast vectors onto the current survey design object
<- update( des , t7l = c7l[ match( year , seq( 1999 , 2011 , 2 ) ) ] )
des
<-
post_99_11 svyglm(
~ sex + race4 + grade + t7l ,
ever_smoked design = subset( des , year >= 1999 ) ,
family = quasibinomial
)
summary( post_99_11 )
# confirm 1999-2011 trend coefficient matches published estimates
stopifnot( round( post_99_11$coefficients['t7l'] , 5 ) == -0.99165 )
This reproduces the calculations behind the sentence on pdf page 6 of the original document: In this example, T7L_R had a p-value<0.0001 and beta=-0.99165. Therefore, there was a “significant linear decrease in the prevalence of ever smoking a cigarette during 1999-2011.”