# Statistically Significant Trends with Multiple Years of Complex Survey Data

*Contributed by Thomas Yokota <thomasyokota@gmail.com>*

Professor Vito Muggeo of the University of Palermo wrote the joinpoint analysis section of the code below to demonstrate that the `segmented` package eliminates the need for external (registration-only, Windows-only, workflow-disrupting) software. `survey` package creator Professor Thomas Lumley wrote the `svypredmeans` function to replicate SUDAAN's `PREDMARG` statement and match the CDC's results to the decimal. Dr. Richard Lowry, M.D. at the Centers for Disease Control & Prevention wrote the original linear trend analysis and then answered our infinite questions. Thanks to everyone.

The purpose of this analysis is to make statistically valid statements with complex sample survey data, such as, *"there was a significant linear decrease in the prevalence of high school-aged Americans who have ever smoked a cigarette across the period 1999-2011."*

This step-by-step walkthrough reproduces the statistics presented in the Centers for Disease Control and Prevention's (CDC) linear trend analysis, using free and open source methods rather than proprietary or restricted software.

The example below displays only linearized designs (created with the `svydesign` function). For more detail about how to reproduce this analysis with a replicate-weighted design (created with the `svrepdesign` function), see the methods note following section #4 below.

## Data Importation

Prior to running this analysis script, the Youth Risk Behavior Surveillance System (YRBSS) 2005-2015 single-year files must all be loaded as R data files (.rds) on your local machine. Running the download automation script below will create the appropriate files. If you need assistance with the data-loading step, first review the main YRBSS blog post.

```
options( survey.lonely.psu = "adjust" )
library(survey)
library(lodown)
# retrieve a listing of all available extracts for the youth risk behavior surveillance system
yrbss_cat <- get_catalog( "yrbss" , output_dir = file.path( path.expand( "~" ) , "YRBSS" ) )
# limit the catalog to only years 2005-2015
yrbss_cat <- subset( yrbss_cat , year %in% seq( 2005 , 2015 , 2 ) )
# download the yrbss microdata
lodown( "yrbss" , yrbss_cat )
```

## Load Required Packages, Options, External Functions

```
# install.packages( c( "segmented" , "ggplot2" , "ggthemes" , "texreg" ) )
# Muggeo V. (2008) Segmented: an R package to fit regression models with broken-line relationships. R News, 8, 1: 20-25.
library(segmented)
library(ggplot2)
library(ggthemes)
library(texreg)
```

## Harmonize and Stack Multiple Years of Survey Data

This step is clearly dataset-specific. In order for your trend analysis to work, you'll need to figure out how to align the variables from multiple years of data into a trendable, stacked `data.frame` object.

```
# initiate an empty `y` object
y <- NULL
# loop through each year of YRBSS microdata
for ( year in seq( 2005 , 2015 , 2 ) ){
# load the current year
x <- readRDS(
file.path( path.expand( "~" ) , "YRBSS" ,
paste0( year , " main.rds" ) ) )
# tack on a `year` column
x$year <- year
# the 2005 file stores race-ethnicity in `q4` , so initialize an empty `raceeth` column
if( year == 2005 ) x$raceeth <- NA
# stack that year of data alongside the others,
# ignoring mis-matching columns
y <- rbind( x[ c( "q2" , "q3" , "q4" , "qn10" , "year" , "psu" , "stratum" , "weight" , "raceeth" ) ] , y )
# clear the single-year of microdata from RAM
rm( x )
}
# convert every column to numeric type
y[ , ] <- sapply( y[ , ] , as.numeric )
# construct year-specific recodes so that
# "rode with a drinking driver" // grade // sex // race-ethnicity align across years
y <-
transform(
y ,
rode_with_drunk_driver = qn10 ,
raceeth =
ifelse( year == 2005 ,
ifelse( q4 %in% 6 , 1 ,
ifelse( q4 %in% 3 , 2 ,
ifelse( q4 %in% c( 4 , 7 ) , 3 ,
ifelse( q4 %in% c( 1 , 2 , 5 , 8 ) , 4 , NA ) ) ) ) ,
ifelse( raceeth %in% 5 , 1 ,
ifelse( raceeth %in% 3 , 2 ,
ifelse( raceeth %in% c( 6 , 7 ) , 3 ,
ifelse( raceeth %in% c( 1 , 2 , 4 , 8 ) , 4 , NA ) ) ) ) ) ,
grade = ifelse( q3 == 5 , NA , as.numeric( q3 ) ) ,
sex = ifelse( q2 %in% 1:2 , q2 , NA )
)
# again remove unnecessary variables, keeping only the complex sample survey design columns
# plus independent/dependent variables to be used in the regression analyses
y <- y[ c( "year" , "psu" , "stratum" , "weight" , "rode_with_drunk_driver" , "raceeth" , "sex" , "grade" ) ]
# set female to the reference group
y$sex <- relevel( factor( y$sex ) , ref = "2" )
# set rode with drunk driver=yes // white // 9th graders as the reference groups
for ( i in c( 'rode_with_drunk_driver' , 'raceeth' , 'grade' ) ) y[ , i ] <- relevel( factor( y[ , i ] ) , ref = "1" )
```

## Construct a Multi-Year Stacked Complex Survey Design Object

Before constructing a multi-year stacked design object, check out `?contr.poly` - this function implements the orthogonal polynomials used in our trend analysis during step #6. For more detail on this subject, see page 216 of *Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences* by Jacob Cohen, Patricia Cohen, Stephen G. West, and Leona S. Aiken: *"The polynomials we have used as predictors to this point are natural polynomials, generated from the linear predictor by centering and then powering the linear predictor."*

```
# extract a linear contrast vector of length six,
# because we have six distinct years of yrbss data `seq( 2005 , 2015 , 2 )`
c6l <- contr.poly( 6 )[ , 1 ]
# also extract a quadratic (squared) contrast vector
c6q <- contr.poly( 6 )[ , 2 ]
# just in case, extract a cubic contrast vector
c6c <- contr.poly( 6 )[ , 3 ]
# for each record in the data set, tack on the linear, quadratic, and cubic contrast value
# these contrast values will serve as replacement for the linear `year` variable in any regression.
# year^1 term (linear)
y$t6l <- c6l[ match( y$year , seq( 2005 , 2015 , 2 ) ) ]
# year^2 term (quadratic)
y$t6q <- c6q[ match( y$year , seq( 2005 , 2015 , 2 ) ) ]
# year^3 term (cubic)
y$t6c <- c6c[ match( y$year , seq( 2005 , 2015 , 2 ) ) ]
# construct a complex sample survey design object
# stacking multiple years and accounting for `year` in the nested strata
des <-
svydesign(
id = ~psu ,
strata = ~interaction( stratum , year ) ,
data = y ,
weights = ~weight ,
nest = TRUE
)
```

Now we’ve got a multi-year stack of complex survey designs with linear, quadratic, and cubic contrast values appended. If you’d like more detail about stacking multiple years of complex survey data, review the CDC’s manual on the topic. Hopefully we won’t need anything beyond cubic, but let’s find out.
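As a quick sanity check (not part of the original analysis), the columns returned by `contr.poly` are centered and mutually orthonormal - that's what lets the linear, quadratic, and cubic year terms be tested independently of one another in the regressions below:

```
# each orthogonal polynomial contrast column sums to (numerically) zero
cp <- contr.poly( 6 )
round( colSums( cp ) , 10 )
# and the columns are orthonormal: t( cp ) %*% cp is the identity matrix
round( t( cp ) %*% cp , 10 )
```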

**Methods note** about how to stack replication designs: This is only relevant if you are trying to create a `des` object like the one above but only have replicate weights and do not have the clustering information (psu). It is straightforward to construct a replication design *from* a linearized design (see `as.svrepdesign`). However, for privacy reasons, going in the opposite direction is much more challenging. Therefore, you'll need to do some dataset-specific homework on how to best *stack* multiple years of a replicate-weighted design before you construct a multiple-year-stacked survey design like the object above.

If you'd like to experiment with how the two approaches differ (theoretically, very little), these publicly-available survey data sets include both replicate weights and, separately, clustering information:

- Medical Expenditure Panel Survey
- National Health and Nutrition Examination Survey
- Consumer Expenditure Survey

In most cases, omitting the `year` variable from the `strata = ~interaction( stratum , year )` construction of `des` above will make your standard errors larger (conservative), so you can probably just `rbind( file_with_repweights_year_one , file_with_repweights_year_two , ... )` so long as the survey design has not changed in structure over the time period that you are analyzing. Once you have the rbound replicate-weights object for every year, you could construct one huge multi-year `svrepdesign` object. Make sure you include `scale`, `rscales`, `rho`, and whatever else the `svrepdesign()` call asks for. If you are worried you missed something, check `attributes( your_single_year_replication_design_object )`. This solution is likely to be a decent approach in most cases.
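Here's a minimal sketch of that approach, commented out like the fake-psu example below because every name in it is a placeholder: it assumes two single-year files whose replicate weight columns happen to be named `repw1` through `repw80` and whose original designs shared a JK1 jackknife structure. Check your own files' `attributes()` for the real `type`, `scale`, and `rscales` values.

```
# # stack the single-year files, which must share an identical column layout
# stacked <- rbind( file_with_repweights_year_one , file_with_repweights_year_two )
# # construct one multi-year replication design;
# # copy `type` , `scale` , and `rscales` from
# # attributes( your_single_year_replication_design_object )
# des_rep <-
# 	svrepdesign(
# 		weights = ~ weight ,
# 		repweights = "repw[0-9]+" ,  # regular expression matching the replicate columns
# 		type = "JK1" ,
# 		data = stacked ,
# 		combined.weights = TRUE
# 	)
```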

If you need to be very conservative with your computation of trend statistical significance, you might attempt to re-construct fake clusters for yourself using a regression. Search for "malicious" in this confidentiality explanation document. The purpose here, though, isn't to identify individual respondents in the dataset; it's to obtain a variable like `psu` above that gives you reasonable standard errors. Look for the object `your.replicate.weights` in that script. You could reconstruct a fake psu for each record in your data set with something as easy as:

```
# # fake_psu should be a one-record-per-person vector object
# # that can immediately be appended onto your data set.
# # note the `$cluster` element: `kmeans()` itself returns a list, not a vector
# fake_psu <- kmeans( your.replicate.weights , 20 )$cluster
```

..where 20 is the (completely made up) number of clusters x strata. Hopefully the methodology documents (or the people who wrote them) will at least tell you how many clusters there were in the original sample, even if the clusters themselves were not disclosed. Once you've made fake clusters, they will surely be worse than the real clusters (i.e. conservative standard errors) and you can construct a multiple-year survey design with:

`# des <- svydesign( id = ~ your_fake_psus , strata = ~ year , data = y , weights = ~ weight , nest = TRUE )`

This approach will probably be conservative.

## Review the unadjusted results

Here's the change over time in the prevalence of riding with a drinking driver among youth. Unadjusted prevalence rates (Figure 1) suggest a significant change over the study period.

```
# immediately remove records where the dependent variable is missing
des_ns <- subset( des , !is.na( rode_with_drunk_driver ) )
# calculate unadjusted, un-anythinged prevalence rates by year
# note that this parallels the unadjusted "ever smoked" statistics at the top of
# pdf page 6 of http://www.cdc.gov/healthyyouth/yrbs/pdf/yrbs_conducting_trend_analyses.pdf
unadjusted <- svyby( ~ rode_with_drunk_driver , ~ year , svymean , design = des_ns , vartype = c( 'ci' , 'se' ) )
# coerce that result into a `data.frame` object
my_plot <- data.frame( unadjusted )
# plot the unadjusted decline in smoking
ggplot( my_plot , aes( x = year, y = rode_with_drunk_driver1 ) ) +
geom_point() +
geom_errorbar( aes( ymax = ci_u.rode_with_drunk_driver1 , ymin = ci_l.rode_with_drunk_driver1 ) , width = .2 ) +
geom_line() +
theme_tufte() +
ggtitle( "Figure 1. Unadjusted prevalence of riding with a drinking driver, 2005-2015" ) +
theme( plot.title = element_text( size = 9 , face = "bold" ) )
```

## Calculate the Number of Joinpoints Needed

Using the orthogonal coefficients (linear, quadratic, and cubic terms) that we previously added to our `data.frame` object before constructing the multi-year stacked survey design, let's now determine how many joinpoints will be needed for a trend analysis.

Epidemiological models typically control for possible confounding variables such as sex, race-ethnicity, and grade, so let's add them in alongside the linear, quadratic, and cubic year terms.

Calculate the riding-with-a-drinking-driver binomial regression, adjusted by sex, race-ethnicity, grade, and a linear year contrast.

```
linyear <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t6l ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 ) ,
family = quasibinomial
)
summary( linyear )
```

The linear year-contrast variable `t6l` is hugely significant here, so there is probably some sort of trend to report. A linear trend needs zero joinpoints. If the linear term were the only significant term (out of linear, quadratic, cubic, etc.), then we would not need to calculate a joinpoint at all - that is, we would not need to figure out where to best break our time trend into two, three, or even four segments.

The linear trend is significant, so we should keep going.

**Interpretation note** about segments of time: The linear term `t6l` was significant, so we probably have a significant linear trend somewhere to report. Now we need to figure out when that significant linear trend started and when it ended. In the CDC's smoking example, it might be semantically true that there was a significant linear decrease in high school-aged smoking over the entire 1991-2011 period; however, it is inexact and unrefined to give up after only detecting a linear trend. The purpose of the following few steps is really to *cordon off* different time points from one another. As you'll see later, there actually was not any detectable decrease from 1991-1999; the entirety of the decline in smoking occurred over the period from 1999-2011. So these next (methodologically tricky) steps serve to provide you and your audience with a more careful statement of statistical significance. It's not technically wrong to conclude that smoking declined over the period 1991-2011, it's just imprecise.

Think of it as the difference between “humans first walked on the moon in the sixties” and “humans first walked on the moon in 1969” - both statements are correct, but the latter exhibits greater scientific precision.

Calculate the same binomial regression, now adjusted by sex, race-ethnicity, grade, and both linear and quadratic year contrasts. Notice the addition of `t6q`.

```
quadyear <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t6l + t6q ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 ) ,
family = quasibinomial
)
summary( quadyear )
```

The linear year-contrast variable remains hugely significant here, and the quadratic year-contrast variable is also significant. Therefore, we should use joinpoint methods for this analysis: a significant quadratic trend needs one joinpoint.

Since both the linear and quadratic terms are significant, we should move ahead and test whether the cubic term is also significant.

Calculate the same binomial regression, adjusted by sex, race-ethnicity, grade, and linear, quadratic, and cubic year contrasts. Notice the addition of `t6c`.

```
cubyear <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t6l + t6q + t6c ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 ) ,
family = quasibinomial
)
summary( cubyear )
```

The cubic year-contrast term is *not* significant in this model, so we can stop testing the shape of this line. In other words, we can stop at a quadratic trend and *do not* need a cubic trend - meaning a single joinpoint suffices. Remember: a linear trend requires zero joinpoints, a quadratic trend requires one, a cubic trend requires two, and so on.

Note: if the cubic trend *were* significant, then we would increase the number of joinpoints to *two* instead of *one*; since the cubic term is not significant, we should stop with the previous regression. If higher-order terms continued to be significant, we would keep testing: `year^4` requires three joinpoints, `year^5` requires four joinpoints, and so on, returning to step #4 to add additional `year^n` terms to the model.
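That counting rule can be written down explicitly. The tiny helper below is not part of the original analysis, just a restatement of the pattern: the highest significant polynomial degree implies one fewer joinpoint.

```
# highest significant degree: 1 = linear , 2 = quadratic , 3 = cubic , ...
# the number of joinpoints needed is always one less than that degree
joinpoints_needed <- function( highest_significant_degree ) highest_significant_degree - 1
joinpoints_needed( 2 )  # quadratic trend: one joinpoint
```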

```
htmlreg(list(linyear , quadyear , cubyear), doctype = F, html.tag = F, inline.css = T,
head.tag = F, body.tag = F, center = F, single.row = T, caption = "Table 1. Testing for linear trends")
```

## Calculate the Adjusted Prevalence and Predicted Marginals

First, calculate the survey-year-independent predictor effects and store these results into a separate object.

```
marginals <-
svyglm(
formula = I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade ,
design = des_ns ,
family = quasibinomial
)
```

Second, run these marginals through the `svypredmeans` function written by Dr. Thomas Lumley. For any archaeology fans out there, this function emulates the `PREDMARG` statement in the ancient language of SUDAAN.

`( means_for_joinpoint <- svypredmeans( marginals , ~factor( year ) ) )`

Finally, clean up these results a bit in preparation for a joinpoint analysis.

```
# coerce the results to a data.frame object
means_for_joinpoint <- as.data.frame( means_for_joinpoint )
# extract the row names as the survey year
means_for_joinpoint$year <- as.numeric( rownames( means_for_joinpoint ) )
# must be sorted, just in case it's not already
means_for_joinpoint <- means_for_joinpoint[ order( means_for_joinpoint$year ) , ]
# rename columns so they do not conflict with variables in memory
names( means_for_joinpoint ) <- c( 'mean' , 'se' , 'yr' )
# the above line is only because the ?segmented function (used below)
# does not work if an object of the same name is also in memory.
another_plot <- means_for_joinpoint
another_plot$ci_l.mean <- another_plot$mean - (1.96 * another_plot$se)
another_plot$ci_u.mean <- another_plot$mean + (1.96 * another_plot$se)
ggplot(another_plot, aes(x = yr, y = mean)) +
geom_point() +
geom_errorbar(aes(ymax = ci_u.mean, ymin = ci_l.mean), width=.2) +
geom_line() +
theme_tufte() +
ggtitle("Figure 2. Adjusted prevalence of riding with a drinking driver, 2005-2015") +
theme(plot.title = element_text(size=9, face="bold"))
```

## Identify the Breakpoint/Changepoint

The original CDC analysis recommended some external software from the National Cancer Institute, which only runs on selected platforms. Dr. Vito Muggeo wrote this within-R solution using his segmented package available on CRAN. Let’s take a look at how confident we are in the value at each adjusted timepoint. Carrying out a trend analysis requires creating new weights to fit a piecewise linear regression. Figure 3 shows the relationship between variance at each datum and weighting. Larger circles display greater uncertainty and therefore lower weight.

```
ggplot( means_for_joinpoint , aes( x = yr , y = mean ) ) +
geom_point( aes( size = se ) ) +
theme_tufte() +
ggtitle( "Figure 3. Standard Error at each timepoint\n(smaller dots indicate greater confidence in each adjusted value)"
)
```

First, create that weight variable.

`means_for_joinpoint$wgt <- with( means_for_joinpoint, ( mean / se ) ^ 2 ) `
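Why `( mean / se ) ^ 2`? The piecewise model below is fit on `log( mean )`, and by the delta method the variance of `log( mean )` is approximately `( se / mean ) ^ 2` - the weight is simply the inverse of that variance. A quick simulation (with made-up numbers, not YRBSS values) illustrates the approximation:

```
# simulate many draws of an estimated prevalence around 0.70 with standard error 0.01
set.seed( 1 )
m <- 0.70 ; se <- 0.01
sims <- rnorm( 1e5 , mean = m , sd = se )
# the simulated variance of log( mean ) ...
var( log( sims ) )
# ... closely matches the delta-method approximation ( se / m ) ^ 2
( se / m ) ^ 2
```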

Second, fit a piecewise linear regression.

```
# estimate the 'starting' linear model with the usual "lm" function using the log values and the weights.
o <- lm( log( mean ) ~ yr , weights = wgt , data = means_for_joinpoint )
```

Now that the regression has been structured correctly, estimate the year that our complex survey trend should be broken into two segments (the changepoint/breakpoint/joinpoint).

```
# the segmented() function uses a random process in its algorithm.
# setting the random seed ensures reproducibility
set.seed( 2015 )
# add a segmented variable (`yr` in this example) with 1 breakpoint
os <- segmented( o , ~yr )
# `os` is now a `segmented` object, which means it includes information on the fitted model,
# such as parameter estimates, standard errors, residuals.
summary( os )
```

See the `Estimated Break-Point(s)` in that result? That's the critical number from this joinpoint analysis.

Note that the above number is not an integer! The R `segmented` package uses an iterative procedure (described in the article below), so between-year solutions are returned. The NCI's joinpoint software instead implements two estimating algorithms: the grid search and the Hudson algorithm. For more detail about these methods, see Muggeo V. (2003) Estimating regression models with unknown break-points. Statistics in Medicine, 22: 3055-3071.

```
# figuring out the breakpoint year was the purpose of this joinpoint analysis.
( your_breakpoint <- round( as.vector( os$psi[, "Est." ] ) ) )
# so. that's a joinpoint. that's where the two line segments join. okay?
# obtain the annual percent change (APC) estimates for each time segment
slope( os , APC = TRUE )
```

The returned confidence intervals for the annual percent change (APC) may differ from the ones returned by NCI's Joinpoint Software; for further details, see Muggeo V. (2010) A comment on 'Estimating average annual per cent change in trend analysis' by Clegg et al., Statistics in Medicine 28, 3670-3682. Statistics in Medicine, 29, 1958-1960.

In the CDC's smoking example, this analysis returned similar results to the NCI's Joinpoint Regression Program, estimating a changepoint at `year=1999` - and, more precisely, that the decreasing trend in smoking prevalence after that point proceeded at an APC of -3.92 percent. That is `slope2` from the output above.
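The APC arithmetic is worth spelling out: in a log-linear model `log( prevalence ) = a + b * year`, each one-year step multiplies the prevalence by `exp( b )`, so the annual percent change is `100 * ( exp( b ) - 1 )`. With a hypothetical log-scale slope of -0.04 (chosen for illustration, not taken from the output above):

```
# annual percent change implied by a log-scale slope of b
b <- -0.04
100 * ( exp( b ) - 1 )  # roughly -3.92 percent per year
```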

## Make statistically defensible statements about linear trends with complex survey data

After identifying the changepoint, we can create two regression models, one for each time segment. (If we had two joinpoints, we would need three regression models.) The first model covers the years leading up to (and including) the changepoint; the second model covers the years from the changepoint forward. In the code below the changepoint falls at 2009, so start with 2005, 2007, and 2009, the three year-points before (and including) the changepoint.

```
# calculate a three-timepoint linear contrast vector
c3l <- contr.poly( 3 )[ , 1 ]
# tack the three-timepoint linear contrast vector onto the current survey design object
des_ns <- update( des_ns , t3l = c3l[ match( year , seq( 2005 , 2009 , 2 ) ) ] )
pre_91_99 <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t3l ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 & year <= 2009 ) ,
family = quasibinomial
)
summary( pre_91_99 )
```

Reproduce the sentence on pdf page 6 of the original document. **In the CDC document's example, T5L_L had a p-value=0.52261 and beta=0.03704. Therefore, there was "no significant change in the prevalence of ever smoking a cigarette during 1991-1999."**

Then move on to 2009, 2011, 2013, and 2015, the four year-points after (and including) the changepoint.

```
# calculate a four-timepoint linear contrast vector
c4l <- contr.poly( 4 )[ , 1 ]
# tack the four-timepoint linear contrast vector onto the current survey design object
des_ns <- update( des_ns , t4l = c4l[ match( year , seq( 2009 , 2015 , 2 ) ) ] )
post_99_11 <-
svyglm(
I( rode_with_drunk_driver == 1 ) ~ sex + raceeth + grade + t4l ,
design = subset( des_ns , rode_with_drunk_driver %in% 1:2 & year >= 2009 ) ,
family = quasibinomial
)
summary( post_99_11 )
```

Reproduce the sentence on pdf page 6 of the original document. **In the CDC document's example, T7L_R had a p-value<0.0001 and beta=-0.99165. Therefore, there was a "significant linear decrease in the prevalence of ever smoking a cigarette during 1999-2011."**

Note also that the 1999-2011 time period saw a linear decrease, which supports the APC estimate in step #8. Here’s everything displayed as a single coherent table.

```
htmlreg(list(pre_91_99, post_99_11), doctype = F, html.tag = F, inline.css = T,
head.tag = F, body.tag = F, center = F, single.row = T, caption = "Table 2. Linear trends pre-post changepoint")
```

*This analysis may complement qualitative evaluation on prevalence changes observed from surveillance data by providing quantitative evidence, such as when a change point occurred. This analysis does not explain why or how changes in trends occur.*