National Survey on Drug Use and Health (NSDUH)

The primary survey measuring the prevalence of substance use and its correlates in the United States.

  • One table with one row per sampled respondent.

  • A complex sample survey designed to generalize to civilian, non-institutionalized Americans aged 12 and older.

  • Released periodically since 1979 and annually since 1990.

  • Administered by the Substance Abuse and Mental Health Services Administration.


Please skim before you begin:

  1. 2021 National Survey on Drug Use and Health (NSDUH): Public Use File Codebook

  2. 2021 National Survey on Drug Use and Health (NSDUH): Methodological Summary and Definitions

  3. A haiku regarding this microdata:

# drinking and thinking
# about your first time, were you
# smoking and joking?

Download, Import, Preparation

Download and import the national file:

zip_tf <- tempfile()

zip_url <-
    paste0(
        "https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/" ,
        "studies/NSDUH-2021/NSDUH-2021-datasets/NSDUH-2021-DS0001/" ,
        "NSDUH-2021-DS0001-bundles-with-study-info/NSDUH-2021-DS0001-bndl-data-r_v3.zip"
    )
    
download.file( zip_url , zip_tf , mode = 'wb' )

nsduh_rdata <- unzip( zip_tf , exdir = tempdir() )

nsduh_rdata_contents <- load( nsduh_rdata )

nsduh_df_name <- grep( 'PUF' , nsduh_rdata_contents , value = TRUE )

nsduh_df <- get( nsduh_df_name )

names( nsduh_df ) <- tolower( names( nsduh_df ) )

nsduh_df[ , 'one' ] <- 1
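
The load, grep, and get steps above can be tried on a tiny stand-in file before downloading anything. This is only a sketch: PUF_TOY, its two columns, and toy_path are hypothetical names invented for the illustration, not objects from the SAMHSA bundle.

```r
# Hypothetical stand-in for the object stored inside the downloaded .RData file
PUF_TOY <- data.frame( QUESTID = 1:3 , CIGTRY = c( 15 , 991 , 18 ) )

toy_path <- tempfile( fileext = ".RData" )
save( PUF_TOY , file = toy_path )
rm( PUF_TOY )

# load() restores the saved objects and returns their names as a character vector
toy_contents <- load( toy_path )

# keep only the object name containing "PUF", then fetch that object by name
toy_df_name <- grep( "PUF" , toy_contents , value = TRUE )
stopifnot( length( toy_df_name ) == 1 )
toy_df <- get( toy_df_name )

# lowercase the column names, as in the import code above
names( toy_df ) <- tolower( names( toy_df ) )
names( toy_df )
```

Checking that grep returns exactly one name guards against a bundle that contains more than one matching object.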

Save Locally  

Save the object at any point:

# nsduh_fn <- file.path( path.expand( "~" ) , "NSDUH" , "this_file.rds" )
# saveRDS( nsduh_df , file = nsduh_fn , compress = FALSE )

Load the same object:

# nsduh_df <- readRDS( nsduh_fn )

Survey Design Definition

Construct a complex sample survey design:

library(survey)

nsduh_design <- 
    svydesign( 
        id = ~ verep , 
        strata = ~ vestr_c , 
        data = nsduh_df , 
        weights = ~ analwt_c , 
        nest = TRUE 
    )
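
The nest = TRUE argument matters when cluster labels are reused from one stratum to the next. The sketch below uses a hypothetical eight-row data set (toy_df, with made-up columns stratum, psu, wt, and y) rather than NSDUH itself, so it only illustrates the mechanism and makes no claim about the actual file's labels.

```r
library(survey)

# Hypothetical toy data: PSU labels 1 and 2 repeat inside each of two strata
toy_df <- data.frame(
    stratum = c( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 ) ,
    psu = c( 1 , 1 , 2 , 2 , 1 , 1 , 2 , 2 ) ,
    wt = c( 10 , 10 , 20 , 20 , 15 , 15 , 5 , 5 ) ,
    y = c( 1 , 0 , 1 , 1 , 0 , 0 , 1 , 0 )
)

# Without nest = TRUE, svydesign() sees PSU 1 in two strata and stops
not_nested <-
    try(
        svydesign( id = ~ psu , strata = ~ stratum , data = toy_df , weights = ~ wt ) ,
        silent = TRUE
    )

inherits( not_nested , "try-error" )

# With nest = TRUE, PSU 1 in stratum 1 and PSU 1 in stratum 2 are distinct clusters
toy_design <-
    svydesign(
        id = ~ psu ,
        strata = ~ stratum ,
        data = toy_df ,
        weights = ~ wt ,
        nest = TRUE
    )

# four distinct PSUs minus two strata
degf( toy_design )
```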

Variable Recoding

Add new columns to the data set:

nsduh_design <- 
    update( 
        nsduh_design , 
        
        one = 1 ,
        
        health = 
            factor( 
                health , 
                levels = 1:5 , 
                labels = c( "excellent" , "very good" , "good" ,
                    "fair" , "poor" )
            ) ,
            
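        # values above 99 are special codes rather than ages, so set them to missing
        age_first_cigarette = ifelse( cigtry > 99 , NA , cigtry ) ,
        
        age_tried_cocaine = ifelse( cocage > 99 , NA , cocage ) ,

        # 1 becomes 1, other codes below 4 become 0, codes of 4 or more become missing
        ever_used_marijuana = as.numeric( ifelse( mjever < 4 , mjever == 1 , NA ) ) ,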
        
        county_type =
            factor(
                coutyp4 ,
                levels = 1:3 ,
                labels = c( "large metro" , "small metro" , "nonmetro" )
            )
            
    )
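
The two ifelse patterns in the recode above can be checked on plain vectors. The input values below are hypothetical and chosen only to exercise each branch; consult the codebook for the codes that actually appear in cigtry and mjever.

```r
# Hypothetical raw values: two-digit entries are ages, larger entries are special codes
cigtry <- c( 14 , 17 , 991 , 994 , 21 )

age_first_cigarette <- ifelse( cigtry > 99 , NA , cigtry )
age_first_cigarette
# 14 17 NA NA 21

# Hypothetical raw values: small codes are substantive answers, large codes are not
mjever <- c( 1 , 2 , 94 , 97 , 1 )

ever_used_marijuana <- as.numeric( ifelse( mjever < 4 , mjever == 1 , NA ) )
ever_used_marijuana
# 1 0 NA NA 1
```

Because the special codes become missing, later means of age_first_cigarette with na.rm = TRUE describe only respondents who reported an age.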

Analysis Examples with the survey library  

Unweighted Counts

Count the unweighted number of records in the survey sample, overall and by groups:

sum( weights( nsduh_design , "sampling" ) != 0 )

svyby( ~ one , ~ county_type , nsduh_design , unwtd.count )

Weighted Counts

Count the weighted size of the generalizable population, overall and by groups:

svytotal( ~ one , nsduh_design )

svyby( ~ one , ~ county_type , nsduh_design , svytotal )
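
The weighted count works because the total of a column of ones is simply the sum of the sampling weights. A minimal sketch on a hypothetical toy design (toy_df and its columns are invented for illustration):

```r
library(survey)

# Hypothetical toy data with a constant column of ones
toy_df <- data.frame(
    stratum = c( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 ) ,
    psu = c( 1 , 1 , 2 , 2 , 1 , 1 , 2 , 2 ) ,
    wt = c( 10 , 10 , 20 , 20 , 15 , 15 , 5 , 5 ) ,
    one = 1
)

toy_design <-
    svydesign(
        id = ~ psu ,
        strata = ~ stratum ,
        data = toy_df ,
        weights = ~ wt ,
        nest = TRUE
    )

# estimated population size from the design
toy_total <- svytotal( ~ one , toy_design )
coef( toy_total )

# the same number, computed directly from the weights
sum( toy_df[ , 'wt' ] )
```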

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

svymean( ~ age_first_cigarette , nsduh_design , na.rm = TRUE )

svyby( ~ age_first_cigarette , ~ county_type , nsduh_design , svymean , na.rm = TRUE )

Calculate the distribution of a categorical variable, overall and by groups:

svymean( ~ health , nsduh_design , na.rm = TRUE )

svyby( ~ health , ~ county_type , nsduh_design , svymean , na.rm = TRUE )

Calculate the sum of a linear variable, overall and by groups:

svytotal( ~ age_first_cigarette , nsduh_design , na.rm = TRUE )

svyby( ~ age_first_cigarette , ~ county_type , nsduh_design , svytotal , na.rm = TRUE )

Calculate the weighted sum of a categorical variable, overall and by groups:

svytotal( ~ health , nsduh_design , na.rm = TRUE )

svyby( ~ health , ~ county_type , nsduh_design , svytotal , na.rm = TRUE )

Calculate the median (50th percentile) of a linear variable, overall and by groups:

svyquantile( ~ age_first_cigarette , nsduh_design , 0.5 , na.rm = TRUE )

svyby( 
    ~ age_first_cigarette , 
    ~ county_type , 
    nsduh_design , 
    svyquantile , 
    0.5 ,
    ci = TRUE , na.rm = TRUE
)

Estimate a ratio:

svyratio( 
    numerator = ~ age_first_cigarette , 
    denominator = ~ age_tried_cocaine , 
    nsduh_design ,
    na.rm = TRUE
)
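
With na.rm = TRUE, the ratio is a weighted total over a weighted total, computed only on records where both variables are present. The sketch below checks this by hand on a hypothetical toy design; num and den are made-up columns standing in for the two age variables.

```r
library(survey)

# Hypothetical toy data with one missing value in each variable
toy_df <- data.frame(
    stratum = c( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 ) ,
    psu = c( 1 , 1 , 2 , 2 , 1 , 1 , 2 , 2 ) ,
    wt = c( 10 , 10 , 20 , 20 , 15 , 15 , 5 , 5 ) ,
    num = c( 12 , 15 , NA , 18 , 14 , 16 , 13 , 17 ) ,
    den = c( 20 , NA , 25 , 22 , 19 , 30 , 21 , 24 )
)

toy_design <-
    svydesign(
        id = ~ psu ,
        strata = ~ stratum ,
        data = toy_df ,
        weights = ~ wt ,
        nest = TRUE
    )

toy_ratio <- svyratio( ~ num , ~ den , toy_design , na.rm = TRUE )

# by hand: keep rows where both variables are observed, then divide weighted totals
both_present <- !is.na( toy_df[ , 'num' ] ) & !is.na( toy_df[ , 'den' ] )
complete_df <- toy_df[ both_present , ]
by_hand <-
    sum( complete_df[ , 'wt' ] * complete_df[ , 'num' ] ) /
    sum( complete_df[ , 'wt' ] * complete_df[ , 'den' ] )

all.equal( as.numeric( coef( toy_ratio ) ) , by_hand )
```

In the NSDUH example this means the ratio describes respondents who reported an age for both cigarettes and cocaine, not the full population.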

Subsetting

Restrict the survey design to individuals who are pregnant:

sub_nsduh_design <- subset( nsduh_design , preg == 1 )

Calculate the mean (average) of this subset:

svymean( ~ age_first_cigarette , sub_nsduh_design , na.rm = TRUE )
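
Subsetting the design object, rather than filtering the data frame before calling svydesign, keeps the original strata and cluster information available for variance estimation. A minimal sketch on a hypothetical toy design, where in_domain is a made-up indicator standing in for preg == 1:

```r
library(survey)

# Hypothetical toy data: half of the records fall inside the domain of interest
toy_df <- data.frame(
    stratum = c( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 ) ,
    psu = c( 1 , 1 , 2 , 2 , 1 , 1 , 2 , 2 ) ,
    wt = c( 10 , 10 , 20 , 20 , 15 , 15 , 5 , 5 ) ,
    in_domain = c( 1 , 0 , 1 , 0 , 1 , 0 , 1 , 0 ) ,
    y = c( 12 , 15 , 14 , 18 , 16 , 13 , 17 , 11 )
)

toy_design <-
    svydesign(
        id = ~ psu ,
        strata = ~ stratum ,
        data = toy_df ,
        weights = ~ wt ,
        nest = TRUE
    )

# restrict the design object itself, not the underlying data frame
sub_toy_design <- subset( toy_design , in_domain == 1 )

sub_mean <- svymean( ~ y , sub_toy_design )
sub_mean

# the point estimate equals the weighted mean among domain members
domain_df <- toy_df[ toy_df[ , 'in_domain' ] == 1 , ]
weighted.mean( domain_df[ , 'y' ] , domain_df[ , 'wt' ] )
```

The point estimate is the same either way; the standard error is where the two approaches can differ, for instance when a subset leaves a stratum with fewer sampled clusters than the full design had.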

Measures of Uncertainty

Extract the coefficient, standard error, confidence interval, and coefficient of variation from any descriptive statistics function result, overall and by groups:

this_result <- svymean( ~ age_first_cigarette , nsduh_design , na.rm = TRUE )

coef( this_result )
SE( this_result )
confint( this_result )
cv( this_result )

grouped_result <-
    svyby( 
        ~ age_first_cigarette , 
        ~ county_type , 
        nsduh_design , 
        svymean ,
        na.rm = TRUE 
    )
    
coef( grouped_result )
SE( grouped_result )
confint( grouped_result )
cv( grouped_result )
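
These extractors are related in simple ways: the coefficient of variation is the standard error divided by the estimate, and the default confidence interval is the estimate plus or minus a normal quantile times the standard error. The sketch below verifies both on a hypothetical toy design (toy_df is invented for illustration) and shows the df argument of confint, which switches to a t-based interval.

```r
library(survey)

# Hypothetical toy data
toy_df <- data.frame(
    stratum = c( 1 , 1 , 1 , 1 , 2 , 2 , 2 , 2 ) ,
    psu = c( 1 , 1 , 2 , 2 , 1 , 1 , 2 , 2 ) ,
    wt = c( 10 , 10 , 20 , 20 , 15 , 15 , 5 , 5 ) ,
    y = c( 12 , 15 , 14 , 18 , 16 , 13 , 17 , 11 )
)

toy_design <-
    svydesign(
        id = ~ psu ,
        strata = ~ stratum ,
        data = toy_df ,
        weights = ~ wt ,
        nest = TRUE
    )

toy_mean <- svymean( ~ y , toy_design )

est <- as.numeric( coef( toy_mean ) )
se <- as.numeric( SE( toy_mean ) )

# coefficient of variation is the standard error relative to the estimate
all.equal( as.numeric( cv( toy_mean ) ) , se / est )

# default interval uses normal quantiles
ci_normal <- confint( toy_mean )
all.equal( as.numeric( ci_normal ) , est + c( -1 , 1 ) * qnorm( 0.975 ) * se )

# t-based interval using the design degrees of freedom, wider when degf is small
ci_t <- confint( toy_mean , df = degf( toy_design ) )
ci_t
```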

Calculate the degrees of freedom of any survey design object:

degf( nsduh_design )

Calculate the complex sample survey-adjusted variance of any statistic:

svyvar( ~ age_first_cigarette , nsduh_design , na.rm = TRUE )

Include the complex sample design effect in the result for a specific statistic:

# design effect relative to simple random sampling (SRS) without replacement
svymean( ~ age_first_cigarette , nsduh_design , na.rm = TRUE , deff = TRUE )

# design effect relative to simple random sampling (SRS) with replacement
svymean( ~ age_first_cigarette , nsduh_design , na.rm = TRUE , deff = "replace" )

Compute confidence intervals for proportions using methods that may be more accurate near 0 and 1. See ?svyciprop for alternatives:

svyciprop( ~ ever_used_marijuana , nsduh_design ,
    method = "likelihood" , na.rm = TRUE )

Regression Models and Tests of Association

Perform a design-based t-test:

svyttest( age_first_cigarette ~ ever_used_marijuana , nsduh_design )

Perform a chi-squared test of association for survey data:

svychisq( 
    ~ ever_used_marijuana + health , 
    nsduh_design 
)

Fit a survey-weighted generalized linear model:

glm_result <- 
    svyglm( 
        age_first_cigarette ~ ever_used_marijuana + health , 
        nsduh_design 
    )

summary( glm_result )

Replication Example

This matches the prevalence and SE of alcohol use in the past month from Codebook Table G.2:

result <- svymean( ~ alcmon , nsduh_design )

stopifnot( round( coef( result ) , 3 ) == 0.474 )
stopifnot( round( SE( result ) , 4 ) == 0.0043 )

Analysis Examples with srvyr  

The R srvyr library calculates summary statistics from survey data, such as the mean, total, or quantile, using dplyr-like syntax. srvyr supports many dplyr verbs, such as summarize, group_by, and mutate, along with pipe-able functions, tidyverse-style non-standard evaluation, and more consistent return types than the survey package. The srvyr vignette details the available features. As a starting point for NSDUH users, this code replicates previously presented examples:

library(srvyr)
nsduh_srvyr_design <- as_survey( nsduh_design )

Calculate the mean (average) of a linear variable, overall and by groups:

nsduh_srvyr_design %>%
    summarize( mean = survey_mean( age_first_cigarette , na.rm = TRUE ) )

nsduh_srvyr_design %>%
    group_by( county_type ) %>%
    summarize( mean = survey_mean( age_first_cigarette , na.rm = TRUE ) )