Surveillance Epidemiology and End Results (SEER)

Build Status Build status

The Surveillance Epidemiology and End Results (SEER) aggregates person-level information for more than a quarter of cancer incidence in the United States.

  • A series of both individual- and population-level tables, grouped by site of cancer diagnosis.

  • A registry covering various geographies across the US population, standardized by SEER*Stat to produce nationally-representative estimates.

  • Updated every spring based on the previous November’s submission of data.

  • Maintained by the United States National Cancer Institute (NCI)

Simplified Download and Importation

The R lodown package easily downloads and imports all available SEER microdata by simply specifying "seer" with an output_dir = parameter in the lodown() function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

library(lodown)
lodown( "seer" , output_dir = file.path( path.expand( "~" ) , "SEER" ) , 
    your_username = "username" , 
    your_password = "password" )

Analysis Examples with base R  

Load a data frame:

available_files <-
    list.files( 
        file.path( path.expand( "~" ) , "SEER" ) , 
        recursive = TRUE , 
        full.names = TRUE 
    )

seer_df <- 
    readRDS( grep( "incidence(.*)yr1973(.*)LYMYLEUK" , available_files , value = TRUE ) )

Variable Recoding

Add new columns to the data set:

seer_df <- 
    transform( 
        seer_df , 
        
        survival_months = ifelse( srv_time_mon == 9999 , NA , as.numeric( srv_time_mon ) ) ,
        
        female = as.numeric( sex == 2 ) ,
        
        race_ethnicity =
            ifelse( race1v == 99 , "unknown" ,
            ifelse( nhiade > 0 , "hispanic" , 
            ifelse( race1v == 1 , "white non-hispanic" ,
            ifelse( race1v == 2 , "black non-hispanic" , 
                "other non-hispanic" ) ) ) ) ,
        
        marital_status_at_dx =
            factor( 
                as.numeric( mar_stat ) , 
                levels = c( 1:6 , 9 ) ,
                labels =
                    c(
                        "single (never married)" ,
                        "married" ,
                        "separated" ,
                        "divorced" ,
                        "widowed" ,
                        "unmarried or domestic partner or unregistered" ,
                        "unknown"
                    )
            )
    )

Unweighted Counts

Count the unweighted number of records in the table, overall and by groups:

nrow( seer_df )

table( seer_df[ , "race_ethnicity" ] , useNA = "always" )

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

mean( seer_df[ , "survival_months" ] , na.rm = TRUE )

tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    mean ,
    na.rm = TRUE 
)

Calculate the distribution of a categorical variable, overall and by groups:

prop.table( table( seer_df[ , "marital_status_at_dx" ] ) )

prop.table(
    table( seer_df[ , c( "marital_status_at_dx" , "race_ethnicity" ) ] ) ,
    margin = 2
)

Calculate the sum of a linear variable, overall and by groups:

sum( seer_df[ , "survival_months" ] , na.rm = TRUE )

tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    sum ,
    na.rm = TRUE 
)

Calculate the median (50th percentile) of a linear variable, overall and by groups:

quantile( seer_df[ , "survival_months" ] , 0.5 , na.rm = TRUE )

tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    quantile ,
    0.5 ,
    na.rm = TRUE 
)

Subsetting

Limit your data.frame to inpatient hospital reporting source:

sub_seer_df <- subset( seer_df , rept_src == 1 )

Calculate the mean (average) of this subset:

mean( sub_seer_df[ , "survival_months" ] , na.rm = TRUE )

Measures of Uncertainty

Calculate the variance, overall and by groups:

var( seer_df[ , "survival_months" ] , na.rm = TRUE )

tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    var ,
    na.rm = TRUE 
)

Regression Models and Tests of Association

Perform a t-test:

t.test( survival_months ~ female , seer_df )

Perform a chi-squared test of association:

this_table <- table( seer_df[ , c( "female" , "marital_status_at_dx" ) ] )

chisq.test( this_table )

Perform a generalized linear model:

glm_result <- 
    glm( 
        survival_months ~ female + marital_status_at_dx , 
        data = seer_df
    )

summary( glm_result )

Analysis Examples with dplyr  

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for SEER users, this code replicates previously-presented examples:

library(dplyr)
seer_tbl <- tbl_df( seer_df )

Calculate the mean (average) of a linear variable, overall and by groups:

seer_tbl %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )

seer_tbl %>%
    group_by( race_ethnicity ) %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )