Surveillance, Epidemiology, and End Results (SEER)
The Surveillance, Epidemiology, and End Results (SEER) program aggregates person-level information for more than a quarter of cancer incidence in the United States.
A series of both individual- and population-level tables, grouped by site of cancer diagnosis.
A registry covering various geographies across the US population, standardized by SEER*Stat to produce nationally-representative estimates.
Updated every spring based on the previous November’s submission of data.
Maintained by the United States National Cancer Institute (NCI).
Simplified Download and Importation
The R lodown package easily downloads and imports all available SEER microdata by simply specifying "seer" with an output_dir = parameter in the lodown() function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.
library(lodown)
lodown( "seer" , output_dir = file.path( path.expand( "~" ) , "SEER" ) ,
    your_username = "username" ,
    your_password = "password" )
Analysis Examples with base R
Load a data frame:
available_files <-
    list.files(
        file.path( path.expand( "~" ) , "SEER" ) ,
        recursive = TRUE ,
        full.names = TRUE
    )
seer_df <-
    readRDS( grep( "incidence(.*)yr1973(.*)LYMYLEUK" , available_files , value = TRUE ) )
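Because readRDS() expects a single path, it can help to confirm that the pattern matches exactly one file before loading. The sketch below runs the same pattern against a few made-up paths; the file names in example_files are hypothetical and only illustrate how the regular expression filters, not what lodown actually writes to disk.

```r
# Hypothetical file paths, invented purely to illustrate the pattern match
example_files <-
    c(
        "SEER/incidence/yr1973_example/LYMYLEUK.rds" ,
        "SEER/incidence/yr1973_example/BREAST.rds" ,
        "SEER/incidence/yr2000_example/LYMYLEUK.rds"
    )

# Keep only paths containing "incidence", then "yr1973", then "LYMYLEUK", in that order
matched_file <-
    grep( "incidence(.*)yr1973(.*)LYMYLEUK" , example_files , value = TRUE )

# readRDS() needs exactly one path, so stop early if the match is ambiguous or empty
stopifnot( length( matched_file ) == 1 )

print( matched_file )
```

If the check fails on real data, tightening the pattern or inspecting the file list directly is a reasonable next step.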
Variable Recoding
Add new columns to the data set:
seer_df <-
    transform(
        seer_df ,
        survival_months = ifelse( srv_time_mon == 9999 , NA , as.numeric( srv_time_mon ) ) ,
        female = as.numeric( sex == 2 ) ,
        race_ethnicity =
            ifelse( race1v == 99 , "unknown" ,
                ifelse( nhiade > 0 , "hispanic" ,
                    ifelse( race1v == 1 , "white non-hispanic" ,
                        ifelse( race1v == 2 , "black non-hispanic" ,
                            "other non-hispanic" ) ) ) ) ,
        marital_status_at_dx =
            factor(
                as.numeric( mar_stat ) ,
                levels = c( 1:6 , 9 ) ,
                labels =
                    c(
                        "single (never married)" ,
                        "married" ,
                        "separated" ,
                        "divorced" ,
                        "widowed" ,
                        "unmarried or domestic partner or unregistered" ,
                        "unknown"
                    )
            )
    )
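The nested ifelse() calls are evaluated in order, so the first matching condition wins: a record with unknown race is labeled "unknown" even if its Hispanic-origin code is positive. A minimal sketch on made-up data shows this. The values in toy_df are invented, and storing srv_time_mon as numeric is an assumption made for the example.

```r
# Made-up records that reuse the column names from the recode above
toy_df <-
    data.frame(
        race1v = c( 1 , 2 , 1 , 99 , 4 , 99 ) ,
        nhiade = c( 0 , 0 , 1 , 0 , 0 , 1 ) ,
        srv_time_mon = c( 12 , 9999 , 100 , 3 , 45 , 8 )
    )

toy_df <-
    transform(
        toy_df ,
        # 9999 is treated as the missing-value code, as in the recode above
        survival_months = ifelse( srv_time_mon == 9999 , NA , as.numeric( srv_time_mon ) ) ,
        # conditions are checked top to bottom; the first TRUE determines the label
        race_ethnicity =
            ifelse( race1v == 99 , "unknown" ,
                ifelse( nhiade > 0 , "hispanic" ,
                    ifelse( race1v == 1 , "white non-hispanic" ,
                        ifelse( race1v == 2 , "black non-hispanic" ,
                            "other non-hispanic" ) ) ) )
    )

print( toy_df )
```

The last toy record has both an unknown race code and a positive nhiade value, and it lands in "unknown" because that test comes first.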
Unweighted Counts
Count the unweighted number of records in the table, overall and by groups:
nrow( seer_df )
table( seer_df[ , "race_ethnicity" ] , useNA = "always" )
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
mean( seer_df[ , "survival_months" ] , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    mean ,
    na.rm = TRUE
)
Calculate the distribution of a categorical variable, overall and by groups:
prop.table( table( seer_df[ , "marital_status_at_dx" ] ) )
prop.table(
    table( seer_df[ , c( "marital_status_at_dx" , "race_ethnicity" ) ] ) ,
    margin = 2
)
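In the two-way call, margin = 2 divides each cell by its column total, so the shares within each group (the second variable) sum to one. A small sketch on invented data makes this concrete; the marital and group vectors are hypothetical.

```r
# Invented two-way table: rows are marital categories, columns are groups
toy_table <-
    table(
        marital = c( "married" , "married" , "single" , "single" , "single" , "married" ) ,
        group = c( "a" , "a" , "a" , "b" , "b" , "b" )
    )

# margin = 2 computes proportions within each column
col_shares <- prop.table( toy_table , margin = 2 )

print( col_shares )

# each column of shares sums to one
print( colSums( col_shares ) )
```

Using margin = 1 instead would make each row sum to one, which answers a different question (the group mix within each marital category).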
Calculate the sum of a linear variable, overall and by groups:
sum( seer_df[ , "survival_months" ] , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    sum ,
    na.rm = TRUE
)
Calculate the median (50th percentile) of a linear variable, overall and by groups:
quantile( seer_df[ , "survival_months" ] , 0.5 , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    quantile ,
    0.5 ,
    na.rm = TRUE
)
Subsetting
Limit your data.frame to inpatient hospital reporting source:
sub_seer_df <- subset( seer_df , rept_src == 1 )
Calculate the mean (average) of this subset:
mean( sub_seer_df[ , "survival_months" ] , na.rm = TRUE )
Measures of Uncertainty
Calculate the variance, overall and by groups:
var( seer_df[ , "survival_months" ] , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    var ,
    na.rm = TRUE
)
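The variance describes spread among the records themselves. One possible follow-up, sketched here on made-up numbers, is the standard error of a mean: the square root of the variance divided by the count of non-missing values. This treats the values as a simple unweighted sample, which is an assumption made for illustration and not a statement about how SEER data should be analyzed. The vector x is hypothetical.

```r
# Made-up survival times in months, including one missing value
x <- c( 12 , 30 , NA , 7 , 55 , 18 )

# count only the non-missing observations
n_obs <- sum( !is.na( x ) )

# standard error of the mean under a simple unweighted-sample assumption
se_mean <- sqrt( var( x , na.rm = TRUE ) / n_obs )

print( n_obs )
print( se_mean )
```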
Regression Models and Tests of Association
Perform a t-test:
t.test( survival_months ~ female , seer_df )
Perform a chi-squared test of association:
this_table <- table( seer_df[ , c( "female" , "marital_status_at_dx" ) ] )
chisq.test( this_table )
Fit a generalized linear model (with no family argument, glm() defaults to the gaussian family):
glm_result <-
    glm(
        survival_months ~ female + marital_status_at_dx ,
        data = seer_df
    )
summary( glm_result )
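Since the glm() call above does not set a family, R falls back to the gaussian family with an identity link, which gives the same coefficients as ordinary least squares from lm(). A minimal sketch on invented data illustrates the equivalence; the values in toy_df are made up.

```r
# Invented data: four records per group
toy_df <-
    data.frame(
        female = c( 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 ) ,
        survival_months = c( 10 , 20 , 30 , 40 , 15 , 25 , 35 , 45 )
    )

# glm() without a family argument uses gaussian()
glm_fit <- glm( survival_months ~ female , data = toy_df )

# the same model fit by ordinary least squares
lm_fit <- lm( survival_months ~ female , data = toy_df )

print( family( glm_fit )$family )
print( all.equal( coef( glm_fit ) , coef( lm_fit ) ) )
```

For outcomes that are not well described by a gaussian model, such as a binary indicator, a different family (for example binomial()) would need to be specified explicitly.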
Analysis Examples with dplyr
The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. The package vignette details the available features. As a starting point for SEER users, this code replicates previously-presented examples:
library(dplyr)
seer_tbl <- as_tibble( seer_df )
Calculate the mean (average) of a linear variable, overall and by groups:
seer_tbl %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )
seer_tbl %>%
    group_by( race_ethnicity ) %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )