# Surveillance, Epidemiology, and End Results (SEER)

The Surveillance, Epidemiology, and End Results (SEER) program aggregates person-level information for more than a quarter of cancer incidence in the United States.

A series of both individual- and population-level tables, grouped by site of cancer diagnosis.

A registry covering various geographies across the US population, standardized by SEER*Stat to produce nationally-representative estimates.

Updated every spring based on the previous November’s submission of data.

Maintained by the United States National Cancer Institute (NCI).

## Simplified Download and Importation

The R `lodown` package easily downloads and imports all available SEER microdata by simply specifying `"seer"` with an `output_dir =` parameter in the `lodown()` function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

```
library(lodown)
lodown( "seer" , output_dir = file.path( path.expand( "~" ) , "SEER" ) ,
    your_username = "username" ,
    your_password = "password" )
```

## Analysis Examples with base R

Load a data frame:

```
available_files <-
    list.files(
        file.path( path.expand( "~" ) , "SEER" ) ,
        recursive = TRUE ,
        full.names = TRUE
    )

seer_df <-
    readRDS( grep( "incidence(.*)yr1973(.*)LYMYLEUK" , available_files , value = TRUE ) )
```
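`readRDS()` reads a single file, so the pattern above needs to match exactly one path. A minimal sketch of checking the match count before reading, run against a temporary directory. The directory and file names below are hypothetical placeholders, not the names `lodown` actually writes:

```r
# hypothetical stand-in for the SEER output directory
demo_dir <- file.path( tempdir() , "SEER_demo" )
dir.create( file.path( demo_dir , "incidence" ) , recursive = TRUE , showWarnings = FALSE )

# write two tiny placeholder files with made-up names
saveRDS( data.frame( x = 1:3 ) , file.path( demo_dir , "incidence" , "yr1973_demo_LYMYLEUK.rds" ) )
saveRDS( data.frame( x = 4:6 ) , file.path( demo_dir , "incidence" , "yr1973_demo_BREAST.rds" ) )

demo_files <- list.files( demo_dir , recursive = TRUE , full.names = TRUE )

# keep only the paths matching the pattern
matched <- grep( "incidence(.*)yr1973(.*)LYMYLEUK" , demo_files , value = TRUE )

# stop early unless exactly one file matched
stopifnot( length( matched ) == 1 )

demo_df <- readRDS( matched )
```

If the check fails, printing `matched` shows whether the pattern was too loose (several paths) or too strict (none).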

### Variable Recoding

Add new columns to the data set:

```
seer_df <-
    transform(
        seer_df ,

        # treat the 9999 code as missing
        survival_months = ifelse( srv_time_mon == 9999 , NA , as.numeric( srv_time_mon ) ) ,

        # 0/1 indicator built from the sex code
        female = as.numeric( sex == 2 ) ,

        # conditions are evaluated in order: unknown race first, then hispanic origin, then race
        race_ethnicity =
            ifelse( race1v == 99 , "unknown" ,
            ifelse( nhiade > 0 , "hispanic" ,
            ifelse( race1v == 1 , "white non-hispanic" ,
            ifelse( race1v == 2 , "black non-hispanic" ,
                "other non-hispanic" ) ) ) ) ,

        # codes outside of 1 through 6 and 9 become NA
        marital_status_at_dx =
            factor(
                as.numeric( mar_stat ) ,
                levels = c( 1:6 , 9 ) ,
                labels =
                    c(
                        "single (never married)" ,
                        "married" ,
                        "separated" ,
                        "divorced" ,
                        "widowed" ,
                        "unmarried or domestic partner or unregistered" ,
                        "unknown"
                    )
            )
    )
```
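Nested `ifelse()` calls are easy to get wrong, so it can help to cross-tabulate a recoded column against its source column. A minimal sketch on invented toy data that reuses the `race1v` and `nhiade` column names and codes from the recode above (the values themselves are made up):

```r
# toy data with invented values, same column names as the recode above
toy_df <-
    data.frame(
        race1v = c( 1 , 1 , 2 , 99 , 3 ) ,
        nhiade = c( 0 , 1 , 0 , 0 , 0 )
    )

# same nested recode as in the main example
toy_df <-
    transform(
        toy_df ,
        race_ethnicity =
            ifelse( race1v == 99 , "unknown" ,
            ifelse( nhiade > 0 , "hispanic" ,
            ifelse( race1v == 1 , "white non-hispanic" ,
            ifelse( race1v == 2 , "black non-hispanic" ,
                "other non-hispanic" ) ) ) )
    )

# each source code should land in the expected recoded category
recode_check <- table( toy_df$race_ethnicity , toy_df$race1v , useNA = "always" )
recode_check
```

The same check against `seer_df` would swap `toy_df` for `seer_df` in the `table()` call.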

### Unweighted Counts

Count the unweighted number of records in the table, overall and by groups:

```
nrow( seer_df )
table( seer_df[ , "race_ethnicity" ] , useNA = "always" )
```

### Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

```
mean( seer_df[ , "survival_months" ] , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    mean ,
    na.rm = TRUE
)
```

Calculate the distribution of a categorical variable, overall and by groups:

```
prop.table( table( seer_df[ , "marital_status_at_dx" ] ) )
prop.table(
    table( seer_df[ , c( "marital_status_at_dx" , "race_ethnicity" ) ] ) ,
    margin = 2
)
```

Calculate the sum of a linear variable, overall and by groups:

```
sum( seer_df[ , "survival_months" ] , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    sum ,
    na.rm = TRUE
)
```

Calculate the median (50th percentile) of a linear variable, overall and by groups:

```
quantile( seer_df[ , "survival_months" ] , 0.5 , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    quantile ,
    0.5 ,
    na.rm = TRUE
)
```
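The same `quantile()` call accepts a vector of probabilities, which gives quartiles in one pass. A minimal sketch on invented toy data; the `toy_df` values and the group labels are made up for illustration:

```r
# toy data with invented values
toy_df <-
    data.frame(
        survival_months = c( 2 , 10 , 24 , 60 , NA , 5 , 12 , 36 ) ,
        race_ethnicity = rep( c( "group a" , "group b" ) , each = 4 )
    )

probs <- c( 0.25 , 0.5 , 0.75 )

# overall quartiles, ignoring missing values
quartiles <- quantile( toy_df[ , "survival_months" ] , probs , na.rm = TRUE )

# quartiles by group: tapply returns one named vector per group
by_group <-
    tapply(
        toy_df[ , "survival_months" ] ,
        toy_df[ , "race_ethnicity" ] ,
        quantile ,
        probs ,
        na.rm = TRUE
    )

quartiles
by_group
```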

### Subsetting

Limit your `data.frame` to inpatient hospital reporting source:

`sub_seer_df <- subset( seer_df , rept_src == 1 )`

Calculate the mean (average) of this subset:

`mean( sub_seer_df[ , "survival_months" ] , na.rm = TRUE )`

### Measures of Uncertainty

Calculate the variance, overall and by groups:

```
var( seer_df[ , "survival_months" ] , na.rm = TRUE )
tapply(
    seer_df[ , "survival_months" ] ,
    seer_df[ , "race_ethnicity" ] ,
    var ,
    na.rm = TRUE
)
```
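A variance can be turned into a standard error and a confidence interval for the mean. The sketch below does this on invented toy values and assumes the records can be treated as independent draws, which is a simplification and may not suit every SEER analysis:

```r
# toy survival times with invented values
toy_months <- c( 2 , 10 , 24 , 60 , NA , 5 , 12 , 36 )

# number of non-missing observations
n_obs <- sum( !is.na( toy_months ) )

this_mean <- mean( toy_months , na.rm = TRUE )

# standard error of the mean: square root of variance over n
this_se <- sqrt( var( toy_months , na.rm = TRUE ) / n_obs )

# 95% confidence interval based on the t distribution
ci_95 <- this_mean + c( -1 , 1 ) * qt( 0.975 , df = n_obs - 1 ) * this_se

this_se
ci_95
```

The interval should agree with the one reported by `t.test( toy_months )`, which uses the same formula.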

### Regression Models and Tests of Association

Perform a t-test:

`t.test( survival_months ~ female , seer_df )`

Perform a chi-squared test of association:

```
this_table <- table( seer_df[ , c( "female" , "marital_status_at_dx" ) ] )
chisq.test( this_table )
```

Fit a generalized linear model:

```
glm_result <-
    glm(
        survival_months ~ female + marital_status_at_dx ,
        data = seer_df
    )

summary( glm_result )
```
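When no `family =` argument is given, `glm()` uses the gaussian family with an identity link, so the coefficients match an ordinary least squares fit from `lm()`. A minimal sketch on invented toy data (the values are made up) that confirms this:

```r
# toy data with invented values
toy_df <-
    data.frame(
        survival_months = c( 2 , 10 , 24 , 60 , 7 , 5 , 12 , 36 ) ,
        female = c( 0 , 1 , 0 , 1 , 0 , 1 , 0 , 1 )
    )

# glm without a family argument defaults to gaussian
glm_fit <- glm( survival_months ~ female , data = toy_df )

# ordinary least squares on the same formula
lm_fit <- lm( survival_months ~ female , data = toy_df )

# the two sets of coefficients should be equal
all.equal( coef( glm_fit ) , coef( lm_fit ) )

# confirm which family was used
family( glm_fit )$family
```

For a binary outcome, passing `family = binomial()` to `glm()` would fit a logistic regression instead.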

## Analysis Examples with `dplyr`

The R `dplyr` library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as `summarize`, `group_by`, and `mutate`, the convenience of pipe-able functions, and the `tidyverse` style of non-standard evaluation. The package's introductory vignette details the available features. As a starting point for SEER users, this code replicates previously-presented examples:

```
library(dplyr)

# tbl_df() is deprecated in current dplyr releases; as_tibble() replaces it
seer_tbl <- as_tibble( seer_df )
```

Calculate the mean (average) of a linear variable, overall and by groups:

```
seer_tbl %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )

seer_tbl %>%
    group_by( race_ethnicity ) %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )
```
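Other base R examples from earlier sections translate in the same way. A minimal sketch on invented toy data showing unweighted counts, a grouped median, and a subset mean; the `toy_tbl` values and group labels are made up, while the column names mirror those used above:

```r
library(dplyr)

# toy data with invented values
toy_tbl <-
    as_tibble(
        data.frame(
            survival_months = c( 2 , 10 , 24 , 60 , NA , 5 , 12 , 36 ) ,
            race_ethnicity = rep( c( "group a" , "group b" ) , each = 4 ) ,
            rept_src = c( 1 , 1 , 2 , 1 , 1 , 2 , 1 , 1 )
        )
    )

# unweighted counts by group
group_counts <-
    toy_tbl %>%
    count( race_ethnicity )

# median by group
group_medians <-
    toy_tbl %>%
    group_by( race_ethnicity ) %>%
    summarize( median = median( survival_months , na.rm = TRUE ) )

# mean within a subset of records
subset_mean <-
    toy_tbl %>%
    filter( rept_src == 1 ) %>%
    summarize( mean = mean( survival_months , na.rm = TRUE ) )
```

Here `count()` plays the role of `table()`, and `filter()` plays the role of `subset()`.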