Medical Large Claims Experience Study (MLCES)
A high quality dataset of medical claims from seven private health insurance companies.
One table with one row per individual with nonzero total paid charges.
A convenience sample of group (employer-sponsored) health insurers in the United States.
1997 thru 1999 with no expected updates in the future.
Provided by the Society of Actuaries (SOA).
Please skim before you begin:
Group Medical Insurance Claims Database Collection and Analysis Report
Claim Severities, Claim Relativities, and Age: Evidence from SOA Group Health Data
A haiku regarding this microdata:
Download, Import, Preparation
Download and import the 1999 medical claims file:
tf <- tempfile()
this_url <- "https://www.soa.org/Files/Research/1999.zip"
download.file( this_url , tf , mode = 'wb' )
unzipped_file <- unzip( tf , exdir = tempdir() )
mlces_df <- read.csv( unzipped_file )
names( mlces_df ) <- tolower( names( mlces_df ) )
Save Locally
Save the object at any point:
# mlces_fn <- file.path( path.expand( "~" ) , "MLCES" , "this_file.rds" )
# saveRDS( mlces_df , file = mlces_fn , compress = FALSE )
Load the same object:
Variable Recoding
Add new columns to the data set:
mlces_df <-
transform(
mlces_df ,
one = 1 ,
claimant_relationship_to_policyholder =
ifelse( relation == "E" , "covered employee" ,
ifelse( relation == "S" , "spouse of covered employee" ,
ifelse( relation == "D" , "dependent of covered employee" , NA ) ) ) ,
ppo_plan = as.numeric( ppo == 'Y' )
)
Analysis Examples with base R
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
mean( mlces_df[ , "totpdchg" ] )
tapply(
mlces_df[ , "totpdchg" ] ,
mlces_df[ , "claimant_relationship_to_policyholder" ] ,
mean
)
Calculate the distribution of a categorical variable, overall and by groups:
prop.table( table( mlces_df[ , "patsex" ] ) )
prop.table(
table( mlces_df[ , c( "patsex" , "claimant_relationship_to_policyholder" ) ] ) ,
margin = 2
)
Calculate the sum of a linear variable, overall and by groups:
sum( mlces_df[ , "totpdchg" ] )
tapply(
mlces_df[ , "totpdchg" ] ,
mlces_df[ , "claimant_relationship_to_policyholder" ] ,
sum
)
Calculate the median (50th percentile) of a linear variable, overall and by groups:
Replication Example
This example matches statistics in Table II-A’s 1999 row numbers 52 and 53 from the Database:
Match Claimants Exceeding Deductible:
# $0 deductible
stopifnot( nrow( mlces_df ) == 1591738 )
# $1,000 deductible
mlces_above_1000_df <- subset( mlces_df , totpdchg > 1000 )
stopifnot( nrow( mlces_above_1000_df ) == 402550 )
Match the Excess Charges Above Deductible:
# $0 deductible
stopifnot( round( sum( mlces_df[ , 'totpdchg' ] ) , 0 ) == 2599356658 )
# $1,000 deductible
stopifnot( round( sum( mlces_above_1000_df[ , 'totpdchg' ] - 1000 ) , 0 ) == 1883768786 )
Analysis Examples with dplyr
The R dplyr
library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize
, group_by
, and mutate
, the convenience of pipe-able functions, and the tidyverse
style of non-standard evaluation. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:
Calculate the mean (average) of a linear variable, overall and by groups:
mlces_tbl %>%
summarize( mean = mean( totpdchg ) )
mlces_tbl %>%
group_by( claimant_relationship_to_policyholder ) %>%
summarize( mean = mean( totpdchg ) )
Analysis Examples with data.table
The R data.table
library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax: fast to type, fast to read, fast speed, memory efficiency, a careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:
Calculate the mean (average) of a linear variable, overall and by groups:
mlces_dt[ , mean( totpdchg ) ]
mlces_dt[ , mean( totpdchg ) , by = claimant_relationship_to_policyholder ]
Analysis Examples with duckdb
The R duckdb
library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'mlces' , mlces_df )
Calculate the mean (average) of a linear variable, overall and by groups: