Medical Large Claims Experience Study (MLCES)
A high quality dataset of medical claims from seven private health insurance companies.
- One table with one row per individual with nonzero total paid charges. 
- A convenience sample of group (employer-sponsored) health insurers in the United States. 
- 1997 thru 1999 with no expected updates in the future. 
- Provided by the Society of Actuaries (SOA). 
Recommended Reading
Two Methodology Documents:
Group Medical Insurance Claims Database Collection and Analysis Report
Claim Severities, Claim Relativities, and Age: Evidence from SOA Group Health Data
One Haiku:
Download, Import, Preparation
Download and import the 1999 medical claims file:
tf <- tempfile()
this_url <- "https://www.soa.org/Files/Research/1999.zip"
download.file( this_url , tf , mode = 'wb' )
unzipped_file <- unzip( tf , exdir = tempdir() )
mlces_df <- read.csv( unzipped_file )
names( mlces_df ) <- tolower( names( mlces_df ) )Save Locally
Save the object at any point:
# mlces_fn <- file.path( path.expand( "~" ) , "MLCES" , "this_file.rds" )
# saveRDS( mlces_df , file = mlces_fn , compress = FALSE )Load the same object:
Variable Recoding
Add new columns to the data set:
mlces_df <- 
    transform( 
        mlces_df , 
        
        one = 1 ,
        
        claimant_relationship_to_policyholder =
            ifelse( relation == "E" , "covered employee" ,
            ifelse( relation == "S" , "spouse of covered employee" ,
            ifelse( relation == "D" , "dependent of covered employee" , NA ) ) ) ,
            
        ppo_plan = as.numeric( ppo == 'Y' )
    )Analysis Examples with base R
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
mean( mlces_df[ , "totpdchg" ] )
tapply(
    mlces_df[ , "totpdchg" ] ,
    mlces_df[ , "claimant_relationship_to_policyholder" ] ,
    mean 
)Calculate the distribution of a categorical variable, overall and by groups:
prop.table( table( mlces_df[ , "patsex" ] ) )
prop.table(
    table( mlces_df[ , c( "patsex" , "claimant_relationship_to_policyholder" ) ] ) ,
    margin = 2
)Calculate the sum of a linear variable, overall and by groups:
sum( mlces_df[ , "totpdchg" ] )
tapply(
    mlces_df[ , "totpdchg" ] ,
    mlces_df[ , "claimant_relationship_to_policyholder" ] ,
    sum 
)Calculate the median (50th percentile) of a linear variable, overall and by groups:
Replication Example
This example matches statistics in Table II-A’s 1999 row numbers 52 and 53 from the Database:
Match Claimants Exceeding Deductible:
# $0 deductible
stopifnot( nrow( mlces_df ) == 1591738 )
# $1,000 deductible
mlces_above_1000_df <- subset( mlces_df , totpdchg > 1000 )
stopifnot( nrow( mlces_above_1000_df ) == 402550 )Match the Excess Charges Above Deductible:
# $0 deductible
stopifnot( round( sum( mlces_df[ , 'totpdchg' ] ) , 0 ) == 2599356658 )
# $1,000 deductible
stopifnot( round( sum( mlces_above_1000_df[ , 'totpdchg' ] - 1000 ) , 0 ) == 1883768786 )Analysis Examples with dplyr  
The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:
Calculate the mean (average) of a linear variable, overall and by groups:
mlces_tbl %>%
    summarize( mean = mean( totpdchg ) )
mlces_tbl %>%
    group_by( claimant_relationship_to_policyholder ) %>%
    summarize( mean = mean( totpdchg ) )Analysis Examples with data.table  
The R data.table library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax: fast to type, fast to read, fast speed, memory efficiency, a careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:
Calculate the mean (average) of a linear variable, overall and by groups:
mlces_dt[ , mean( totpdchg ) ]
mlces_dt[ , mean( totpdchg ) , by = claimant_relationship_to_policyholder ]Analysis Examples with duckdb  
The R duckdb library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'mlces' , mlces_df )Calculate the mean (average) of a linear variable, overall and by groups: