Medical Large Claims Experience Study (MLCES)

License: GPL v3 Github Actions Badge

A high quality dataset of medical claims from seven private health insurance companies.

  • One table with one row per individual with nonzero total paid charges.

  • A convenience sample of group (employer-sponsored) health insurers in the United States.

  • 1997 thru 1999 with no expected updates in the future.

  • Provided by the Society of Actuaries (SOA).


Please skim before you begin:

  1. Group Medical Insurance Claims Database Collection and Analysis Report

  2. Claim Severities, Claim Relativities, and Age: Evidence from SOA Group Health Data

  3. A haiku regarding this microdata:

# skewed by black swan tails
# means, medians sing adieu
# claims distribution

Download, Import, Preparation

Download and import the 1999 medical claims file:

tf <- tempfile()

this_url <- "https://www.soa.org/Files/Research/1999.zip"

download.file( this_url , tf , mode = 'wb' )

unzipped_file <- unzip( tf , exdir = tempdir() )

mlces_df <- read.csv( unzipped_file )

names( mlces_df ) <- tolower( names( mlces_df ) )

Save Locally  

Save the object at any point:

# mlces_fn <- file.path( path.expand( "~" ) , "MLCES" , "this_file.rds" )
# saveRDS( mlces_df , file = mlces_fn , compress = FALSE )

Load the same object:

# mlces_df <- readRDS( mlces_fn )

Variable Recoding

Add new columns to the data set:

mlces_df <- 
    transform( 
        mlces_df , 
        
        one = 1 ,
        
        claimant_relationship_to_policyholder =
            ifelse( relation == "E" , "covered employee" ,
            ifelse( relation == "S" , "spouse of covered employee" ,
            ifelse( relation == "D" , "dependent of covered employee" , NA ) ) ) ,
            
        ppo_plan = as.numeric( ppo == 'Y' )
    )

Analysis Examples with base R  

Unweighted Counts

Count the unweighted number of records in the table, overall and by groups:

nrow( mlces_df )

table( mlces_df[ , "claimant_relationship_to_policyholder" ] , useNA = "always" )

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

mean( mlces_df[ , "totpdchg" ] )

tapply(
    mlces_df[ , "totpdchg" ] ,
    mlces_df[ , "claimant_relationship_to_policyholder" ] ,
    mean 
)

Calculate the distribution of a categorical variable, overall and by groups:

prop.table( table( mlces_df[ , "patsex" ] ) )

prop.table(
    table( mlces_df[ , c( "patsex" , "claimant_relationship_to_policyholder" ) ] ) ,
    margin = 2
)

Calculate the sum of a linear variable, overall and by groups:

sum( mlces_df[ , "totpdchg" ] )

tapply(
    mlces_df[ , "totpdchg" ] ,
    mlces_df[ , "claimant_relationship_to_policyholder" ] ,
    sum 
)

Calculate the median (50th percentile) of a linear variable, overall and by groups:

quantile( mlces_df[ , "totpdchg" ] , 0.5 )

tapply(
    mlces_df[ , "totpdchg" ] ,
    mlces_df[ , "claimant_relationship_to_policyholder" ] ,
    quantile ,
    0.5 
)

Subsetting

Limit your data.frame to persons under 18:

sub_mlces_df <- subset( mlces_df , ( ( claimyr - patbrtyr ) < 18 ) )

Calculate the mean (average) of this subset:

mean( sub_mlces_df[ , "totpdchg" ] )

Measures of Uncertainty

Calculate the variance, overall and by groups:

var( mlces_df[ , "totpdchg" ] )

tapply(
    mlces_df[ , "totpdchg" ] ,
    mlces_df[ , "claimant_relationship_to_policyholder" ] ,
    var 
)

Regression Models and Tests of Association

Perform a t-test:

t.test( totpdchg ~ ppo_plan , mlces_df )

Perform a chi-squared test of association:

this_table <- table( mlces_df[ , c( "ppo_plan" , "patsex" ) ] )

chisq.test( this_table )

Perform a generalized linear model:

glm_result <- 
    glm( 
        totpdchg ~ ppo_plan + patsex , 
        data = mlces_df
    )

summary( glm_result )

Replication Example

This example matches statistics in Table II-A’s 1999 row numbers 52 and 53 from the Database:

Match Claimants Exceeding Deductible:

# $0 deductible
stopifnot( nrow( mlces_df ) == 1591738 )

# $1,000 deductible
mlces_above_1000_df <- subset( mlces_df , totpdchg > 1000 )
stopifnot( nrow( mlces_above_1000_df ) == 402550 )

Match the Excess Charges Above Deductible:

# $0 deductible
stopifnot( round( sum( mlces_df[ , 'totpdchg' ] ) , 0 ) == 2599356658 )

# $1,000 deductible
stopifnot( round( sum( mlces_above_1000_df[ , 'totpdchg' ] - 1000 ) , 0 ) == 1883768786 )

Analysis Examples with dplyr  

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:

library(dplyr)
mlces_tbl <- as_tibble( mlces_df )

Calculate the mean (average) of a linear variable, overall and by groups:

mlces_tbl %>%
    summarize( mean = mean( totpdchg ) )

mlces_tbl %>%
    group_by( claimant_relationship_to_policyholder ) %>%
    summarize( mean = mean( totpdchg ) )

Analysis Examples with data.table  

The R data.table library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax: fast to type, fast to read, fast speed, memory efficiency, a careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:

library(data.table)
mlces_dt <- data.table( mlces_df )

Calculate the mean (average) of a linear variable, overall and by groups:

mlces_dt[ , mean( totpdchg ) ]

mlces_dt[ , mean( totpdchg ) , by = claimant_relationship_to_policyholder ]

Analysis Examples with duckdb  

The R duckdb library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for MLCES users, this code replicates previously-presented examples:

library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'mlces' , mlces_df )

Calculate the mean (average) of a linear variable, overall and by groups:

dbGetQuery( con , 'SELECT AVG( totpdchg ) FROM mlces' )

dbGetQuery(
    con ,
    'SELECT
        claimant_relationship_to_policyholder ,
        AVG( totpdchg )
    FROM
        mlces
    GROUP BY
        claimant_relationship_to_policyholder'
)