National Plan and Provider Enumeration System (NPPES)
The registry of every medical practitioner actively operating in the United States healthcare industry.
A single large table with one row per enumerated health care provider.
A census of individuals and organizations that bill for medical services in the United States.
Updated weekly with new providers.
Maintained by the United States Centers for Medicare & Medicaid Services (CMS).
Download, Import, Preparation
Download and import the national file:
library(readr)
tf <- tempfile()
# read the CMS download page and keep the full file, not the weekly updates
npi_datapage <-
	readLines( "http://download.cms.gov/nppes/NPI_Files.html" )
latest_files <- grep( 'NPPES_Data_Dissemination_' , npi_datapage , value = TRUE )
latest_files <- latest_files[ !grepl( 'Weekly Update' , latest_files ) ]
# pull the zip filename out of the matching html line
this_url <-
	paste0(
		"http://download.cms.gov/nppes/",
		gsub( "(.*)(NPPES_Data_Dissemination_.*\\.zip)(.*)$", "\\2", latest_files )
	)
download.file( this_url , tf , mode = 'wb' )
# unzip and locate the main data file by its filename pattern
npi_files <- unzip( tf , exdir = tempdir() )
npi_filepath <-
	grep(
		"npidata_pfile_20050523-([0-9]+)\\.csv" ,
		npi_files ,
		value = TRUE
	)
# read a single row, only to recover the column names
column_names <-
	names( read.csv( npi_filepath , nrows = 1 ) )
column_names <- gsub( "\\." , "_" , tolower( column_names ) )
# import "code" columns as numeric unless they hold text codes
column_types <-
	ifelse(
		grepl( "code" , column_names ) &
		!grepl( "country|state|gender|taxonomy|postal" , column_names ) ,
		'n' , 'c'
	)
columns_to_import <-
	c( "entity_type_code" , "provider_gender_code" , "provider_enumeration_date" ,
	"is_sole_proprietor" , "provider_business_practice_location_address_state_name" )
stopifnot( all( columns_to_import %in% column_names ) )
# readr::read_csv() columns must match their order in the csv file
columns_to_import <-
	columns_to_import[ order( match( columns_to_import , column_names ) ) ]
nppes_tbl <-
	readr::read_csv(
		npi_filepath ,
		col_names = columns_to_import ,
		col_types =
			paste0(
				ifelse( column_names %in% columns_to_import , column_types , '_' ) ,
				collapse = ""
			) ,
		skip = 1
	)
nppes_df <-
	data.frame( nppes_tbl )
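The analysis examples that follow refer to two columns, individual and provider_enumeration_year, that are not among the imported columns. A minimal sketch of one way to derive them, shown on a made-up stand-in data frame (toy_df and its values are hypothetical). It assumes that entity type code 1 denotes an individual provider and that enumeration dates are stored as MM/DD/YYYY text:

```r
# toy stand-in for nppes_df; the values are invented for illustration
toy_df <-
	data.frame(
		entity_type_code = c( 1 , 1 , 2 , 1 ) ,
		provider_enumeration_date =
			c( "05/23/2005" , "11/02/2010" , "01/15/2007" , NA ) ,
		stringsAsFactors = FALSE
	)

toy_df <-
	transform(
		toy_df ,

		# assumption: entity type 1 marks an individual, 2 an organization
		individual = as.numeric( entity_type_code == 1 ) ,

		# assumption: dates are MM/DD/YYYY, so characters 7-10 hold the year
		provider_enumeration_year =
			as.numeric( substr( provider_enumeration_date , 7 , 10 ) )
	)
```

Applying the same transform() call to nppes_df would add both columns there, provided those two assumptions hold for the downloaded file.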
Analysis Examples with base R
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
mean( nppes_df[ , "provider_enumeration_year" ] , na.rm = TRUE )
tapply(
	nppes_df[ , "provider_enumeration_year" ] ,
	nppes_df[ , "provider_gender_code" ] ,
	mean ,
	na.rm = TRUE
)
Calculate the distribution of a categorical variable, overall and by groups:
prop.table( table( nppes_df[ , "is_sole_proprietor" ] ) )
prop.table(
	table( nppes_df[ , c( "is_sole_proprietor" , "provider_gender_code" ) ] ) ,
	margin = 2
)
Calculate the sum of a linear variable, overall and by groups:
sum( nppes_df[ , "provider_enumeration_year" ] , na.rm = TRUE )
tapply(
	nppes_df[ , "provider_enumeration_year" ] ,
	nppes_df[ , "provider_gender_code" ] ,
	sum ,
	na.rm = TRUE
)
Calculate the median (50th percentile) of a linear variable, overall and by groups:
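No code accompanies this step here. A minimal sketch that follows the same pattern as the mean and sum examples, using quantile() at the 0.5 probability. The toy_df data frame and its values are made up so the block runs on its own:

```r
# toy stand-in for nppes_df; the values are invented for illustration
toy_df <-
	data.frame(
		provider_enumeration_year = c( 2005 , 2006 , 2010 , 2012 , NA ) ,
		provider_gender_code = c( "F" , "M" , "F" , "M" , "F" ) ,
		stringsAsFactors = FALSE
	)

# overall median, ignoring missing years
overall_median <-
	quantile( toy_df[ , "provider_enumeration_year" ] , 0.5 , na.rm = TRUE )

# median within each gender code
by_gender_median <-
	tapply(
		toy_df[ , "provider_enumeration_year" ] ,
		toy_df[ , "provider_gender_code" ] ,
		quantile ,
		0.5 ,
		na.rm = TRUE
	)
```

Swapping toy_df for nppes_df gives the corresponding NPPES calculation.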
Regression Models and Tests of Association
Perform a t-test:
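No code accompanies this step here. A minimal sketch using base R's t.test() with a formula, comparing enumeration year between the two levels of individual. The toy_df data frame is made up, and the choice of these two variables is an assumption based on the regression example below:

```r
# toy stand-in for nppes_df; the values are invented for illustration
toy_df <-
	data.frame(
		provider_enumeration_year =
			c( 2005 , 2006 , 2007 , 2008 , 2010 , 2011 , 2012 , 2013 ) ,
		individual = c( 0 , 0 , 0 , 0 , 1 , 1 , 1 , 1 )
	)

# Welch two-sample t-test (the t.test() default) of year by individual
t_result <- t.test( provider_enumeration_year ~ individual , data = toy_df )

t_result
```

The formula interface requires the grouping variable to have exactly two levels.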
Perform a chi-squared test of association:
this_table <- table( nppes_df[ , c( "individual" , "is_sole_proprietor" ) ] )
chisq.test( this_table )
Perform a generalized linear model:
glm_result <-
	glm(
		provider_enumeration_year ~ individual + is_sole_proprietor ,
		data = nppes_df
	)
summary( glm_result )
Analysis Examples with dplyr
The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for NPPES users, this code replicates previously-presented examples:
Calculate the mean (average) of a linear variable, overall and by groups:
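library(dplyr)
# rebuild the tibble from nppes_df so it carries the columns used in the base R examples
nppes_tbl <- as_tibble( nppes_df )
nppes_tbl %>%
	summarize( mean = mean( provider_enumeration_year , na.rm = TRUE ) )
nppes_tbl %>%
	group_by( provider_gender_code ) %>%
	summarize( mean = mean( provider_enumeration_year , na.rm = TRUE ) )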
Analysis Examples with data.table
The R data.table library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax that is fast to type and fast to read, along with speed, memory efficiency, careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for NPPES users, this code replicates previously-presented examples:
Calculate the mean (average) of a linear variable, overall and by groups:
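library(data.table)
# convert the data.frame used in the base R examples into a data.table
nppes_dt <- data.table( nppes_df )
nppes_dt[ , mean( provider_enumeration_year , na.rm = TRUE ) ]
nppes_dt[ , mean( provider_enumeration_year , na.rm = TRUE ) , by = provider_gender_code ]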
Analysis Examples with duckdb
The R duckdb library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for NPPES users, this code replicates previously-presented examples:
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'nppes' , nppes_df )
Calculate the mean (average) of a linear variable, overall and by groups:
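No query accompanies this step here. A minimal sketch using SQL AVG() through DBI's dbGetQuery(), overall and with GROUP BY. To stay self-contained it writes a made-up toy_df into an in-memory database instead of the my-db.duckdb file; the mean_year alias is also an illustrative choice:

```r
library(DBI)
library(duckdb)

# toy stand-in for nppes_df; the values are invented for illustration
toy_df <-
	data.frame(
		provider_enumeration_year = c( 2005 , 2006 , 2010 , 2012 ) ,
		provider_gender_code = c( "F" , "M" , "F" , "M" ) ,
		stringsAsFactors = FALSE
	)

# in-memory database, so nothing is written to disk
con <- dbConnect( duckdb::duckdb() , dbdir = ":memory:" )

dbWriteTable( con , 'nppes' , toy_df )

# overall mean
overall_mean <-
	dbGetQuery(
		con ,
		'SELECT AVG( provider_enumeration_year ) AS mean_year FROM nppes'
	)

# mean within each gender code
by_gender_mean <-
	dbGetQuery(
		con ,
		'SELECT
			provider_gender_code ,
			AVG( provider_enumeration_year ) AS mean_year
		FROM nppes
		GROUP BY provider_gender_code
		ORDER BY provider_gender_code'
	)

dbDisconnect( con , shutdown = TRUE )
```

SQL AVG() skips NULL values, which plays the same role as na.rm = TRUE in the R examples.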