National Plan and Provider Enumeration System (NPPES)
The registry of every medical practitioner actively operating in the United States healthcare industry.
A single large table with one row per enumerated health care provider.
A census of individuals and organizations that bill for medical services in the United States.
Updated weekly with new providers.
Maintained by the United States Centers for Medicare & Medicaid Services (CMS)
Please skim before you begin:
A haiku regarding this microdata:
# how many doctors
# ranked sergeant, last name pepper
# practice in the states?
Download, Import, Preparation
Download and import the national file:
library(readr)
<- tempfile()
tf
<-
npi_datapage readLines( "http://download.cms.gov/nppes/NPI_Files.html" )
<- grep( 'NPPES_Data_Dissemination_' , npi_datapage , value = TRUE )
latest_files
<- latest_files[ !grepl( 'Weekly Update' , latest_files ) ]
latest_files
<-
this_url paste0(
"http://download.cms.gov/nppes/",
gsub( "(.*)(NPPES_Data_Dissemination_.*\\.zip)(.*)$", "\\2", latest_files )
)
download.file( this_url , tf , mode = 'wb' )
<- unzip( tf , exdir = tempdir() )
npi_files
<-
npi_filepath grep(
"npidata_pfile_20050523-([0-9]+)\\.csv" ,
npi_files ,value = TRUE
)
<-
column_names names(
read.csv(
npi_filepath , nrow = 1 )[ FALSE , , ]
)
<- gsub( "\\." , "_" , tolower( column_names ) )
column_names
<-
column_types ifelse(
grepl( "code" , column_names ) &
!grepl( "country|state|gender|taxonomy|postal" , column_names ) ,
'n' , 'c'
)
<-
columns_to_import c( "entity_type_code" , "provider_gender_code" , "provider_enumeration_date" ,
"is_sole_proprietor" , "provider_business_practice_location_address_state_name" )
stopifnot( all( columns_to_import %in% column_names ) )
# readr::read_csv() columns must match their order in the csv file
<-
columns_to_import order( match( columns_to_import , column_names ) ) ]
columns_to_import[
<-
nppes_tbl ::read_csv(
readr
npi_filepath , col_names = columns_to_import ,
col_types =
paste0(
ifelse( column_names %in% columns_to_import , column_types , '_' ) ,
collapse = ""
) ,skip = 1
)
<-
nppes_df data.frame( nppes_tbl )
Analysis Examples with base R
Unweighted Counts
Count the unweighted number of records in the table, overall and by groups:
nrow( nppes_df )
table( nppes_df[ , "provider_gender_code" ] , useNA = "always" )
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
mean( nppes_df[ , "provider_enumeration_year" ] , na.rm = TRUE )
tapply(
"provider_enumeration_year" ] ,
nppes_df[ , "provider_gender_code" ] ,
nppes_df[ ,
mean ,na.rm = TRUE
)
Calculate the distribution of a categorical variable, overall and by groups:
prop.table( table( nppes_df[ , "is_sole_proprietor" ] ) )
prop.table(
table( nppes_df[ , c( "is_sole_proprietor" , "provider_gender_code" ) ] ) ,
margin = 2
)
Calculate the sum of a linear variable, overall and by groups:
sum( nppes_df[ , "provider_enumeration_year" ] , na.rm = TRUE )
tapply(
"provider_enumeration_year" ] ,
nppes_df[ , "provider_gender_code" ] ,
nppes_df[ ,
sum ,na.rm = TRUE
)
Calculate the median (50th percentile) of a linear variable, overall and by groups:
quantile( nppes_df[ , "provider_enumeration_year" ] , 0.5 , na.rm = TRUE )
tapply(
"provider_enumeration_year" ] ,
nppes_df[ , "provider_gender_code" ] ,
nppes_df[ ,
quantile ,0.5 ,
na.rm = TRUE
)
Subsetting
Limit your data.frame
to California:
<- subset( nppes_df , state_name = 'CA' ) sub_nppes_df
Calculate the mean (average) of this subset:
mean( sub_nppes_df[ , "provider_enumeration_year" ] , na.rm = TRUE )
Measures of Uncertainty
Calculate the variance, overall and by groups:
var( nppes_df[ , "provider_enumeration_year" ] , na.rm = TRUE )
tapply(
"provider_enumeration_year" ] ,
nppes_df[ , "provider_gender_code" ] ,
nppes_df[ ,
var ,na.rm = TRUE
)
Regression Models and Tests of Association
Perform a t-test:
t.test( provider_enumeration_year ~ individual , nppes_df )
Perform a chi-squared test of association:
<- table( nppes_df[ , c( "individual" , "is_sole_proprietor" ) ] )
this_table
chisq.test( this_table )
Perform a generalized linear model:
<-
glm_result glm(
~ individual + is_sole_proprietor ,
provider_enumeration_year data = nppes_df
)
summary( glm_result )
Analysis Examples with dplyr
The R dplyr
library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize
, group_by
, and mutate
, the convenience of pipe-able functions, and the tidyverse
style of non-standard evaluation. This vignette details the available features. As a starting point for NPPES users, this code replicates previously-presented examples:
library(dplyr)
<- as_tibble( nppes_df ) nppes_tbl
Calculate the mean (average) of a linear variable, overall and by groups:
%>%
nppes_tbl summarize( mean = mean( provider_enumeration_year , na.rm = TRUE ) )
%>%
nppes_tbl group_by( provider_gender_code ) %>%
summarize( mean = mean( provider_enumeration_year , na.rm = TRUE ) )
Analysis Examples with data.table
The R data.table
library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax: fast to type, fast to read, fast speed, memory efficiency, a careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for NPPES users, this code replicates previously-presented examples:
library(data.table)
<- data.table( nppes_df ) nppes_dt
Calculate the mean (average) of a linear variable, overall and by groups:
mean( provider_enumeration_year , na.rm = TRUE ) ]
nppes_dt[ ,
mean( provider_enumeration_year , na.rm = TRUE ) , by = provider_gender_code ] nppes_dt[ ,
Analysis Examples with duckdb
The R duckdb
library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for NPPES users, this code replicates previously-presented examples:
library(duckdb)
<- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
con dbWriteTable( con , 'nppes' , nppes_df )
Calculate the mean (average) of a linear variable, overall and by groups:
dbGetQuery( con , 'SELECT AVG( provider_enumeration_year ) FROM nppes' )
dbGetQuery(
con ,'SELECT
provider_gender_code ,
AVG( provider_enumeration_year )
FROM
nppes
GROUP BY
provider_gender_code'
)