Area Health Resources Files (AHRF)

License: GPL v3 Github Actions Badge

National, state, and county-level data on health care professions, health facilities, population characteristics, health workforce training, hospital utilization and expenditure, and the environment.


Please skim before you begin:

  1. User Documentation for the County Area Health Resources File (AHRF) 2021-2022 Release

  2. Frequently Asked Questions

  3. A haiku regarding this microdata:

# local aggregates
# to spread merge join spline regress
# like fresh buttered bread

Download, Import, Preparation

Download and import the most current county-level file:

library(haven)

tf <- tempfile()

ahrf_url <- "https://data.hrsa.gov//DataDownload/AHRF/AHRF_2021-2022_SAS.zip"

download.file( ahrf_url , tf , mode = 'wb' )

unzipped_files <- unzip( tf , exdir = tempdir() )

sas_fn <- grep( "\\.sas7bdat$" , unzipped_files , value = TRUE )

ahrf_tbl <- read_sas( sas_fn )

ahrf_df <- data.frame( ahrf_tbl )

names( ahrf_df ) <- tolower( names( ahrf_df ) )

Save Locally  

Save the object at any point:

# ahrf_fn <- file.path( path.expand( "~" ) , "AHRF" , "this_file.rds" )
# saveRDS( ahrf_df , file = ahrf_fn , compress = FALSE )

Load the same object:

# ahrf_df <- readRDS( ahrf_fn )

Variable Recoding

Add new columns to the data set:

ahrf_df <- 
    transform( 
        ahrf_df , 
        
        cbsa_indicator_code = 
            factor( 
                as.numeric( f1406720 ) , 
                levels = 0:2 ,
                labels = c( "not metro" , "metro" , "micro" ) 
            ) ,
            
        mhi_2020 = f1322620 ,
        
        whole_county_hpsa_2022 = as.numeric( f0978722 ) == 1 ,
        
        census_region = 
            factor( 
                as.numeric( f04439 ) , 
                levels = 1:4 ,
                labels = c( "northeast" , "midwest" , "south" , "west" ) 
            )

    )

Analysis Examples with base R  

Unweighted Counts

Count the unweighted number of records in the table, overall and by groups:

nrow( ahrf_df )

table( ahrf_df[ , "cbsa_indicator_code" ] , useNA = "always" )

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

mean( ahrf_df[ , "mhi_2020" ] , na.rm = TRUE )

tapply(
    ahrf_df[ , "mhi_2020" ] ,
    ahrf_df[ , "cbsa_indicator_code" ] ,
    mean ,
    na.rm = TRUE 
)

Calculate the distribution of a categorical variable, overall and by groups:

prop.table( table( ahrf_df[ , "census_region" ] ) )

prop.table(
    table( ahrf_df[ , c( "census_region" , "cbsa_indicator_code" ) ] ) ,
    margin = 2
)

Calculate the sum of a linear variable, overall and by groups:

sum( ahrf_df[ , "mhi_2020" ] , na.rm = TRUE )

tapply(
    ahrf_df[ , "mhi_2020" ] ,
    ahrf_df[ , "cbsa_indicator_code" ] ,
    sum ,
    na.rm = TRUE 
)

Calculate the median (50th percentile) of a linear variable, overall and by groups:

quantile( ahrf_df[ , "mhi_2020" ] , 0.5 , na.rm = TRUE )

tapply(
    ahrf_df[ , "mhi_2020" ] ,
    ahrf_df[ , "cbsa_indicator_code" ] ,
    quantile ,
    0.5 ,
    na.rm = TRUE 
)

Subsetting

Limit your data.frame to California:

sub_ahrf_df <- subset( ahrf_df , f12424 == "CA" )

Calculate the mean (average) of this subset:

mean( sub_ahrf_df[ , "mhi_2020" ] , na.rm = TRUE )

Measures of Uncertainty

Calculate the variance, overall and by groups:

var( ahrf_df[ , "mhi_2020" ] , na.rm = TRUE )

tapply(
    ahrf_df[ , "mhi_2020" ] ,
    ahrf_df[ , "cbsa_indicator_code" ] ,
    var ,
    na.rm = TRUE 
)

Regression Models and Tests of Association

Perform a t-test:

t.test( mhi_2020 ~ whole_county_hpsa_2022 , ahrf_df )

Perform a chi-squared test of association:

this_table <- table( ahrf_df[ , c( "whole_county_hpsa_2022" , "census_region" ) ] )

chisq.test( this_table )

Perform a generalized linear model:

glm_result <- 
    glm( 
        mhi_2020 ~ whole_county_hpsa_2022 + census_region , 
        data = ahrf_df
    )

summary( glm_result )

Replication Example

Match the record count in row number 8,543 of AHRF 2021-2022 Technical Documentation.xlsx:

stopifnot( nrow( ahrf_df ) == 3232 )

Analysis Examples with dplyr  

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for AHRF users, this code replicates previously-presented examples:

library(dplyr)
ahrf_tbl <- as_tibble( ahrf_df )

Calculate the mean (average) of a linear variable, overall and by groups:

ahrf_tbl %>%
    summarize( mean = mean( mhi_2020 , na.rm = TRUE ) )

ahrf_tbl %>%
    group_by( cbsa_indicator_code ) %>%
    summarize( mean = mean( mhi_2020 , na.rm = TRUE ) )

Analysis Examples with data.table  

The R data.table library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax: fast to type, fast to read, fast speed, memory efficiency, a careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for AHRF users, this code replicates previously-presented examples:

library(data.table)
ahrf_dt <- data.table( ahrf_df )

Calculate the mean (average) of a linear variable, overall and by groups:

ahrf_dt[ , mean( mhi_2020 , na.rm = TRUE ) ]

ahrf_dt[ , mean( mhi_2020 , na.rm = TRUE ) , by = cbsa_indicator_code ]

Analysis Examples with duckdb  

The R duckdb library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for AHRF users, this code replicates previously-presented examples:

library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'ahrf' , ahrf_df )

Calculate the mean (average) of a linear variable, overall and by groups:

dbGetQuery( con , 'SELECT AVG( mhi_2020 ) FROM ahrf' )

dbGetQuery(
    con ,
    'SELECT
        cbsa_indicator_code ,
        AVG( mhi_2020 )
    FROM
        ahrf
    GROUP BY
        cbsa_indicator_code'
)