Exame Nacional do Ensino Medio (ENEM)


Brazil's national student aptitude test, used to assess high school completion and to determine university admission.


Please skim before you begin:

  1. Leia_Me_Enem included in each annual zipped file

  2. Wikipedia Entry

  3. This human-composed haiku or a bouquet of artificial intelligence-generated limericks

# graduation stage
# shake hands, toss cap, unroll scroll,
# mais um exame?

Download, Import, Preparation

Download and unzip the 2022 file:

library(httr)
library(archive)

tf <- tempfile()

this_url <- "https://download.inep.gov.br/microdados/microdados_enem_2022.zip"

GET( this_url , write_disk( tf ) , progress() )

archive_extract( tf , dir = tempdir() )

Import the 2022 file:

library(readr)

enem_fns <- list.files( tempdir() , recursive = TRUE , full.names = TRUE )

enem_fn <- grep( "MICRODADOS_ENEM_([0-9][0-9][0-9][0-9])\\.csv$" , enem_fns , value = TRUE )

enem_tbl <- read_csv2( enem_fn , locale = locale( encoding = 'latin1' ) )

enem_df <- data.frame( enem_tbl )

names( enem_df ) <- tolower( names( enem_df ) )

Save locally  

Save the object at any point:

# enem_fn <- file.path( path.expand( "~" ) , "ENEM" , "this_file.rds" )
# saveRDS( enem_df , file = enem_fn , compress = FALSE )

Load the same object:

# enem_df <- readRDS( enem_fn )

Variable Recoding

Add new columns to the data set:

enem_df <- 
    transform( 
        enem_df , 
        
        domestic_worker = as.numeric( q007 %in% c( 'B' , 'C' , 'D' ) ) ,
        
        administrative_category =
            factor(
                tp_dependencia_adm_esc ,
                levels = 1:4 ,
                labels = c( 'Federal' , 'Estadual' , 'Municipal' , 'Privada' )
            ) ,

        state_name = 
            factor( 
                co_uf_esc , 
                levels = c( 11:17 , 21:29 , 31:33 , 35 , 41:43 , 50:53 ) ,
                labels = c( "Rondonia" , "Acre" , "Amazonas" , 
                "Roraima" , "Para" , "Amapa" , "Tocantins" , 
                "Maranhao" , "Piaui" , "Ceara" , "Rio Grande do Norte" , 
                "Paraiba" , "Pernambuco" , "Alagoas" , "Sergipe" , 
                "Bahia" , "Minas Gerais" , "Espirito Santo" , 
                "Rio de Janeiro" , "Sao Paulo" , "Parana" , 
                "Santa Catarina" , "Rio Grande do Sul" , 
                "Mato Grosso do Sul" , "Mato Grosso" , "Goias" , 
                "Distrito Federal" )
            )

    )
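The recodes above can be checked on a few made-up records before running them on the full file. This is a minimal sketch: `toy_df` and its values (including the code 9) are hypothetical, and only the column names mirror the ENEM ones used above. It shows two behaviors worth knowing: `%in%` treats a missing answer as FALSE rather than NA, and `factor()` turns any code not listed in `levels` into NA.

```r
# hypothetical toy records, not real ENEM data
toy_df <-
    data.frame(
        q007 = c( 'A' , 'B' , 'D' , NA ) ,
        tp_dependencia_adm_esc = c( 2 , 4 , NA , 9 )
    )

toy_df <-
    transform(
        toy_df ,

        # %in% returns FALSE (never NA) when the answer is missing
        domestic_worker = as.numeric( q007 %in% c( 'B' , 'C' , 'D' ) ) ,

        # codes absent from `levels` (here NA and the made-up 9) become NA
        administrative_category =
            factor(
                tp_dependencia_adm_esc ,
                levels = 1:4 ,
                labels = c( 'Federal' , 'Estadual' , 'Municipal' , 'Privada' )
            )
    )

# inspect the recoded columns, keeping missing categories visible
table( toy_df[ , "administrative_category" ] , useNA = "always" )
```

If missing answers should stay missing in `domestic_worker`, the `%in%` recode would need an explicit `ifelse( is.na( q007 ) , NA , ... )` wrapper; whether that is preferable depends on the analysis.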

Analysis Examples with base R  

Unweighted Counts

Count the unweighted number of records in the table, overall and by groups:

nrow( enem_df )

table( enem_df[ , "administrative_category" ] , useNA = "always" )

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

mean( enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )

tapply(
    enem_df[ , "nu_nota_mt" ] ,
    enem_df[ , "administrative_category" ] ,
    mean ,
    na.rm = TRUE 
)

Calculate the distribution of a categorical variable, overall and by groups:

prop.table( table( enem_df[ , "state_name" ] ) )

prop.table(
    table( enem_df[ , c( "state_name" , "administrative_category" ) ] ) ,
    margin = 2
)
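The `margin` argument decides which totals the shares are computed against. The sketch below uses a small hypothetical table (the six records are invented) to show that `margin = 2` makes each column sum to one, while `margin = 1` does the same for rows.

```r
# hypothetical toy cross-tabulation, not real ENEM data
toy_table <-
    table(
        state_name = c( "Acre" , "Acre" , "Bahia" , "Bahia" , "Bahia" , "Acre" ) ,
        administrative_category =
            c( "Estadual" , "Privada" , "Estadual" , "Estadual" , "Privada" , "Estadual" )
    )

# margin = 2 divides each cell by its column total:
# the distribution of states within each administrative category
col_shares <- prop.table( toy_table , margin = 2 )
colSums( col_shares )

# margin = 1 divides each cell by its row total:
# the distribution of administrative categories within each state
row_shares <- prop.table( toy_table , margin = 1 )
rowSums( row_shares )
```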

Calculate the sum of a linear variable, overall and by groups:

sum( enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )

tapply(
    enem_df[ , "nu_nota_mt" ] ,
    enem_df[ , "administrative_category" ] ,
    sum ,
    na.rm = TRUE 
)

Calculate the median (50th percentile) of a linear variable, overall and by groups:

quantile( enem_df[ , "nu_nota_mt" ] , 0.5 , na.rm = TRUE )

tapply(
    enem_df[ , "nu_nota_mt" ] ,
    enem_df[ , "administrative_category" ] ,
    quantile ,
    0.5 ,
    na.rm = TRUE 
)
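In the grouped call above, everything after the function name is handed through to `quantile`, so `0.5` becomes its `probs` argument. A minimal sketch with invented scores (the `toy_` objects are hypothetical) confirms that this matches `median` within each group:

```r
# hypothetical toy scores and groups, not real ENEM data
toy_scores <- c( 400 , 500 , 600 , 450 , 650 , NA )
toy_groups <- c( 'Estadual' , 'Estadual' , 'Estadual' , 'Privada' , 'Privada' , 'Privada' )

# arguments after FUN are passed along to quantile():
# each group is evaluated as quantile( x , 0.5 , na.rm = TRUE )
group_q50 <- tapply( toy_scores , toy_groups , quantile , 0.5 , na.rm = TRUE )

# the same result using median() directly
group_median <- tapply( toy_scores , toy_groups , median , na.rm = TRUE )

group_q50
group_median
```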

Subsetting

Limit your data.frame to students whose mothers graduated from high school:

sub_enem_df <- subset( enem_df , q002 %in% c( 'E' , 'F' , 'G' ) )

Calculate the mean (average) of this subset:

mean( sub_enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )

Measures of Uncertainty

Calculate the variance, overall and by groups:

var( enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )

tapply(
    enem_df[ , "nu_nota_mt" ] ,
    enem_df[ , "administrative_category" ] ,
    var ,
    na.rm = TRUE 
)
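A variance is often easier to read once converted to a standard deviation, which is on the same scale as the scores. The sketch below uses invented values (`toy_scores` is hypothetical). The final line computes a standard error under the assumption of a simple random sample, purely as an arithmetic illustration; whether that assumption is appropriate for ENEM records is not addressed here.

```r
# hypothetical toy scores, not real ENEM data
toy_scores <- c( 400 , 500 , 600 , 450 , 650 , NA )

# variance of the non-missing scores
toy_var <- var( toy_scores , na.rm = TRUE )

# standard deviation is the square root of the variance
toy_sd <- sqrt( toy_var )

# count only the records that contributed to the variance
n_obs <- sum( !is.na( toy_scores ) )

# standard error of the mean, ASSUMING a simple random sample
toy_se <- sqrt( toy_var / n_obs )

c( variance = toy_var , sd = toy_sd , n = n_obs , se = toy_se )
```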

Regression Models and Tests of Association

Perform a t-test:

t.test( nu_nota_mt ~ domestic_worker , enem_df )
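The object returned by `t.test` can be stored and its pieces pulled out by name. This sketch runs the same formula on ten invented records (`toy_df` is hypothetical). Note that `t.test` defaults to the Welch test, which does not assume equal variances in the two groups.

```r
# hypothetical toy records, not real ENEM data
toy_df <-
    data.frame(
        nu_nota_mt = c( 480 , 510 , 530 , 560 , 590 , 420 , 450 , 470 , 500 , 520 ) ,
        domestic_worker = rep( c( 0 , 1 ) , each = 5 )
    )

toy_test <- t.test( nu_nota_mt ~ domestic_worker , toy_df )

# which variant of the test was run
toy_test$method

# mean score within each group
toy_test$estimate

# p-value and confidence interval for the difference in means
toy_test$p.value
toy_test$conf.int
```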

Perform a chi-squared test of association:

this_table <- table( enem_df[ , c( "domestic_worker" , "state_name" ) ] )

chisq.test( this_table )

Fit a generalized linear model (with no family specified, glm uses the gaussian family):

glm_result <- 
    glm( 
        nu_nota_mt ~ domestic_worker + state_name , 
        data = enem_df
    )

summary( glm_result )
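Because no `family` is given, the model above is an ordinary linear regression. A minimal sketch on invented records (`toy_df` is hypothetical) shows that `glm` with its default family reproduces the coefficients from `lm`:

```r
# hypothetical toy records, not real ENEM data
toy_df <-
    data.frame(
        nu_nota_mt = c( 480 , 510 , 530 , 560 , 590 , 420 , 450 , 470 , 500 , 520 ) ,
        domestic_worker = rep( c( 0 , 1 ) , each = 5 ) ,
        state_name = factor( rep( c( "Acre" , "Bahia" ) , times = 5 ) )
    )

toy_glm <- glm( nu_nota_mt ~ domestic_worker + state_name , data = toy_df )

toy_lm <- lm( nu_nota_mt ~ domestic_worker + state_name , data = toy_df )

# the default family is gaussian with an identity link
family( toy_glm )

# both fits yield the same coefficients
coef( toy_glm )
coef( toy_lm )
```

For a binary outcome, passing `family = binomial()` would switch the same call to logistic regression.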

Replication Example

This example matches the registration counts in the Sinopse ENEM 2022 Excel table:

stopifnot( nrow( enem_df ) == 3476105 )

Analysis Examples with dplyr  

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:

library(dplyr)
enem_tbl <- as_tibble( enem_df )

Calculate the mean (average) of a linear variable, overall and by groups:

enem_tbl %>%
    summarize( mean = mean( nu_nota_mt , na.rm = TRUE ) )

enem_tbl %>%
    group_by( administrative_category ) %>%
    summarize( mean = mean( nu_nota_mt , na.rm = TRUE ) )

Analysis Examples with data.table  

The R data.table library provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax that is fast to type and fast to read, along with fast execution, memory efficiency, careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:

library(data.table)
enem_dt <- data.table( enem_df )

Calculate the mean (average) of a linear variable, overall and by groups:

enem_dt[ , mean( nu_nota_mt , na.rm = TRUE ) ]

enem_dt[ , mean( nu_nota_mt , na.rm = TRUE ) , by = administrative_category ]

Analysis Examples with duckdb  

The R duckdb library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:

library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'enem' , enem_df )

Calculate the mean (average) of a linear variable, overall and by groups:

dbGetQuery( con , 'SELECT AVG( nu_nota_mt ) FROM enem' )

dbGetQuery(
    con ,
    'SELECT
        administrative_category ,
        AVG( nu_nota_mt )
    FROM
        enem
    GROUP BY
        administrative_category'
)