Exame Nacional do Ensino Medio (ENEM)
The national student aptitude test, used to assess high school completion and university admission.
One table with one row per test-taking student, and a second with one row per study habit questionnaire respondent.
Updated annually since 1998.
Maintained by Brazil’s Instituto Nacional de Estudos e Pesquisas Educacionais Anisio Teixeira (INEP).
Please skim before you begin:
Leia_Me_Enem, included in each annual zipped file
A haiku regarding this microdata:
# graduation stage
# shake hands, toss cap, unroll scroll,
# mais um exame?
Download, Import, Preparation
Download and unzip the 2022 file:
library(httr)
library(archive)
tf <- tempfile()
this_url <- "https://download.inep.gov.br/microdados/microdados_enem_2022.zip"
GET( this_url , write_disk( tf ) , progress() )
archive_extract( tf , dir = tempdir() )
Import the 2022 file:
library(readr)
enem_fns <- list.files( tempdir() , recursive = TRUE , full.names = TRUE )
enem_fn <- grep( "MICRODADOS_ENEM_([0-9][0-9][0-9][0-9])\\.csv$" , enem_fns , value = TRUE )
enem_tbl <- read_csv2( enem_fn , locale = locale( encoding = 'latin1' ) )
enem_df <- data.frame( enem_tbl )
names( enem_df ) <- tolower( names( enem_df ) )
Save Locally
Save the object at any point:
# enem_fn <- file.path( path.expand( "~" ) , "ENEM" , "this_file.rds" )
# saveRDS( enem_df , file = enem_fn , compress = FALSE )
Load the same object:
# enem_df <- readRDS( enem_fn )
Variable Recoding
Add new columns to the data set:
enem_df <-
	transform(
		enem_df ,
		domestic_worker = as.numeric( q007 %in% c( 'B' , 'C' , 'D' ) ) ,
		administrative_category =
			factor(
				tp_dependencia_adm_esc ,
				levels = 1:4 ,
				labels = c( 'Federal' , 'Estadual' , 'Municipal' , 'Privada' )
			) ,
		state_name =
			factor(
				co_uf_esc ,
				levels = c( 11:17 , 21:29 , 31:33 , 35 , 41:43 , 50:53 ) ,
				labels = c( "Rondonia" , "Acre" , "Amazonas" ,
					"Roraima" , "Para" , "Amapa" , "Tocantins" ,
					"Maranhao" , "Piaui" , "Ceara" , "Rio Grande do Norte" ,
					"Paraiba" , "Pernambuco" , "Alagoas" , "Sergipe" ,
					"Bahia" , "Minas Gerais" , "Espirito Santo" ,
					"Rio de Janeiro" , "Sao Paulo" , "Parana" ,
					"Santa Catarina" , "Rio Grande do Sul" ,
					"Mato Grosso do Sul" , "Mato Grosso" , "Goias" ,
					"Distrito Federal" )
			)
	)
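As a small illustration of how the factor() calls above turn numeric codes into labels, here is a sketch on a made-up vector. The names toy_codes and toy_category are hypothetical and not part of the microdata. Any code that does not appear in levels, like a missing value, becomes NA.

```r
# hypothetical codes: 1 through 4 are valid, 9 is not listed in `levels`
toy_codes <- c( 1 , 2 , 4 , 9 , NA )

toy_category <-
	factor(
		toy_codes ,
		levels = 1:4 ,
		labels = c( 'Federal' , 'Estadual' , 'Municipal' , 'Privada' )
	)

# unmatched and missing codes are both counted under NA
table( toy_category , useNA = "always" )
```

Tabulating a recoded column with useNA = "always" is a quick way to check that no unexpected codes were silently dropped.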
Analysis Examples with base R
Unweighted Counts
Count the unweighted number of records in the table, overall and by groups:
nrow( enem_df )
table( enem_df[ , "administrative_category" ] , useNA = "always" )
Descriptive Statistics
Calculate the mean (average) of a linear variable, overall and by groups:
mean( enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )
tapply(
	enem_df[ , "nu_nota_mt" ] ,
	enem_df[ , "administrative_category" ] ,
	mean ,
	na.rm = TRUE
)
Calculate the distribution of a categorical variable, overall and by groups:
prop.table( table( enem_df[ , "state_name" ] ) )
prop.table(
table( enem_df[ , c( "state_name" , "administrative_category" ) ] ) ,
margin = 2
)
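To make the effect of margin = 2 concrete, the following sketch uses a tiny made-up data.frame (toy_df and its columns are hypothetical). With margin = 2, shares are computed within each column, so every column of the result sums to one.

```r
# hypothetical five-row data set with two categorical columns
toy_df <-
	data.frame(
		state = c( 'A' , 'A' , 'B' , 'B' , 'B' ) ,
		category = c( 'x' , 'y' , 'x' , 'x' , 'y' )
	)

# column-wise shares: the distribution of `state` within each `category`
toy_shares <-
	prop.table(
		table( toy_df[ , c( "state" , "category" ) ] ) ,
		margin = 2
	)

# each column sums to one
colSums( toy_shares )
```

Switching to margin = 1 would instead give the distribution of category within each state.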
Calculate the sum of a linear variable, overall and by groups:
sum( enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )
tapply(
	enem_df[ , "nu_nota_mt" ] ,
	enem_df[ , "administrative_category" ] ,
	sum ,
	na.rm = TRUE
)
Calculate the median (50th percentile) of a linear variable, overall and by groups:
quantile( enem_df[ , "nu_nota_mt" ] , 0.5 , na.rm = TRUE )
tapply(
	enem_df[ , "nu_nota_mt" ] ,
	enem_df[ , "administrative_category" ] ,
	quantile ,
	0.5 ,
	na.rm = TRUE
)
Subsetting
Limit your data.frame to students whose mother graduated from high school:
sub_enem_df <- subset( enem_df , q002 %in% c( 'E' , 'F' , 'G' ) )
Calculate the mean (average) of this subset:
mean( sub_enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )
Measures of Uncertainty
Calculate the variance, overall and by groups:
var( enem_df[ , "nu_nota_mt" ] , na.rm = TRUE )
tapply(
	enem_df[ , "nu_nota_mt" ] ,
	enem_df[ , "administrative_category" ] ,
	var ,
	na.rm = TRUE
)
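A variance can be turned into a naive standard error of the mean by dividing by the number of non-missing values and taking the square root. The sketch below does this on made-up scores (toy_scores is hypothetical) and assumes independent observations. It is an arithmetic illustration only, not a statement about how uncertainty should be reported for this microdata.

```r
# hypothetical scores, including one missing value
toy_scores <- c( 450 , 520 , 610 , NA , 700 )

# count only the non-missing observations
n_valid <- sum( !is.na( toy_scores ) )

# variance of the non-missing values
toy_var <- var( toy_scores , na.rm = TRUE )

# naive standard error of the mean: sqrt( variance / n )
toy_se <- sqrt( toy_var / n_valid )

toy_se
```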
Regression Models and Tests of Association
Perform a t-test:
t.test( nu_nota_mt ~ domestic_worker , enem_df )
Perform a chi-squared test of association:
this_table <- table( enem_df[ , c( "domestic_worker" , "state_name" ) ] )
chisq.test( this_table )
Perform a generalized linear model:
glm_result <-
	glm(
		nu_nota_mt ~ domestic_worker + state_name ,
		data = enem_df
	)
summary( glm_result )
Replication Example
This example matches the registration counts in the Sinopse ENEM 2022 Excel table:
stopifnot( nrow( enem_df ) == 3476105 )
Analysis Examples with dplyr
The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:
library(dplyr)
enem_tbl <- as_tibble( enem_df )
Calculate the mean (average) of a linear variable, overall and by groups:
enem_tbl %>%
	summarize( mean = mean( nu_nota_mt , na.rm = TRUE ) )

enem_tbl %>%
	group_by( administrative_category ) %>%
	summarize( mean = mean( nu_nota_mt , na.rm = TRUE ) )
Analysis Examples with data.table
The R data.table library provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. data.table offers concise syntax that is fast to type and fast to read, along with speed, memory efficiency, careful API lifecycle management, an active community, and a rich set of features. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:
library(data.table)
enem_dt <- data.table( enem_df )
Calculate the mean (average) of a linear variable, overall and by groups:
enem_dt[ , mean( nu_nota_mt , na.rm = TRUE ) ]

enem_dt[ , mean( nu_nota_mt , na.rm = TRUE ) , by = administrative_category ]
Analysis Examples with duckdb
The R duckdb library provides an embedded analytical data management system with support for the Structured Query Language (SQL). duckdb offers a simple, feature-rich, fast, and free SQL OLAP management system. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
dbWriteTable( con , 'enem' , enem_df )
Calculate the mean (average) of a linear variable, overall and by groups:
dbGetQuery( con , 'SELECT AVG( nu_nota_mt ) FROM enem' )
dbGetQuery(
	con ,
	'SELECT
		administrative_category ,
		AVG( nu_nota_mt )
	FROM
		enem
	GROUP BY
		administrative_category'
)