Exame Nacional do Ensino Medio (ENEM)

Build Status Build status

Contributed by Dr. Djalma Pessoa <pessoad@gmail.com>

The Exame Nacional do Ensino Medio (ENEM) contains the standardized test results of most Brazilian high school students.

Simplified Download and Importation

The R lodown package easily downloads and imports all available ENEM microdata by simply specifying "enem" with an output_dir = parameter in the lodown() function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

library(lodown)
lodown( "enem" , output_dir = file.path( path.expand( "~" ) , "ENEM" ) )

lodown also provides a catalog of available microdata extracts with the get_catalog() function. After requesting the ENEM catalog, you could pass a subsetted catalog through the lodown() function in order to download and import specific extracts (rather than all available extracts).

library(lodown)
# examine all available ENEM microdata files
enem_cat <-
    get_catalog( "enem" ,
        output_dir = file.path( path.expand( "~" ) , "ENEM" ) )

# 2015 only
enem_cat <- subset( enem_cat , year == 2015 )
# download the microdata to your local computer
lodown( "enem" , enem_cat )

Analysis Examples with SQL and RSQLite  

Connect to a database:

library(DBI)
dbdir <- file.path( path.expand( "~" ) , "ENEM" , "SQLite.db" )
db <- dbConnect( RSQLite::SQLite() , dbdir )

Variable Recoding

Add new columns to the data set:

dbSendQuery( db , "ALTER TABLE microdados_enem_2015 ADD COLUMN female INTEGER" )

dbSendQuery( db , 
    "UPDATE microdados_enem_2015 
    SET female = 
        CASE WHEN tp_sexo = 2 THEN 1 ELSE 0 END" 
)

dbSendQuery( db , "ALTER TABLE microdados_enem_2015 ADD COLUMN fathers_education INTEGER" )

dbSendQuery( db , 
    "UPDATE microdados_enem_2015 
    SET fathers_education = 
        CASE WHEN q001 = 1 THEN '01 - nao estudou'
            WHEN q001 = 2 THEN '02 - 1 a 4 serie'
            WHEN q001 = 3 THEN '03 - 5 a 8 serie'
            WHEN q001 = 4 THEN '04 - ensino medio incompleto'
            WHEN q001 = 5 THEN '05 - ensino medio'
            WHEN q001 = 6 THEN '06 - ensino superior incompleto'
            WHEN q001 = 7 THEN '07 - ensino superior'
            WHEN q001 = 8 THEN '08 - pos-graduacao'
            WHEN q001 = 9 THEN '09 - nao estudou' ELSE NULL END" 
)

Unweighted Counts

Count the unweighted number of records in the SQL table, overall and by groups:

dbGetQuery( db , "SELECT COUNT(*) FROM microdados_enem_2015" )

dbGetQuery( db ,
    "SELECT
        fathers_education ,
        COUNT(*) 
    FROM microdados_enem_2015
    GROUP BY fathers_education"
)

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT AVG( nota_mt ) FROM microdados_enem_2015" )

dbGetQuery( db , 
    "SELECT 
        fathers_education , 
        AVG( nota_mt ) AS mean_nota_mt
    FROM microdados_enem_2015 
    GROUP BY fathers_education" 
)

Calculate the distribution of a categorical variable:

dbGetQuery( db , 
    "SELECT 
        uf_residencia , 
        COUNT(*) / ( SELECT COUNT(*) FROM microdados_enem_2015 ) 
            AS share_uf_residencia
    FROM microdados_enem_2015 
    GROUP BY uf_residencia" 
)

Calculate the sum of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT SUM( nota_mt ) FROM microdados_enem_2015" )

dbGetQuery( db , 
    "SELECT 
        fathers_education , 
        SUM( nota_mt ) AS sum_nota_mt 
    FROM microdados_enem_2015 
    GROUP BY fathers_education" 
)

Calculate the 25th, median, and 75th percentiles of a linear variable, overall and by groups:

RSQLite::initExtension( db )

dbGetQuery( db , 
    "SELECT 
        LOWER_QUARTILE( nota_mt ) , 
        MEDIAN( nota_mt ) , 
        UPPER_QUARTILE( nota_mt ) 
    FROM microdados_enem_2015" 
)

dbGetQuery( db , 
    "SELECT 
        fathers_education , 
        LOWER_QUARTILE( nota_mt ) AS lower_quartile_nota_mt , 
        MEDIAN( nota_mt ) AS median_nota_mt , 
        UPPER_QUARTILE( nota_mt ) AS upper_quartile_nota_mt
    FROM microdados_enem_2015 
    GROUP BY fathers_education" 
)

Subsetting

Limit your SQL analysis to took mathematics exam with WHERE:

dbGetQuery( db ,
    "SELECT
        AVG( nota_mt )
    FROM microdados_enem_2015
    WHERE in_presenca_mt = 1"
)

Measures of Uncertainty

Calculate the variance and standard deviation, overall and by groups:

RSQLite::initExtension( db )

dbGetQuery( db , 
    "SELECT 
        VARIANCE( nota_mt ) , 
        STDEV( nota_mt ) 
    FROM microdados_enem_2015" 
)

dbGetQuery( db , 
    "SELECT 
        fathers_education , 
        VARIANCE( nota_mt ) AS var_nota_mt ,
        STDEV( nota_mt ) AS stddev_nota_mt
    FROM microdados_enem_2015 
    GROUP BY fathers_education" 
)

Regression Models and Tests of Association

Perform a t-test:

enem_slim_df <- 
    dbGetQuery( db , 
        "SELECT 
            nota_mt , 
            female ,
            uf_residencia
        FROM microdados_enem_2015" 
    )

t.test( nota_mt ~ female , enem_slim_df )

Perform a chi-squared test of association:

this_table <-
    table( enem_slim_df[ , c( "female" , "uf_residencia" ) ] )

chisq.test( this_table )

Perform a generalized linear model:

glm_result <- 
    glm( 
        nota_mt ~ female + uf_residencia , 
        data = enem_slim_df
    )

summary( glm_result )

Analysis Examples with dplyr  

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for ENEM users, this code replicates previously-presented examples:

library(dplyr)
library(dbplyr)
dplyr_db <- dplyr::src_sqlite( dbdir )
enem_tbl <- tbl( dplyr_db , 'microdados_enem_2015' )

Calculate the mean (average) of a linear variable, overall and by groups:

enem_tbl %>%
    summarize( mean = mean( nota_mt ) )

enem_tbl %>%
    group_by( fathers_education ) %>%
    summarize( mean = mean( nota_mt ) )

Replication Example

dbGetQuery( db , "SELECT COUNT(*) FROM microdados_enem_2015" )