Censo Escolar (CENSOESCOLAR)

Build Status Build status

Contributed by Guilherme Jacob <guilhermejacob91@gmail.com>

The Brazilian School Census (Censo Escolar) is a massive source of information about basic education. Synthetic tables can be produced using the InepData interface.

Simplified Download and Importation

The R lodown package easily downloads and imports all available CENSOESCOLAR microdata by simply specifying "censoescolar" with an output_dir = parameter in the lodown() function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

library(lodown)
lodown( "censoescolar" , output_dir = file.path( path.expand( "~" ) , "CENSOESCOLAR" ) )

Analysis Examples with SQL and RSQLite  

Connect to a database:

library(DBI)
dbdir <- file.path( path.expand( "~" ) , "CENSOESCOLAR" , "SQLite.db" )
db <- dbConnect( RSQLite::SQLite() , dbdir )

Variable Recoding

Add new columns to the data set:

dbSendQuery( db , "ALTER TABLE matricula_2015 ADD COLUMN id_indigenous_area INTEGER" )

dbSendQuery( db ,
    "UPDATE matricula_2015
    SET id_indigenous_area =
        CASE WHEN ( tp_localizacao_diferenciada IN (2,5) ) THEN 1 ELSE 0 END"
)

dbSendQuery( db , "ALTER TABLE matricula_2015 ADD COLUMN id_public INTEGER" )

dbSendQuery( db ,
    "UPDATE matricula_2015
    SET id_public =
        CASE WHEN ( tp_dependencia <> 4 ) THEN 1 ELSE 0 END"
)

Unweighted Counts

Count the unweighted number of records in the SQL table, overall and by groups:

dbGetQuery( db , "SELECT COUNT(*) FROM matricula_2015" )

dbGetQuery( db ,
    "SELECT
        tp_localizacao ,
        COUNT(*) 
    FROM matricula_2015
    GROUP BY tp_localizacao"
)

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT AVG( nu_idade ) FROM matricula_2015" )

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        AVG( nu_idade ) AS mean_nu_idade
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Calculate the distribution of a categorical variable:

dbGetQuery( db , 
    "SELECT 
        tp_categoria_escola_privada , 
        COUNT(*) / ( SELECT COUNT(*) FROM matricula_2015 ) 
            AS share_tp_categoria_escola_privada
    FROM matricula_2015 
    GROUP BY tp_categoria_escola_privada" 
)

Calculate the sum of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT SUM( nu_idade ) FROM matricula_2015" )

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        SUM( nu_idade ) AS sum_nu_idade 
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Calculate the 25th, median, and 75th percentiles of a linear variable, overall and by groups:

RSQLite::initExtension( db )

dbGetQuery( db , 
    "SELECT 
        LOWER_QUARTILE( nu_idade ) , 
        MEDIAN( nu_idade ) , 
        UPPER_QUARTILE( nu_idade ) 
    FROM matricula_2015" 
)

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        LOWER_QUARTILE( nu_idade ) AS lower_quartile_nu_idade , 
        MEDIAN( nu_idade ) AS median_nu_idade , 
        UPPER_QUARTILE( nu_idade ) AS upper_quartile_nu_idade
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Subsetting

Limit your SQL analysis to students enrolled in public schools with WHERE:

dbGetQuery( db ,
    "SELECT
        AVG( nu_idade )
    FROM matricula_2015
    WHERE id_public = 1"
)

Measures of Uncertainty

Calculate the variance and standard deviation, overall and by groups:

RSQLite::initExtension( db )

dbGetQuery( db , 
    "SELECT 
        VARIANCE( nu_idade ) , 
        STDEV( nu_idade ) 
    FROM matricula_2015" 
)

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        VARIANCE( nu_idade ) AS var_nu_idade ,
        STDEV( nu_idade ) AS stddev_nu_idade
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Regression Models and Tests of Association

Perform a t-test:

censoescolar_slim_df <- 
    dbGetQuery( db , 
        "SELECT 
            nu_idade , 
            id_public ,
            tp_categoria_escola_privada
        FROM matricula_2015" 
    )

t.test( nu_idade ~ id_public , censoescolar_slim_df )

Perform a chi-squared test of association:

this_table <-
    table( censoescolar_slim_df[ , c( "id_public" , "tp_categoria_escola_privada" ) ] )

chisq.test( this_table )

Perform a generalized linear model:

glm_result <- 
    glm( 
        nu_idade ~ id_public + tp_categoria_escola_privada , 
        data = censoescolar_slim_df
    )

summary( glm_result )

Analysis Examples with dplyr  

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for CENSOESCOLAR users, this code replicates previously-presented examples:

library(dplyr)
library(dbplyr)
dplyr_db <- dplyr::src_sqlite( dbdir )
censoescolar_tbl <- tbl( dplyr_db , 'matricula_2015' )

Calculate the mean (average) of a linear variable, overall and by groups:

censoescolar_tbl %>%
    summarize( mean = mean( nu_idade ) )

censoescolar_tbl %>%
    group_by( tp_localizacao ) %>%
    summarize( mean = mean( nu_idade ) )

Replication Example

This snippet replicates the first three rows of total enrollment in basic education, as in the Table 1.1 of this spreadsheet.

# first row:
dbGetQuery( db ,"SELECT COUNT(*) AS n_mat_tot
 FROM matricula_2015
 WHERE TP_TIPO_TURMA NOT IN (4,5) " )

# second row:
dbGetQuery( db ,"SELECT COUNT(*) AS n_mat_tot
 FROM matricula_2015
 WHERE TP_TIPO_TURMA NOT IN (4,5) AND CO_REGIAO = 1" )

# third row:
dbGetQuery( db ,"SELECT COUNT(*) AS n_mat_tot
 FROM matricula_2015
 WHERE TP_TIPO_TURMA NOT IN (4,5) AND CO_UF = 11 " )