Censo Escolar (CENSO_ESCOLAR)

Build Status Build status

Contributed by Guilherme Jacob <guilhermejacob91@gmail.com>

The Brazilian School Census (Censo Escolar) is a massive source of information about basic education. Synthetic tables can be produced using the InepData interface.

Simplified Download and Importation

The R lodown package easily downloads and imports all available CENSO_ESCOLAR microdata by simply specifying "censo_escolar" with an output_dir = parameter in the lodown() function. Depending on your internet connection and computer processing speed, you might prefer to run this step overnight.

library(lodown)
lodown( "censo_escolar" , output_dir = file.path( path.expand( "~" ) , "CENSO_ESCOLAR" ) )

0.10 Analysis Examples with SQL and MonetDBLite

Connect to a database:

library(DBI)
dbdir <- file.path( path.expand( "~" ) , "CENSO_ESCOLAR" , "MonetDB" )
db <- dbConnect( MonetDBLite::MonetDBLite() , dbdir )

Variable Recoding

Add new columns to the data set:

dbSendQuery( db , "ALTER TABLE matricula_2015 ADD COLUMN id_indigenous_area INTEGER" )

dbSendQuery( db ,
    "UPDATE matricula_2015
    SET id_indigenous_area =
        CASE WHEN ( tp_localizacao_diferenciada IN (2,5) ) THEN 1 ELSE 0 END"
)

dbSendQuery( db , "ALTER TABLE matricula_2015 ADD COLUMN id_public INTEGER" )

dbSendQuery( db ,
    "UPDATE matricula_2015
    SET id_public =
        CASE WHEN ( tp_dependencia <> 4 ) THEN 1 ELSE 0 END"
)

Unweighted Counts

Count the unweighted number of records in the SQL table, overall and by groups:

dbGetQuery( db , "SELECT COUNT(*) FROM matricula_2015" )

dbGetQuery( db ,
    "SELECT
        tp_localizacao ,
        COUNT(*) 
    FROM matricula_2015
    GROUP BY tp_localizacao"
)

Descriptive Statistics

Calculate the mean (average) of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT AVG( nu_idade ) FROM matricula_2015" )

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        AVG( nu_idade ) AS mean_nu_idade
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Initiate a function that allows division by zero:

dbSendQuery( db , 
    "CREATE FUNCTION 
        div_noerror(l DOUBLE, r DOUBLE) 
    RETURNS DOUBLE 
    EXTERNAL NAME calc.div_noerror" 
)

Calculate the distribution of a categorical variable:

dbGetQuery( db , 
    "SELECT 
        tp_categoria_escola_privada , 
        div_noerror( 
            COUNT(*) , 
            ( SELECT COUNT(*) FROM matricula_2015 ) 
        ) AS share_tp_categoria_escola_privada
    FROM matricula_2015 
    GROUP BY tp_categoria_escola_privada" 
)

Calculate the sum of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT SUM( nu_idade ) FROM matricula_2015" )

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        SUM( nu_idade ) AS sum_nu_idade 
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Calculate the median (50th percentile) of a linear variable, overall and by groups:

dbGetQuery( db , "SELECT QUANTILE( nu_idade , 0.5 ) FROM matricula_2015" )

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        QUANTILE( nu_idade , 0.5 ) AS median_nu_idade
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Subsetting

Limit your SQL analysis to students enrolled in public schools with WHERE:

dbGetQuery( db ,
    "SELECT
        AVG( nu_idade )
    FROM matricula_2015
    WHERE id_public = 1"
)

Measures of Uncertainty

Calculate the variance and standard deviation, overall and by groups:

dbGetQuery( db , 
    "SELECT 
        VAR_SAMP( nu_idade ) , 
        STDDEV_SAMP( nu_idade ) 
    FROM matricula_2015" 
)

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        VAR_SAMP( nu_idade ) AS var_nu_idade ,
        STDDEV_SAMP( nu_idade ) AS stddev_nu_idade
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

Regression Models and Tests of Association

Calculate the correlation between two variables, overall and by groups:

dbGetQuery( db , 
    "SELECT 
        CORR( CAST( id_public AS DOUBLE ) , CAST( nu_idade AS DOUBLE ) )
    FROM matricula_2015" 
)

dbGetQuery( db , 
    "SELECT 
        tp_localizacao , 
        CORR( CAST( id_public AS DOUBLE ) , CAST( nu_idade AS DOUBLE ) )
    FROM matricula_2015 
    GROUP BY tp_localizacao" 
)

0.11 Analysis Examples with dplyr

The R dplyr library offers an alternative grammar of data manipulation to base R and SQL syntax. dplyr offers many verbs, such as summarize, group_by, and mutate, the convenience of pipe-able functions, and the tidyverse style of non-standard evaluation. This vignette details the available features. As a starting point for CENSO_ESCOLAR users, this code replicates previously-presented examples:

library(dplyr)
dplyr_db <- MonetDBLite::src_monetdblite( dbdir )
censo_escolar_tbl <- tbl( dplyr_db , 'matricula_2015' )

Calculate the mean (average) of a linear variable, overall and by groups:

censo_escolar_tbl %>%
    summarize( mean = mean( nu_idade ) )

censo_escolar_tbl %>%
    group_by( tp_localizacao ) %>%
    summarize( mean = mean( nu_idade ) )

Replication Example

This snippet replicates the first three rows of total enrollment in basic education, as in the Table 1.1 of this spreadsheet.

# first row:
dbGetQuery( db ,"SELECT COUNT(*) AS n_mat_tot
 FROM matricula_2015
 WHERE TP_TIPO_TURMA NOT IN (4,5) " )

# second row:
dbGetQuery( db ,"SELECT COUNT(*) AS n_mat_tot
 FROM matricula_2015
 WHERE TP_TIPO_TURMA NOT IN (4,5) AND CO_REGIAO = 1" )

# third row:
dbGetQuery( db ,"SELECT COUNT(*) AS n_mat_tot
 FROM matricula_2015
 WHERE TP_TIPO_TURMA NOT IN (4,5) AND CO_UF = 11 " )

Database Shutdown

dbDisconnect( db , shutdown = TRUE )