Scraping Brazilian GDP data from the SIDRA-IBGE website

Paulo Ferreira Naibert https://github.com/pfnaibert/
2020-08-24

Last Updated 2020-09-03

In this post, we will scrape the Brazilian GDP data from the SIDRA-IBGE website. I use my own functions (based on other people's code) to download the SIDRA data, but be aware that there is a CRAN package for that here. The source code for my functions can be found here.

Also, there is another post here that does many of the things we will be doing.

Ok, disclaimers are out of the way. First, let’s load up our functions and other libraries.


# source functions
source("./myfuns.R")

# load libraries
library(dplyr)
library(reshape2)
library(lubridate)
library(dygraphs)
library(ggplot2)
library(ggthemes)
library(knitr)      # kable()
library(rmarkdown)  # paged_table()

The gdp.dl.csv() function downloads the data from the website and saves it as a .csv file. The gdp.save.levels() and gdp.save.rets() functions load that .csv file (using the same paths as gdp.dl.csv()), transform the data with the gdp.csv2df() function, and save the resulting data.frame in the .rds format. I separated the functions that download and transform the data to avoid connecting to the internet every time I wanted to edit the .csv file. Also, the df2ts() function transforms a data.frame into a ts object. We need both forms because paged_table() and ggplot() do not accept ts objects, whereas dygraphs only accepts time series objects.


# download gdp
gdp.dl.csv()

# transform and save data
gdp.save.levels()
gdp.save.rets()
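
The helper functions live in myfuns.R and are not reproduced in this post. To give a rough idea of what the data.frame-to-ts step involves, here is a minimal sketch. The name df2ts_sketch(), its date.col argument, and the toy data are assumptions made for illustration; the real df2ts() may be written differently. The sketch also assumes the rows are sorted by date, with one row per quarter and no gaps.

```r
# Hypothetical stand-in for df2ts(); not the function from myfuns.R.
df2ts_sketch <- function(df, date.col = "date") {
  first <- as.POSIXlt(min(df[[date.col]]))
  start <- c(first$year + 1900, first$mon %/% 3 + 1)   # (year, quarter)
  num   <- df[, sapply(df, is.numeric), drop = FALSE]  # a ts can only hold numbers
  ts(as.matrix(num), start = start, frequency = 4)
}

# Toy quarterly data (made-up values)
toy <- data.frame(
  date = as.Date(c("2019-03-01", "2019-06-01", "2019-09-01", "2019-12-01")),
  GDP  = c(100, 101, 102, 103)
)
toy.ts <- df2ts_sketch(toy)
start(toy.ts)      # first period as (year, quarter)
frequency(toy.ts)  # 4 observations per year
```

Non-numeric columns (such as the date itself) are dropped, because the time index of a ts object is stored in its start and frequency attributes rather than in a column.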

Now, let’s load our data.


# LOAD DATA

# levels
gdp.nominal  <- readRDS("../../data/gdp-nominal.rds")
gdp.real.NSA <- readRDS("../../data/gdp-real-NSA.rds")
gdp.real.SA  <- readRDS("../../data/gdp-real-SA.rds")

# rets
gdp.ret4    <- readRDS("../../data/gdp-ret4.rds")
gdp.ret1    <- readRDS("../../data/gdp-ret1.rds")
gdp.retsum4 <- readRDS("../../data/gdp-retsum4.rds")
gdp.retyear <- readRDS("../../data/gdp-retyear.rds")

# data as ts (for dygraphs)
gdp.nominal.ts  <- df2ts(gdp.nominal)
gdp.real.NSA.ts <- df2ts(gdp.real.NSA)
gdp.real.SA.ts  <- df2ts(gdp.real.SA)
gdp.ret4.ts     <- df2ts(gdp.ret4)
gdp.ret1.ts     <- df2ts(gdp.ret1)
gdp.retsum4.ts  <- df2ts(gdp.retsum4)
gdp.retyear.ts  <- df2ts(gdp.retyear)

# only GDPs
gdp.df <- data.frame( gdp.nominal[, c(1,2,3)],  "nominal"=gdp.nominal[, "GDP"], "real.NSA"=gdp.real.NSA[, "GDP"], "real.SA"=gdp.real.SA[, "GDP"],
                     "ret1"=gdp.ret1[, "GDP"], "ret4"=gdp.ret4[, "GDP"], "retsum4"=gdp.retsum4[, "GDP"] )
gdp.df$sum4.nomi    <- sum4(gdp.df$nominal)
gdp.df$sum4.real    <- sum4(gdp.df$real.NSA)
gdp.ts <- df2ts(gdp.df)

# show gdp.df
paged_table(gdp.df)

Cool, the data is loaded. Let’s make a table with the annual GDP data starting in 2014.


tmp <- filter(gdp.df, date > "2014-01-01" & month(date) == "12" )
mat <- cbind( "Billions of current BRL"=tmp$sum4.nomi/1000, "Billions of 1995 BRL"=tmp$sum4.real/1000, "AC 4Q T/T-4"=tmp$retsum4)
rownames(mat) <- substring( date2qtr(tmp$date), 1, 4)
kable( t( apply( mat, 2, rev) ), digits=2, caption = "GDP by Year")
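
Table 1: GDP by Year

|                         |    2019 |    2018 |    2017 |    2016 |    2015 |    2014 |
|:------------------------|--------:|--------:|--------:|--------:|--------:|--------:|
| Billions of current BRL | 7256.93 | 6889.18 | 6583.32 | 6269.33 | 5995.79 | 5778.95 |
| Billions of 1995 BRL    | 1197.66 | 1184.20 | 1168.80 | 1153.54 | 1192.61 | 1236.45 |
| AC 4Q T/T-4             |    1.10 |    1.30 |    1.30 |   -3.30 |   -3.50 |    0.50 |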
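
As a sanity check, the AC 4Q T/T-4 row of Table 1 can be reproduced (up to rounding) from the Billions of 1995 BRL row: in the fourth quarter, the four-quarter accumulated rate is simply the year-over-year change of annual real GDP. The sketch below types the rounded table values in by hand, so it only illustrates the arithmetic. It is not how the rates in this post were obtained (those are read from the saved .rds files).

```r
# Annual real GDP, billions of 1995 BRL, copied from Table 1
real <- c("2014" = 1236.45, "2015" = 1192.61, "2016" = 1153.54,
          "2017" = 1168.80, "2018" = 1184.20, "2019" = 1197.66)

# Percent change against the previous year
growth <- 100 * (real[-1] / real[-length(real)] - 1)
round(growth, 1)
# 2015: -3.5, 2016: -3.3, 2017: 1.3, 2018: 1.3, 2019: 1.1
```

The 2014 rate (0.50) cannot be checked this way because the table does not show the 2013 level.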

Now, let’s make a table with the quarterly GDP data starting in 2019.


tmp <- filter(gdp.df, date > "2019-01-01" )
mat <- cbind( "Billions of current BRL"=tmp$nominal/1000, "T/T-1"=tmp$ret1, "T/T-4"=tmp$ret4, "AC 4Q T/T-4"=tmp$retsum4)
rownames(mat) <- date2qtr(tmp$date)
kable( t( apply( mat, 2, rev) ), digits=2, caption = "GDP by Quarter")
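
Table 2: GDP by Quarter

|                         | 2020:Q2 | 2020:Q1 | 2019:Q4 | 2019:Q3 | 2019:Q2 | 2019:Q1 |
|:------------------------|--------:|--------:|--------:|--------:|--------:|--------:|
| Billions of current BRL | 1652.95 | 1803.42 | 1892.74 |  1842.7 | 1795.81 | 1725.68 |
| T/T-1                   |   -9.70 |   -2.50 |    0.50 |     0.1 |    0.50 |    0.60 |
| T/T-4                   |  -11.40 |   -0.30 |    1.70 |     1.2 |    1.10 |    0.60 |
| AC 4Q T/T-4             |   -2.20 |    0.90 |    1.10 |     1.0 |    1.10 |    1.10 |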
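
The two tables are linked by the sum4() columns: the annual table takes the December value of a four-quarter rolling sum. The sum4() function itself is in myfuns.R and is not shown here, so the sketch below is only an assumption about what such a function does, with a hypothetical name (sum4_sketch). It uses the nominal values from Table 2 and recovers the 2019 figure from Table 1.

```r
# Rolling sum of the current and the three previous observations.
# stats::filter is spelled out because dplyr masks filter().
sum4_sketch <- function(x) as.numeric(stats::filter(x, rep(1, 4), sides = 1))

# Nominal GDP, billions of current BRL, copied from Table 2
nominal <- c("2019:Q1" = 1725.68, "2019:Q2" = 1795.81, "2019:Q3" = 1842.70,
             "2019:Q4" = 1892.74, "2020:Q1" = 1803.42, "2020:Q2" = 1652.95)

acc <- sum4_sketch(nominal)
names(acc) <- names(nominal)
acc["2019:Q4"]  # 7256.93, the 2019 value in Table 1
```

The first three entries are NA because fewer than four quarters are available at those points.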

With the tables ready, let’s graph the GDP percent change with ggplot2 for better visualization:


gdp.df %>%
  filter(date > "2014-01-01") %>%
  select(date, ret1, ret4, retsum4) %>%
  melt("date") %>%
  ggplot( aes( x = date, y = value, fill = variable) ) +
  geom_bar( stat="identity", position="dodge" ) +
  scale_fill_discrete(name=" ",
                      breaks=c("ret1", "ret4", "retsum4"),
                      labels=c("T/T-1", "T/T-4", "AC 4Q T/T-4") ) +
  xlab("") + ylab("") +
  ggtitle( "Brazilian GDP %Change" ) +
  # complete theme first; applied last, it would override the title tweak below
  theme_economist() +
  theme( plot.title = element_text(hjust = 0.5, face = "bold" ) )

And an interactive version with dygraphs:


tmp <- window(gdp.ts, start=c(2014,1) )
dygraph( tmp[, c("ret1", "ret4", "retsum4") ], main="Brazilian GDP %Change" ) %>%
dySeries("ret1", label = "T/T-1") %>%
dySeries("ret4", label = "T/T-4") %>%
dySeries("retsum4", label = "AC 4Q T/T-4") %>%
dyAxis("x", rangePad = 20, drawGrid = FALSE) %>%
dyBarChart()

Now, let’s graph the level of the GDP using the Index of chained 1995 BRL, seasonally adjusted and not seasonally adjusted. In ggplot2:


den.SA  <- mean(gdp.df$real.SA[year(gdp.df$date)=="2014"] )
den.NSA <- mean(gdp.df$real.NSA[year(gdp.df$date)=="2014"] )

gdp.df %>%
  select(date, real.SA, real.NSA) %>%
  mutate(real.SA = 100*real.SA/den.SA, real.NSA = 100*real.NSA/den.NSA ) %>%
  melt(id = "date") %>%
  ggplot( aes(x = date, y = value) ) +
  ggtitle( "GDP Index of 1995 Chained BRL (2014=100)" ) +
  geom_line(aes(color = variable), size = 1) +
  scale_color_manual(name = "", values = c("blue", "red"), labels=c("Seasonal Adj.", "Not Seasonal Adj." ) ) +
  xlab("") + ylab("") +
  # complete theme first; applied last, it would override the title tweak below
  theme_economist() +
  theme( plot.title = element_text(hjust = 0.5, face = "bold" ) )

And with dygraphs:


tmp <- cbind( "real.SA"=normalize.yr( gdp.ts[, "real.SA"], 2014), "real.NSA"=normalize.yr(gdp.ts[, "real.NSA"], 2014 ) )
dygraph(tmp, main="GDP Index of 1995 Chained BRL (2014=100)" ) %>%
dyAxis("x", rangePad = 10, drawGrid = FALSE) %>%
dyEvent("2003-01-01", "Lula",  labelLoc = "bottom") %>%   # took office on 2003-01-01
dyEvent("2011-01-01", "Dilma", labelLoc = "bottom") %>%   # took office on 2011-01-01
dyEvent("2016-09-1", "Temer", labelLoc = "bottom") %>%
dyEvent("2019-01-1", "Bolsonaro", labelLoc = "bottom") %>%
dyEvent("2008-10-1", "Subprime", labelLoc = "bottom", color="red") %>%
dyEvent("2020-03-01", "Corona", labelLoc = "bottom", color="red") %>%
dyLimit( max(tmp), "Peak", labelLoc = "left", color="blue" ) %>%
dySeries("real.NSA", label = "NSA") %>%
dySeries("real.SA", label = "SA")
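
The normalize.yr() function used above comes from myfuns.R and is not shown in this post. Judging by the ggplot2 version, which divides each series by its 2014 mean, a function of this kind could look like the sketch below. The name normalize_yr_sketch() and the toy series are assumptions for illustration only.

```r
# Rebase a quarterly ts so that the average of year `yr` equals 100.
# Hypothetical stand-in for normalize.yr(); the real one may differ.
normalize_yr_sketch <- function(x, yr) {
  base <- mean(window(x, start = c(yr, 1), end = c(yr, 4)))
  100 * x / base
}

# Toy quarterly series (made-up values) covering 2013 and 2014
toy <- ts(c(98, 100, 102, 104, 106, 108, 110, 112),
          start = c(2013, 1), frequency = 4)
idx <- normalize_yr_sketch(toy, 2014)

# By construction, the four quarters of the base year average to 100
mean(window(idx, start = c(2014, 1), end = c(2014, 4)))
```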

We could (which does not mean that we should) also normalize by a specific quarter.


den.SA  <- gdp.df$real.SA[gdp.df$date=="2013-09-30"]
den.NSA <- gdp.df$real.NSA[gdp.df$date=="2013-09-30"]

gdp.df %>%
  select(date, real.SA, real.NSA) %>%
  mutate(real.SA = 100*real.SA/den.SA, real.NSA = 100*real.NSA/den.NSA ) %>%
  melt(id = "date") %>%
  ggplot( aes(x = date, y = value) ) +
  ggtitle( "GDP Index of 1995 Chained BRL (2013:Q3=100)" ) +
  geom_line(aes(color = variable), size = 1) +
  scale_color_manual(name = "", values = c("blue", "red"), labels=c("Seasonal Adj.", "Not Seasonal Adj." ) ) +
  xlab("") + ylab("") +
  # complete theme first; applied last, it would override the title tweak below
  theme_economist() +
  theme( plot.title = element_text(hjust = 0.5, face = "bold" ) )


tmp <- cbind( "real.SA"=normalize( gdp.ts[, "real.SA"], c(2013, 3)), "real.NSA"=normalize(gdp.ts[, "real.NSA"], c(2013,3) ) )
dygraph(tmp, main="GDP Index of 1995 Chained BRL, (2013:Q3=100)" ) %>%
dyAxis("x", rangePad = 10, drawGrid = FALSE) %>%
dyEvent("2003-01-01", "Lula",  labelLoc = "bottom") %>%   # took office on 2003-01-01
dyEvent("2011-01-01", "Dilma", labelLoc = "bottom") %>%   # took office on 2011-01-01
dyEvent("2016-09-1", "Temer", labelLoc = "bottom") %>%
dyEvent("2019-01-1", "Bolsonaro", labelLoc = "bottom") %>%
dyEvent("2008-10-1", "Subprime", labelLoc = "bottom", color="red") %>%
dyEvent("2020-03-01", "Corona", labelLoc = "bottom", color="red") %>%
dyLimit( max(tmp), "Peak", labelLoc = "left", color="blue" ) %>%
dySeries("real.NSA", label = "NSA") %>%
dySeries("real.SA", label = "SA")
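
For the single-quarter case, the base is just one observation of the series. The sketch below is a guess at what a normalize()-style helper might do; the name normalize_qtr_sketch() and the toy values are hypothetical, and the real function in myfuns.R may differ.

```r
# Rebase a quarterly ts so that the quarter yq = c(year, quarter) equals 100.
normalize_qtr_sketch <- function(x, yq) {
  base <- as.numeric(window(x, start = yq, end = yq))
  100 * x / base
}

# Toy quarterly series (made-up values)
toy <- ts(c(98, 100, 102, 104, 106, 108, 110, 112),
          start = c(2013, 1), frequency = 4)
idx <- normalize_qtr_sketch(toy, c(2013, 3))
window(idx, start = c(2013, 3), end = c(2013, 3))  # the base quarter is 100
```

One likely reason for caution: with data that is not seasonally adjusted, a single quarter carries its own seasonal effect, so the level of the whole index depends on which quarter is picked as the base.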

So that is our little report on the Brazilian GDP series.

Extensions

In future posts, I intend to break the GDP down by sectors and components.

Automation and cron jobs

Also, we can put the gdp.dl.csv(), gdp.save.levels() and gdp.save.rets() calls in a separate executable script and run it with cron jobs (more details here) to automate the scraping part. The rest of the script is completely reusable to automate the reporting.
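
A cron-driven script could be as small as the three function calls, but it helps to stop at the first failure so that the transform steps never run on a missing or stale .csv file. The sketch below is one possible way to do that. The run_steps() helper, the script name update-gdp.R, and the paths in the crontab line are all hypothetical.

```r
# Run named steps in order and stop at the first one that fails.
run_steps <- function(steps) {
  for (nm in names(steps)) {
    ok <- tryCatch({
      steps[[nm]]()
      TRUE
    }, error = function(e) {
      message(format(Sys.time()), " step '", nm, "' failed: ", conditionMessage(e))
      FALSE
    })
    if (!ok) return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Demo with placeholder steps: the second one fails, so the third never runs
done <- character(0)
demo <- run_steps(list(
  download  = function() done <<- c(done, "download"),
  transform = function() stop("no file"),
  report    = function() done <<- c(done, "report")
))

# In the real script (hypothetical name update-gdp.R), after source("./myfuns.R"):
#   run_steps(list(download = gdp.dl.csv,
#                  levels   = gdp.save.levels,
#                  rets     = gdp.save.rets))
# Example crontab entry (placeholder paths), every day at 10:00:
#   0 10 * * * cd /path/to/project && Rscript update-gdp.R >> update-gdp.log 2>&1
```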

Future dates

Finally, here we can find out when the next releases will be published:

| Reference Date | Publishing Date |
|:---------------|:----------------|
| 2020:Q2        | 2020-09-01      |
| 2020:Q3        | 2020-12-03      |
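
A release calendar like this can also tell an automated script which quarter it should expect to find on the website. The sketch below uses only the two dates listed above; the latest_available() helper is hypothetical and not part of myfuns.R.

```r
# Release calendar, copied from the table above
releases <- data.frame(
  ref       = c("2020:Q2", "2020:Q3"),
  published = as.Date(c("2020-09-01", "2020-12-03")),
  stringsAsFactors = FALSE
)

# Most recent reference quarter already published on a given day
latest_available <- function(today, releases) {
  avail <- releases[releases$published <= today, ]
  if (nrow(avail) == 0) return(NA_character_)
  avail$ref[which.max(avail$published)]
}

latest_available(as.Date("2020-09-03"), releases)  # "2020:Q2"
latest_available(as.Date("2020-12-03"), releases)  # "2020:Q3"
```

Passing Sys.Date() as today would give the answer for the day the script runs.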

Stay tuned.

Thank you for reading!

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.