Resources and further references | Data Analytics: Learning from Data

Finance Data

Fri, 31 Jul 2020 00:00:00 +0000

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
–Arthur Conan Doyle, The Adventure of the Copper Beeches

The easiest way to get data is when someone makes a CSV file available and we can read it directly off the web with readr::read_csv() or with data.table::fread(). Alternatively, we can use the rio package to import many different types of files (Excel, SPSS, Stata, etc.).
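As a minimal sketch of the CSV route, the snippet below writes a tiny file to a temporary location and reads it back with readr::read_csv(). The file contents and column names are made up for illustration; in practice `csv_path` would be the URL or path of the file you actually want.

```r
library(readr)

# Hypothetical data written to a temporary file so the example runs offline;
# read_csv() accepts a URL in exactly the same position as this path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("symbol,date,close",
             "AAA,2020-07-29,10.5",
             "AAA,2020-07-30,10.8",
             "BBB,2020-07-30,21.0"),
           csv_path)

# Declare the column types explicitly rather than relying on guessing
prices <- read_csv(csv_path,
                   col_types = cols(symbol = col_character(),
                                    date   = col_date(),
                                    close  = col_double()))
prices
```

Stating the column types up front is optional, but it makes the import fail loudly if the file does not look the way we expect.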

In this section we will look at three packages that wrap Application Programming Interfaces (APIs) to get data off the web:

tidyquant to get finance data
wbstats to get data from the World Bank database, and
eurostat to get Eurostat data.

Finance data with the `tidyquant` package

The tidyquant package comes with a number of utility functions that allow us to download financial data off the web, as well as ways of handling all this data.

We begin by loading the data into the R workspace. We create a collection of stocks with their ticker symbols and then use the piping operator %>% to pass it to tidyquant’s tq_get, which downloads historical data from Yahoo Finance, and, again, to group the data by ticker symbol.

library(tidyverse) # dplyr verbs, glimpse() and ggplot2
library(tidyquant)
myStocks <- c("AAPL","JPM","DIS","DPZ","ANF","TSLA","XOM","SPY") %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2020-07-31") %>%
  group_by(symbol)

glimpse(myStocks) # examine the structure of the resulting data frame

## Rows: 19,280
## Columns: 8
## Groups: symbol [8]
## $ symbol   <chr> "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "AAPL", "A...
## $ date     <date> 2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-0...
## $ open     <dbl> 46.5, 47.5, 47.1, 47.8, 47.7, 48.4, 49.3, 49.0, 49.3, 49.4...
## $ high     <dbl> 47.2, 47.5, 47.8, 47.9, 48.0, 49.0, 49.3, 49.2, 49.5, 49.8...
## $ low      <dbl> 46.4, 46.9, 47.1, 47.6, 47.4, 48.2, 48.5, 48.9, 49.1, 49.2...
## $ close    <dbl> 47.1, 47.3, 47.7, 47.7, 48.0, 48.9, 48.8, 49.2, 49.4, 49.8...
## $ volume   <dbl> 1.11e+08, 7.73e+07, 6.39e+07, 7.51e+07, 7.80e+07, 1.12e+08...
## $ adjusted <dbl> 40.8, 41.0, 41.3, 41.3, 41.6, 42.4, 42.3, 42.6, 42.8, 43.1...

For each ticker symbol, the data frame contains its symbol, the date, the open, high, low and close prices, and the volume, or how many shares were traded on that day. More importantly, the data frame contains the adjusted closing price, which adjusts for any stock splits and/or dividends paid; this is what we will be using for our analyses.

Calculating financial returns

Financial performance and CAPM analysis depend on returns and not on adjusted closing prices. So given the adjusted closing prices, our first step is to calculate daily and monthly returns.

# calculate daily (log) returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily.returns")

# calculate monthly (arithmetic) returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly.returns")

# calculate yearly (arithmetic) returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly.returns")

For yearly and monthly data, we assume discrete changes, so the formula used to calculate the return for period (t+1) is

$Return(t+1)= \frac{Adj.Close(t+1)}{Adj.Close(t)}-1$

For daily data we use log returns, or $Return(t+1)= \ln\left(\frac{Adj.Close(t+1)}{Adj.Close(t)}\right)$

The reasons we use log returns are:

Compound interest interpretation; namely, that the log return can be interpreted as the continuously (rather than discretely) compounded rate of return
Log returns are assumed to follow a normal distribution
Log return over n periods is the sum of n log returns
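The additivity property is easy to check by hand. The sketch below uses a short vector of made-up adjusted closing prices (not data from the stocks above) and compares the two definitions of return.

```r
# Hypothetical adjusted closing prices for four consecutive periods
adj_close <- c(100, 102, 99, 105)

# Arithmetic (discrete) returns: Adj.Close(t+1) / Adj.Close(t) - 1
arithmetic_returns <- adj_close[-1] / adj_close[-length(adj_close)] - 1

# Log returns: ln(Adj.Close(t+1) / Adj.Close(t))
log_returns <- diff(log(adj_close))

# Log returns add up: their sum is the log return over the whole period
sum(log_returns)
log(adj_close[4] / adj_close[1])

# Arithmetic returns have to be compounded instead
prod(1 + arithmetic_returns) - 1
adj_close[4] / adj_close[1] - 1
```

Each pair of lines at the end should print the same number, which is why log returns are convenient when aggregating daily data over longer horizons.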

Summarising the data set

Let us get quick summary statistics of daily returns for each stock, as well as a density plot where we use facet_grid to show each stock’s distribution in its own panel of a single plot.

symbol	min	median	max	mean	sd	annual_mean	annual_sd
AAPL	-0.138	0.001	0.113	0.001	0.017	0.233	0.276
ANF	-0.307	0.001	0.296	-0.001	0.034	-0.152	0.540
DIS	-0.139	0.001	0.135	0.001	0.015	0.129	0.241
DPZ	-0.106	0.001	0.228	0.001	0.018	0.344	0.287
JPM	-0.162	0.001	0.166	0.000	0.018	0.111	0.284
SPY	-0.116	0.001	0.087	0.000	0.011	0.117	0.172
TSLA	-0.215	0.001	0.218	0.002	0.034	0.417	0.535
XOM	-0.130	0.000	0.119	0.000	0.015	-0.026	0.232
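A table like the one above can be built along the following lines. This is a sketch on simulated returns for two made-up symbols, not the code that produced the table; it assumes 250 trading days in a year, so the annualised mean is the daily mean times 250 and the annualised standard deviation is the daily one times the square root of 250.

```r
library(dplyr)

# Simulated daily returns for two hypothetical symbols
set.seed(1)
daily <- data.frame(
  symbol        = rep(c("AAA", "BBB"), each = 250),
  daily.returns = c(rnorm(250, mean = 0.001,  sd = 0.02),
                    rnorm(250, mean = 0.0005, sd = 0.01))
)

daily_summary <- daily %>%
  group_by(symbol) %>%
  summarise(
    min    = min(daily.returns),
    median = median(daily.returns),
    max    = max(daily.returns),
    mean   = mean(daily.returns),
    sd     = sd(daily.returns)
  ) %>%
  mutate(
    annual_mean = mean * 250,       # assumes 250 trading days per year
    annual_sd   = sd * sqrt(250)    # volatility scales with the square root of time
  )

daily_summary
```

With the real data, `daily` would be replaced by myStocks_returns_daily.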

Daily returns seem to follow a normal distribution with a mean close to zero. Since most people think of returns on an annual, rather than on a daily basis, we can calculate summary statistics of annual returns, a boxplot of annual returns, and a bar plot that shows return for each stock on a year-by-year basis.

myStocks_returns_annual %>% 
  group_by(symbol) %>% 
  mutate(median_return= median(yearly.returns)) %>% 

  # arrange stocks by median yearly return, so highest median return appears first, etc.   
  ggplot(aes(x=reorder(symbol, median_return), y=yearly.returns, colour=symbol)) +
  geom_boxplot()+
  coord_flip()+
  labs(x="Stock", 
       y="Returns", 
       title = "Boxplot of Annual Returns")+
  scale_y_continuous(labels = scales::percent_format(accuracy = 2))+
  guides(color = "none") + # "none" replaces the deprecated FALSE
  theme_bw()+
  NULL

ggplot(myStocks_returns_annual, aes(x=year(date), y=yearly.returns, fill=symbol)) +
  geom_col(position = "dodge")+
  labs(x="Year", y="Returns", title = "Annual Returns")+
  scale_y_continuous(labels = scales::percent)+
  guides(fill=guide_legend(title=NULL))+
  theme_bw()+
  NULL

Minimum and maximum price of each stock by quarter

What if we wanted to find out and visualise the min/max price by quarter?
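One possible approach is to add year and quarter columns with lubridate and then summarise within each group. The sketch below runs on a simulated price series for a single made-up symbol; with the real data the same pipeline could start from myStocks, which has the same symbol, date and adjusted columns.

```r
library(dplyr)
library(lubridate)

# Simulated adjusted prices for one hypothetical symbol over two quarters
set.seed(42)
prices <- data.frame(
  symbol = "AAA",
  date   = seq(as.Date("2020-01-01"), as.Date("2020-06-30"), by = "day")
)
prices$adjusted <- 100 + cumsum(rnorm(nrow(prices)))

quarterly_range <- prices %>%
  mutate(year = year(date), quarter = quarter(date)) %>%
  group_by(symbol, year, quarter) %>%
  summarise(min_price = min(adjusted),
            max_price = max(adjusted),
            .groups   = "drop")

quarterly_range
```

The resulting table has one row per symbol and quarter, so it can be passed straight to ggplot, for example with one line for min_price and one for max_price.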

Sharpe Ratio

The Sharpe ratio, introduced by William F. Sharpe, is used to understand the return of an investment compared to its risk. It is simply the return on an asset per unit of risk, with the unit of risk typically being the standard deviation of the returns of that particular asset.

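Mathematically, the ratio is the average return earned in excess of the risk-free rate per unit of volatility: $Sharpe\ Ratio = \frac{R_{p}-R_{f}}{\sigma_{p}}$, where $R_{p}$ is the return of the asset or portfolio, $R_{f}$ is the risk-free rate, and $\sigma_{p}$ is the standard deviation of the asset’s returns.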

Generally, the greater the value of the Sharpe ratio, the more attractive the risk-adjusted return.
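To make the formula concrete, here is a hand calculation on a few made-up monthly returns, assuming a risk-free rate of zero. It is only meant to illustrate the standard-deviation version of the ratio.

```r
# Hypothetical monthly returns of a single asset
monthly_returns <- c(0.02, -0.01, 0.03, 0.015, -0.005, 0.01)

# Assumed risk-free rate per month
risk_free <- 0

# Average excess return per unit of volatility
sharpe_ratio <- (mean(monthly_returns) - risk_free) / sd(monthly_returns)
sharpe_ratio
```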

myStocks_returns_monthly %>%
  tq_performance(Ra = monthly.returns, #the name of the variable containing the returns of the asset
                 Rb = NULL, 
                 performance_fun = SharpeRatio) %>% 
  kable() %>%
  kable_styling(c("striped", "bordered"))

symbol	ESSharpe(Rf=0%,p=95%)	StdDevSharpe(Rf=0%,p=95%)	VaRSharpe(Rf=0%,p=95%)
AAPL	0.163	0.296	0.211
JPM	0.068	0.166	0.102
DIS	0.104	0.203	0.147
DPZ	0.313	0.427	0.416
ANF	-0.010	-0.022	-0.014
TSLA	0.207	0.286	0.329
XOM	-0.002	-0.006	-0.003
SPY	0.119	0.280	0.193

Investment Growth

Finally, we may want to see what our investments would have grown to, if we had invested $1000 in each of the assets on Jan 1, 2011.
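The idea behind such a growth calculation is to compound the arithmetic returns. A minimal sketch with three made-up monthly returns:

```r
# Hypothetical arithmetic monthly returns
monthly_returns <- c(0.05, -0.02, 0.03)

initial_investment <- 1000

# Value of the investment at the end of each month:
# each month's value is the previous value times (1 + return)
investment_value <- initial_investment * cumprod(1 + monthly_returns)
investment_value
```

With the real data, the same cumprod() step would be applied to monthly.returns within each symbol.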

Scatterplots of individual stocks returns versus S&P500 Index returns

Besides these exploratory graphs of returns and price evolution, we also need to create scatterplots among the returns of different stocks. ggpairs from the GGally package creates a scatterplot matrix that shows the distribution of returns for each stock along the diagonal, and scatterplots and correlations for each pair of stocks. A ggpairs() scatterplot matrix typically takes a while to run.

# reshape daily returns to wide format: one column of returns per ticker symbol
table_capm_returns <- myStocks_returns_daily %>%
            spread(key = symbol, value = daily.returns)

table_capm_returns[-1] %>% #exclude "Date", the first column, from the correlation matrix
  GGally::ggpairs() +
  theme_bw()+
    theme(axis.text.x = element_text(angle = 90, size=8),
         axis.title.x = element_blank())
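If we only need the numbers, a plain correlation matrix is a quicker check than the full scatterplot matrix. The sketch below uses simulated returns for three made-up tickers; with the real data, table_capm_returns[-1] would take the place of `returns_wide`.

```r
# Simulated daily returns in wide format: one column per hypothetical ticker
set.seed(7)
market <- rnorm(500, mean = 0, sd = 0.01)
returns_wide <- data.frame(
  AAA = market + rnorm(500, sd = 0.010),
  BBB = market + rnorm(500, sd = 0.020),
  CCC = rnorm(500, sd = 0.015)          # unrelated to the other two
)

# Pairwise correlations, ignoring any missing values pair by pair
correlations <- cor(returns_wide, use = "pairwise.complete.obs")
round(correlations, 2)
```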

Creating a portfolio of assets

DPZ may have been the best performing stock, but you believe that you can create a portfolio of technology stocks that will beat the relevant sector index, XLK. To create a portfolio, you need to choose a few stocks and then the weights, or how much of your total investment is allocated to each stock. To keep things simple we will assume you will choose among AAPL, GOOG, MSFT, NFLX, and NVDA and you will compare your performance against the sector index, XLK. We will also add two non-tech stocks, TSLA and DPZ, so we can see their position on the risk/return frontier.

ticker_symbols <- c("AAPL","GOOG","MSFT","NFLX","NVDA", "XLK", "TSLA", "DPZ") 

tech_stock_returns_monthly <- ticker_symbols %>%
    tq_get(get  = "stock.prices",
           from = "2011-01-01",
           to   = "2020-07-31") %>%
    group_by(symbol) %>%
    tq_transmute(select     = adjusted, 
                 mutate_fun = periodReturn, 
                 period     = "monthly", 
                 col_rename = "monthly_return")


baseline_returns_monthly <- "XLK" %>%
    tq_get(get  = "stock.prices",
           from = "2011-01-01",
           to   = "2020-07-31") %>%
    tq_transmute(select     = adjusted, 
                 mutate_fun = periodReturn, 
                 period     = "monthly", 
                 col_rename = "baseline_return")

# Summary Stats for individual Stocks
stocks_risk_return <- tech_stock_returns_monthly %>%
  tq_performance(Ra = monthly_return, Rb = NULL, performance_fun = table.Stats) %>% 
  select(symbol, ArithmeticMean, GeometricMean, Minimum,Maximum,Stdev, Quartile1, Quartile3) 



library(ggrepel) # provides geom_text_repel()
ggplot(stocks_risk_return, aes(x=Stdev, y = ArithmeticMean, colour= symbol, label= symbol))+
  geom_point(size = 4)+
  labs(title = 'Risk/Return profile of technology stocks', 
       x = 'Risk (stdev of monthly returns)', 
       y ="Average monthly return")+
  theme_bw()+
  scale_x_continuous(labels = scales::percent)+
  scale_y_continuous(labels = scales::percent)+
  geom_text_repel()+
  theme(legend.position = "none")

We have the monthly returns of the individual stocks and the relevant sector index. To create a portfolio, we must specify the weights; as an example, suppose we only choose three stocks and invest 50% in AAPL, 35% in NFLX, and 15% in NVDA. To do this, we create a two-column tibble, with symbols in the first column and weights in the second; any symbol not specified gets a weight of zero by default.

weights_map <- tibble(
    symbols = c("AAPL", "NFLX", "NVDA"),
    weights = c(0.5, 0.35, 0.15)
)

tech_portfolio_returns <- tech_stock_returns_monthly %>%
    tq_portfolio(assets_col  = symbol, 
                 returns_col = monthly_return, 
                 weights     = weights_map, 
                 col_rename  = "monthly_portfolio_return")
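# A hand check of what the weights mean, using made-up returns for one month.
# In the first month (or whenever the portfolio is rebalanced back to these
# weights) the portfolio return is the weighted sum of the stock returns.
# The numbers below are hypothetical and are not taken from the downloaded data.

```r
weights     <- c(AAPL = 0.5, NFLX = 0.35, NVDA = 0.15)
first_month <- c(AAPL = 0.04, NFLX = -0.02, NVDA = 0.10)  # hypothetical returns

# Weighted sum, matching weights to returns by ticker name
portfolio_return <- sum(weights * first_month[names(weights)])
portfolio_return
```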

tech_portfolio_returns %>%
    ggplot(aes(x = date, y = monthly_portfolio_return)) +
    geom_col() +
    scale_y_continuous(labels = scales::percent) +
    # geom_bar(stat = "identity", fill = palette_light()[[1]]) +
    labs(title = "Tech Portfolio Returns",
         subtitle = "50% AAPL, 35% NFLX, and 15% NVDA",
         x = "", y = "Monthly Returns") +
    theme_bw()

portfolio_growth_monthly <- tech_stock_returns_monthly %>%
    tq_portfolio(assets_col   = symbol, 
                 returns_col  = monthly_return, 
                 weights      = weights_map, 
                 col_rename   = "investment.growth",
                 wealth.index = TRUE) %>%
    mutate(investment.growth = investment.growth * 1000)

plot1 <- portfolio_growth_monthly %>%
    ggplot(aes(x = date, y = investment.growth)) +
    geom_line(size = 2) +
    labs(title = "Portfolio Growth",
         subtitle = "50% AAPL, 35% NFLX, and 15% NVDA",
         x = "", y = "Portfolio Value") +
    # geom_smooth(method = "loess", se = FALSE) +
    theme_bw() +
    scale_y_continuous(labels = scales::dollar)

Now that we have our portfolio returns and the baseline returns of the XLK index, we can merge to get our consolidated table of asset and baseline returns, create a scatter plot and fit a CAPM model.

tech_single_portfolio <- left_join(tech_portfolio_returns, 
                                   baseline_returns_monthly,
                                   by = "date")
tech_single_portfolio

## # A tibble: 115 x 3
##    date       monthly_portfolio_return baseline_return
##    <date>                        <dbl>           <dbl>
##  1 2011-01-31                  0.162           0.0204 
##  2 2011-02-28                 -0.00466         0.0219 
##  3 2011-03-31                  0.0122         -0.0155 
##  4 2011-04-29                  0.00601         0.0261 
##  5 2011-05-31                  0.0609         -0.0105 
##  6 2011-06-30                 -0.0586         -0.0248 
##  7 2011-07-29                  0.0592          0.00428
##  8 2011-08-31                 -0.0596         -0.0531 
##  9 2011-09-30                 -0.215          -0.0307 
## 10 2011-10-31                 -0.00422         0.102  
## # ... with 105 more rows

ggplot(tech_single_portfolio, aes(x = baseline_return, y= monthly_portfolio_return)) +
  geom_point()+
  geom_smooth(method="lm", se=FALSE) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Baseline returns (XLK)", 
       y= "Tech Portfolio Return", 
       title= "How do our tech fund returns compare to the sector index XLK?")

portfolio_CAPM <- lm(monthly_portfolio_return ~ baseline_return, data = tech_single_portfolio)
summary(portfolio_CAPM)

## 
## Call:
## lm(formula = monthly_portfolio_return ~ baseline_return, data = tech_single_portfolio)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1841 -0.0375 -0.0016  0.0326  0.1618 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.00836    0.00579    1.44     0.15    
## baseline_return  1.27904    0.12865    9.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0586 on 113 degrees of freedom
## Multiple R-squared:  0.467,  Adjusted R-squared:  0.462 
## F-statistic: 98.8 on 1 and 113 DF,  p-value: <2e-16

autoplot(portfolio_CAPM, which = 1:3) +
  theme_bw()
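In a regression like this, the slope on baseline_return is usually read as the portfolio’s beta and the intercept as its alpha. The sketch below, on simulated returns rather than the data above, shows that the slope from lm() is the covariance of the two return series divided by the variance of the baseline. Note that a textbook CAPM regression uses returns in excess of the risk-free rate; like the model above, this sketch uses raw returns.

```r
# Simulated monthly returns: a baseline index and a portfolio built from it
set.seed(123)
baseline_return  <- rnorm(115, mean = 0.01, sd = 0.04)
portfolio_return <- 0.005 + 1.3 * baseline_return + rnorm(115, sd = 0.05)

fit <- lm(portfolio_return ~ baseline_return)

# Beta: the regression slope, which equals cov(portfolio, baseline) / var(baseline)
beta_lm  <- unname(coef(fit)["baseline_return"])
beta_cov <- cov(portfolio_return, baseline_return) / var(baseline_return)

# Alpha: the regression intercept
alpha <- unname(coef(fit)["(Intercept)"])

c(beta_lm = beta_lm, beta_cov = beta_cov, alpha = alpha)
```

A beta above 1 means the portfolio tends to move more than the index, in both directions.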

Creating various portfolios by changing weights of assets

Suppose we wanted to examine a few more portfolios by varying the weights.

Naive portfolio: you split your investment equally among the five stocks, so each of them has a weight of 20%
Bitcoin mining: you invest 80% in NVDA and 20% in GOOG
Binge TV watching: you invest most (70%) in NFLX and 10% each in AAPL, GOOG, and MSFT

ticker_symbols = c("AAPL", "GOOG", "MSFT", "NFLX", "NVDA")

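# weights are listed in the order of ticker_symbols: AAPL, GOOG, MSFT, NFLX, NVDA
weights <- c(
    0.2, 0.2, 0.2, 0.2, 0.2, # Naive
    0, 0.2, 0, 0, 0.8,       # Bitcoin mining
    0.1, 0.1, 0.1, 0.7, 0    # Binge TV watching
)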

weights_table <-  tibble(ticker_symbols) %>%
    tq_repeat_df(n = 3) %>%
    bind_cols(tibble(weights)) %>%
    group_by(portfolio)


stock_returns_monthly_multi <- tech_stock_returns_monthly %>%
    tq_repeat_df(n = 3)

# Calculate monthly returns for all portfolios
portfolio_returns_monthly_multi <- stock_returns_monthly_multi %>%
    tq_portfolio(assets_col   = symbol, 
                 returns_col  = monthly_return, 
                 weights      = weights_table, 
                 col_rename   = "portfolio_return",
                 wealth.index = FALSE) 

# Calculate what an investment of 1000 will grow to 
portfolio_growth_monthly_multi <- stock_returns_monthly_multi %>%
    tq_portfolio(assets_col   = symbol, 
                 returns_col  = monthly_return, 
                 weights      = weights_table, 
                 col_rename   = "investment.growth",
                 wealth.index = TRUE) %>%
    mutate(investment.growth = investment.growth * 1000)

portfolio_growth_monthly_multi %>%
  ggplot(aes(x = date, y = investment.growth, colour = as.factor(portfolio))) +
  geom_line(size = 2) +
  labs(title = "Portfolio Growth",
       subtitle = "Comparing Multiple Portfolios",
       x = "", y = "Portfolio Value",
       color = "Portfolio") +
  theme_bw()+
  scale_y_continuous(labels = scales::dollar)+
  scale_colour_discrete(name="Portfolio",
                      labels=c("Naive", "Bitcoiners", "Binge Watchers"))

# Returns a basic set of statistics that match the period of the data passed in (e.g., monthly returns 
# will get monthly statistics, daily will be daily stats, and so on).

portfolio_risk_return <- portfolio_returns_monthly_multi %>%
  tq_performance(Ra = portfolio_return, Rb = NULL, performance_fun = table.Stats) %>% 
  select(portfolio, ArithmeticMean, GeometricMean, Minimum,Maximum,Stdev, Quartile1, Quartile3) 

portfolio_risk_return %>% 
  kable() %>%
  kable_styling(c("striped", "bordered"))

portfolio	ArithmeticMean	GeometricMean	Minimum	Maximum	Stdev	Quartile1	Quartile3
1	0.026	0.023	-0.176	0.232	0.069	-0.017	0.076
2	0.033	0.028	-0.242	0.408	0.104	-0.028	0.079
3	0.032	0.028	-0.232	0.360	0.098	-0.026	0.074

ggplot(portfolio_risk_return, 
       aes(x=Stdev, 
           y = ArithmeticMean,
           label= portfolio, 
           colour= as.factor(portfolio)))+
  geom_point(size = 4)+
  labs(title = 'Risk/Return profile of the three portfolios', 
       x = 'Risk (stdev of monthly returns)', 
       y ="Average monthly return")+
  theme_bw()+
  scale_x_continuous(labels = scales::percent)+
  scale_y_continuous(labels = scales::percent)+
  scale_colour_discrete(name="Portfolio",
                      labels=c("Naive", "Bitcoiners", "Binge Watchers"))+
  geom_text_repel()

Data from the Federal Reserve Economic Data with `tidyquant`

A lot of economic data can be extracted from the Federal Reserve Economic Data (FRED) database. For each data series we are interested in, we need to get its FRED symbol; for instance, if we cared about commodities, we could select the Henry Hub Natural Gas Spot Price and notice that its FRED symbol is DHHNGSP.

So, if we wanted to download this, as well as prices of WTI crude, gold, and USD:EUR, we first identify the FRED codes which are shown below

Henry Hub Natural Gas Spot Price: DHHNGSP
WTI Crude Oil Prices: DCOILWTICO
Gold Fixing Price: GOLDAMGBD228NLBM
U.S. / Euro Exchange Rate: DEXUSEU

To get the data and plot it:

natgas_spot  <-   tq_get("DHHNGSP", get = "economic.data",
                       from = "2011-01-01",
                       to   = "2020-07-31")

ggplot(natgas_spot, aes(x=date, y=price)) +
  geom_line()+
  labs(x="Year", 
       y="NatGas Spot price", 
       title = "Henry Hub Natural Gas Spot Prices")+
  scale_y_continuous(labels = scales::dollar)+
  guides(fill=guide_legend(title=NULL))+
  theme_bw()+
  NULL

wti_price  <-   tq_get("DCOILWTICO", get = "economic.data",
                       from = "2011-01-01",
                       to   = "2020-07-31")

ggplot(wti_price, aes(x=date, y=price)) +
  geom_line()+
  labs(x="Year", 
       y="WTI price", 
       title = "West Texas Intermediate Crude Oil (WTI) Prices")+
  scale_y_continuous(labels = scales::dollar)+
  guides(fill=guide_legend(title=NULL))+
  theme_bw()+
  NULL

gold_price  <-   tq_get("GOLDAMGBD228NLBM", get = "economic.data",
                        from = "2011-01-01",
                        to   = "2020-07-31") 

ggplot(gold_price, aes(x=date, y=price)) +
  geom_line()+
  labs(x="Year", 
       y="Gold price", 
       title = "Gold Fixing Price 10:30 A.M. (London time) in London Bullion Market")+
  scale_y_continuous(labels = scales::dollar)+
  guides(fill=guide_legend(title=NULL))+
  theme_bw()+
  NULL

USDEUR_rate <-   tq_get("DEXUSEU", get = "economic.data",
                        from = "2011-01-01",
                        to   = "2020-07-31") 

ggplot(USDEUR_rate, aes(x=date, y=price)) +
  geom_line()+
  labs(x="Year", 
       y="Exchange rate", 
       title = "USD to EUR Exchange Rate")+
  scale_y_continuous(labels = scales::dollar)+
  guides(fill=guide_legend(title=NULL))+
  theme_bw()+
  NULL

Now suppose we wanted to check if there is any correlation between natgas spot prices, WTI, and Gold prices. We will download prices, then calculate returns, calculate statistics on daily returns, and visualise some of the returns.

commodities <- c("DHHNGSP", "DCOILWTICO", "GOLDAMGBD228NLBM")

commodities_prices  <- tq_get(commodities, get = "economic.data",
                              from = "2011-01-01",
                              to   = "2020-07-31") %>% 
  group_by(symbol) 


commodities_returns_daily <- commodities_prices %>% na.omit() %>% 
  tq_transmute(select     = price, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily.returns")  

#calculate monthly  returns
commodities_returns_monthly <- commodities_prices %>%
  tq_transmute(select     = price, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly.returns") 

favstats(daily.returns ~ symbol,  data=commodities_returns_daily) %>% 
  mutate(
    annual_mean = mean *250,
    annual_sd = sd * sqrt(250)
  ) %>% 
  select(symbol, min, median, max, mean, sd, annual_mean, annual_sd)  %>% 
  kable() %>%
  kable_styling(c("striped", "bordered"))

symbol	min	median	max	sd	annual_mean	annual_sd
DCOILWTICO	-0.281	0.001	0.426	0.030	-0.008	0.474
DHHNGSP	-0.476	0.000	0.525	0.042	-0.093	0.668
GOLDAMGBD228NLBM	-0.089	0.000	0.068	0.010	0.034	0.154

ggplot(commodities_returns_daily, aes(x=daily.returns, fill=symbol))+
  geom_density()+
  coord_cartesian(xlim=c(-0.05,0.05)) + 
  scale_x_continuous(labels = scales::percent_format(accuracy = 2))+
  facet_grid(rows = (vars(symbol))) + 
  theme_bw()+
  labs(x="Daily Returns", 
       y="Density", 
       title = "Charting the Distribution of Daily Log Returns")+
  guides(fill = "none") + # "none" replaces the deprecated FALSE
  NULL

ggplot(commodities_returns_daily, aes(x=symbol, y=daily.returns))+
  geom_boxplot(aes(colour=symbol))+
  coord_flip()+
  scale_y_continuous(labels = scales::percent_format(accuracy = 2))+
  theme_bw()+
  labs(x="", 
       y="Daily Returns", 
       title = "Boxplot of Daily Log Returns")+
  theme(legend.position="none") +
  NULL

commodities_returns_daily %>% 
  pivot_wider(names_from="symbol", values_from="daily.returns") %>% 
  na.omit() %>% 
  select(-date) %>% 
  dplyr::rename(
    "NatGas" = 'DHHNGSP',
    "WTI Oil" = 'DCOILWTICO',
    "Gold" = 'GOLDAMGBD228NLBM'
  ) %>% 
  GGally::ggpairs()+
  theme_bw()

Acknowledgments

This page is derived in part from Performance Analytics with tidyquant by Matt Dancho.

Installing R and RStudio

Sat, 25 Jul 2020 00:00:00 +0000

In this section we download and install R and RStudio, and then show you how to write R commands and navigate around the RStudio interface. The goal in this chapter is not to learn any statistical or programming concepts: we’re just trying to learn how R works and get comfortable interacting with the system. We’ll spend a bit of time using R as a simple calculator. Specifically, we will learn the basics of R and RStudio, namely

How to install R and RStudio
How to navigate around the interface of RStudio, a free Integrated Development Environment (IDE) for R
How to install and load packages that provide extra functionality for R

Installing R & RStudio

An important distinction to remember is between the R programming language itself, and the software you use to interact with R. You could choose to interact with R directly from the terminal, but that’s painful, so most people use an integrated development environment (IDE), which takes care of a lot of boring tasks for you. To get started, make sure you have both R and RStudio installed on your computer. Both are free and open source, and for most people they should be straightforward to install.

Install `Xcode` if you have a Mac

If you have a Mac, make sure that before installing R and RStudio you
- upgrade to the latest version of macOS
- install Xcode through the App Store

Install R

First you need to install R itself (the engine). Go to CRAN (the Comprehensive R Archive Network); this is the site where R itself and most R packages live. Click on “Download R for XXX”, where XXX is either Mac or Windows.

Double click on the downloaded file. Click Yes through all the prompts to install like any other program. Once finished, proceed to install RStudio.

Install RStudio IDE

Go to the RStudio website, and follow the links to download. RStudio is a powerful user interface for programming in R. I suggest you install the preview version of RStudio.

To get started, open the RStudio application (i.e., RStudio.exe or RStudio.app), not the vanilla R application (i.e., not R.exe or R.app). You should be looking at something like this:

The RStudio IDE is divided into 4 separate panes (one of which is hidden for now) which all serve specific functions. The Console starts with information about the R version number, license and contributors. The last line is a standard prompt > that indicates R is ready and expecting instructions to do something.

You edit scripts in the editor panel in RStudio and see results in the bottom right output panel.

For now, to make sure R and RStudio are set up correctly, type x <- 3 + 2 into the Console pane and execute it by pressing Enter/Return. You just created an object in R called x. What does this object contain? Type print(x) or just x into the console and press Enter again. Your console should now contain the following output:

x <- 3 + 2
print(x)

## [1] 5

Congratulations! You installed R and RStudio successfully, created an object x to which you assigned the value 3+2, and managed to print the value of x.

Change character encoding to `UTF-8`, and `UTF-8` only

This may seem like an overly technical issue, but please bear with me. Since LBS is a very international school, we always seem to have issues with the language, or character encoding (Chinese, Arabic, Greek, Cyrillic, Hebrew, Thai, French, German, etc.), that people use in their computers. By default, all base R functions use the system native language encoding which has to do with the different languages some of us may have on our computers. Chinese and Greek users, having a completely different alphabet, typically report issues/problems/errors related to character encodings.

UTF-8 is the best possible character encoding; it works everywhere, so we shall ask RStudio to use UTF-8 encoding globally. Please go to Tools… Global Options… Code… Saving and change the default text encoding to UTF-8 as shown below:

Exiting R & RStudio

When quitting RStudio you will be asked whether to Save workspace with two options:

Yes - Your current R workspace (containing the work that you have done) will be restored next time you open RStudio.
No - You will start with a fresh R session next time you open RStudio. For now select “No” to prevent errors being carried over from previous sessions.

In general, it’s good practice to always start with a fresh new session. If you want to do that, please go to Tools… Global Options and make sure that

Restore .RData into workspace at startup is NOT ticked
Save workspace to .RData on exit: select Never
Always save history (even when not saving .RData) is NOT ticked

as shown below

Updating R and RStudio

If you already installed R or RStudio for a previous course, update both to the most current version. Generally this entails downloading and installing the most recent version of both programs. When you update R, you don’t actually remove the old version: you have all versions on your computer and default to the most recent one. Sometimes this is useful when specific R packages require an older version of R; however, we will generally stick to the most recent versions of R and RStudio.
When you update R, make sure to update your packages as well. The following command should perform most of this work, update.packages(ask = FALSE, checkBuilt = TRUE) or you can go through the Packages tab in the bottom right panel of RStudio.

R commands

We have already seen how we can type commands in the command prompt and use R as a simple calculator. For instance, try typing 5 + 20, and hitting enter. When you do this, you’ve entered a command, and R will execute that command. What you see on screen now will be this:

5 + 20

## [1] 25

Assignment Operator `<-`

R treats everything (single numbers, lists, vectors, datasets) as objects. To create an object, we use the assignment operator <-. For instance, if we had data on a student whose name is Alex, who is 28 years old and comes from Athens, we would create three objects, name, age, and city, and assign them the values Alex, 28, and Athens respectively by typing

name <- "Alex"
age <- 28
city <- "Athens"

The three objects have now been created; if we wanted to print out their values, we can use the print() function or just type the names of the objects.

print(name); print(age); print(city)

## [1] "Alex"

## [1] 28

## [1] "Athens"

name

## [1] "Alex"

age

## [1] 28

city

## [1] "Athens"

You can mentally read the command age <- 28 as object age becomes equal to the value 28. There is a keyboard shortcut Alt + - to get the assignment operator. We can do more interesting and useful things creating variables and assigning values to them. For instance, if we have the relevant dimensions and wanted to calculate the area and volume of a room, we could do it as follows:

room_length <- 5.63
room_width  <- 6.48
room_height <- 2.93
room_area <- room_length * room_width
room_volume <- room_length * room_width * room_height

room_area

## [1] 36.4824

room_volume

## [1] 106.8934

R is case sensitive

R is case sensitive and needs everything exactly as it was defined. age is different from AgE and Age. So if you type

age <- 28
AgE <- 34
Age <- 55

age; AgE; Age

## [1] 28

## [1] 34

## [1] 55

R will create three different objects.

Typos

R is a brilliant piece of software, but it cannot handle typos. Unlike Google’s search, “Did you mean…”, it takes it on faith that what you typed is exactly what you meant. For example, suppose that you forgot to hit the shift key when trying to type +, and as a result your command ended up being 5 = 20 rather than 5 + 20. Here’s what happens:

5 = 20

## Error in 5 = 20: invalid (do_set) left-hand side to assignment

R attempts to interpret 5 = 20 as a command, and spits out an error message because this makes no sense to it. Even more subtle is the fact that some typos won’t produce errors at all, because they happen to correspond to valid R commands. For instance, suppose that instead of 5 + 20, you mistakenly type 5 - 20. Clearly, R has no way of knowing that you meant to add 20 to 5, not subtract 20 from 5, so what happens this time is this:

5 - 20

## [1] -15

In this case, R produces the right answer, but to the wrong question.

R will always try to do exactly what you ask it to do. There is no autocorrect or equivalent to “Did you mean…” in R, and for good reason. When doing advanced stuff (and even the simplest of statistics is pretty advanced in a lot of ways), it’s dangerous to let a mindless automaton like R try to overrule the human user. But because of this, it’s your responsibility to be careful. Always make sure you type exactly what you mean. When dealing with computers, it’s not enough to type approximately the right thing. In general, you absolutely must be precise in what you say to R; like all machines, it is too stupid to be anything other than absurdly literal in its interpretation.

Comments

It is useful to put comments in your code, to make everything more readable. These comments could help others and you when you go back to your code in the future. R comments start with a hashtag sign #. Everything after the hashtag to the end of the line will be ignored by R. RStudio by default thinks that every line you write is a command; if you want to turn a line into a comment, place the cursor in the line and hit Ctrl + Shift + C in Windows or Cmd + Shift + C in a Mac.

# This line is a comment and will be ignored when run.
city # Text after the hashtag "#" is also ignored.

## [1] "Athens"

R knows you’re not finished

If you hit enter in a situation where it’s obvious to R that you haven’t actually finished typing the command, R is just smart enough to keep waiting. For example, if you wanted to calculate 15 - 4, and start by typing 15 - and then press enter by mistake, R is smart enough to realise that you probably wanted to type in another number. So here’s what happens:

> 15 -
+

and there’s a blinking cursor next to the plus + sign. What this means is that R is still waiting for you to finish. It thinks you’re still typing your command, so it hasn’t tried to execute it yet. In other words, this plus sign is actually another command prompt. It’s different from the usual one (i.e., the > symbol) to remind you that R is going to add whatever you type now to what you typed last time. For example, if you then go on to type 4 and hit enter, this is what you get:

> 15 -
+ 4
[1] 11

And as far as R is concerned, this is exactly the same as if you had typed 15 - 4.

By the way, if after entering the 15 - you wanted to stop execution and cancel your command, just hit the escape key. R will return you to the normal command prompt (i.e. >) without attempting to execute the botched command.
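The same rule applies when you write longer commands in a script: R keeps reading for as long as the expression is incomplete, and stops as soon as it is complete. The sketch below (the names a and b are arbitrary) shows why, when splitting a calculation over two lines, the operator should go at the end of the first line rather than at the start of the second.

```r
# The first line ends with an operator, so R knows the command is unfinished
# and keeps reading: this is evaluated as 15 - 4
a <- 15 -
  4

# Here the first line is already a complete command, so R stops there:
# b is assigned 15, and the next line is evaluated separately as -4
b <- 15
  - 4

print(a)
print(b)
```

In this sketch a ends up as 11, while b is 15 because the subtraction never became part of the assignment.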

Arithmetic Operations and Functions

R has the basic operators and you can use it as a simple calculator: addition is +, subtraction is -, multiplication is *, division is /, and ^ is the power operator:

2 + 3

## [1] 5

5 - 8

## [1] -3

13 * 21

## [1] 273

34 / 55

## [1] 0.6181818

(5 * 13)/4 - 7

## [1] 9.25

# ^ : to the power of
2^3

## [1] 8

# square root
sqrt(25)

## [1] 5

Besides the basic arithmetic operators, you can use standard mathematical functions:

Rounding: round(), floor(), ceiling()
Logarithms and exponentials: exp(), log(), log10(), log2()

# R knows pi = 3.1415926...

# round to 2 decimal places 
round(pi, digits = 2); round(pi,2)

## [1] 3.14

## [1] 3.14

# Round down to nearest integer
floor(pi)

## [1] 3

# Round up to nearest integer
ceiling(pi)

## [1] 4
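The logarithm and exponential functions listed above work the same way as the rounding functions. A short sketch with arbitrary example values:

```r
# e raised to the power of 1, i.e. Euler's number
exp(1)

# log() is the natural logarithm by default, so it undoes exp()
log(exp(1))

# logarithms in base 10 and base 2
log10(1000)
log2(8)

# any other base can be supplied through the 'base' argument
log(81, base = 3)
```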

RStudio help

At this stage you know how to type in basic commands, including how to use some R functions. Few analysts bother to try to know or remember all the commands. What they really do is use tricks to make their lives easier. The first (and arguably most important one) is to use the internet. If you don’t know how a particular R function works, Google it. There is a lot of R documentation out there, and almost all of it is searchable! For the moment though, I want to call your attention to a couple of simple tricks that RStudio makes available to you.

Tab autocomplete

The first thing I want to call your attention to is the autocomplete ability in RStudio. Assume that what you want to do is to round a number. This time around, start typing the name of the function that you want, and then hit the Tab key. RStudio will then display a little window like the one shown here:

In this example, we have typed the letters rou at the command line, and then hit tab. The window has two panels. On the left, there’s a list of variables and functions that start with the letters typed, shown in black text, and some grey text that tells you where that variable/function is stored. In our case, round is part of {base} R, which is included in every new installation of R. There are a few options there, and the one we want is round, but if you’re typing this yourself you’ll notice that when you hit the tab key the window pops up with the top entry highlighted. You can use the up and down arrow keys to select the one that you want. Or, if none of the options look right to you, you can hit the escape key (ESC) or the left arrow key to make the window go away.

In our case, the thing we want is the round option, and the panel on the right tells you a bit about how the function works. This display is really handy. The very first thing it says is round(x, digits = 0): what this is telling you is that the round function has two arguments. The first argument is called x, and it doesn’t have a default value. The second argument is digits, and it has a default value of 0. In a lot of situations, that’s all the information you need. But RStudio goes a bit further, and provides some additional information about the function underneath. Sometimes that additional information is very helpful, sometimes it’s not: RStudio pulls that text from the R help documentation, and my experience is that the helpfulness of that documentation varies wildly. Anyway, if you’ve decided that round is the function that you want to use, you can hit the enter key and RStudio will finish typing the rest of the function name for you.
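To see what the default value of digits means in practice, here is a small sketch with an arbitrary number. It also shows that arguments can be matched either by position or by name.

```r
# digits is left out, so its default value of 0 is used
round(3.14159)

# arguments can be matched by position...
round(3.14159, 2)

# ...or by name, which is usually easier to read
round(x = 3.14159, digits = 2)

# The help text that RStudio shows in the popup can also be opened
# in full by typing ?round or help("round") in the console
```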

The history pane

One thing R does is keep track of your command history, i.e., it remembers all the commands previously typed. You can access this history in a few different ways. The simplest way is to use the up and down arrow keys. If you hit the up key, the R console will show you the most recent command that you’ve typed. Hit it again, and it will show you the command before that. If you want the text on the screen to go away, hit escape. Using the up and down keys can be handy if you’ve typed a long command that had one typo in it. Rather than having to type it again from scratch, you can use the up key to bring up the command and fix it.

The second way to get access to your command history is to look at the history panel in RStudio. On the upper right panel of the RStudio window, you’ll see a tab labelled History.

Click on that, and you’ll see a list of all your recent commands displayed in that panel. Double click on one of the commands, and it will be copied to the R console. You can achieve the same result by selecting the command you want with the mouse and then clicking the “To Console” button.

More resources

RStudio have produced a great series of video tutorials: RStudio Essentials Videos
RStudio IDE Cheatsheet

Acknowledgements

This page is derived in part from “R for Psychological Science”.

Textbooks and other resources

Sat, 25 Jul 2020 00:00:00 +0000

The following is a non-exhaustive list of free online textbooks and resources that use R

Textbooks/Readings

R Programming

R for Data Science – Garrett Grolemund and Hadley Wickham
- Open-source online version is available for free; Available for purchase online
- No official solution manual for the book exercises exists, but several can be found online, like this version by Jeffrey B. Arnold. Your exact solutions may vary, but these are a good starting point.
Hands-On Programming with R by Garrett Grolemund. This is a non-statistical introduction to R programming with many hands-on examples.
Advanced R – Hadley Wickham
- Hardcover available online, but the online version is free
- A deeper dive into R as a programming language, not just a tool for data science. Most of this material is best covered on your own after you are familiar with R.
R Programming for Data Science
Modern Data Science with R – Benjamin S. Baumer, Daniel T. Kaplan, Nicholas J. Horton

Statistics with R

Modern Dive: A moderndive into R and the tidyverse by Chester Ismay and Albert Y. Kim
Learning Statistics with R by Danielle Navarro
OpenIntro Statistics Open-source online version is available for free
An Introduction to Statistical Learning: with Applications in R – Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- Each chapter includes code that demonstrates how to implement different methods. Unfortunately, their code uses a lot of base R functions and syntax, whereas our emphasis is on getting things done with the tidyverse collection of R packages. However, this is still a great book and the code provided is useful.
- You can download a free PDF of the entire book from the authors’ site
Broadening Your Statistical Horizons is an applied textbook on generalized linear models, with all of the examples / code in R.
Forecasting: Principles and Practice
Tidy Modeling with R The purpose of this book is to demonstrate how the tidyverse and tidymodels can be used to produce high quality models.
Text Mining with R by Julia Silge and David Robinson. What happens if your data is text, rather than numbers? What if you wanted to do sentiment analysis?

Visualisations

The de-facto standard for visualisations in R is the ggplot2 package. If you want to read Hadley Wickham’s paper that implemented the grammar of graphics into R, you can find it here

Data Visualization: A Practical Introduction by Kieran Healy.
Fundamentals of Data Visualization by Claus O. Wilke.
R Graphics Cookbook A practical guide by Winston Chang that provides many specific examples/recipes to help you generate high-quality graphs quickly. I use it as a quick reference to get my ggplot working.
The BBC Visual and Data Journalism cookbook for R graphics
Interactive web-based data visualization with R, plotly, and shiny

Spatial Visualisations

Online resources

Data science and statistical programming can be challenging. Computers are dumb and tiny errors in your code can cause hours of frustration (even if you’ve been doing this stuff for years!).

Fortunately, there are tons of online resources to help you with this. Two of the most important are StackOverflow (a Q&A site with thousands of answers to all sorts of statistical and programming questions) and RStudio Community (a forum specifically designed for people using RStudio and the tidyverse).

I highly recommend subscribing to the R Weekly newsletter which is sent every Monday and is full of helpful tutorials and ideas on how to do stuff with R.
RStudio Cheatsheets Printable cheat sheets for common R tasks and features

Software

Typora is a lightweight, stand alone editor for Markdown documents

Companies, Government Agencies, and NGOs Using R

Organisations using R

Podcasts

Tim Harford’s More or Less explains and debunks the numbers and statistics used in political debate, the news and everyday life. A great episode on sampling can be found here
Everything Hertz: A podcast by scientists, for scientists. Methodology, scientific life, and bad language

Using Markdown

Sat, 25 Jul 2020 00:00:00 +0000

Markdown is a special kind of markup language that lets you format text with simple syntax. You can then use a converter program like pandoc to convert Markdown into whatever format you want: HTML, PDF, Word, PowerPoint, etc. (see the full list of output types here)

Basic Markdown formatting

Type…	…or…	…to get
Some text in a paragraph. More text in the next paragraph. Always use empty lines between paragraphs.		Some text in a paragraph. More text in the next paragraph. Always use empty lines between paragraphs.
`*Italic*`	`_Italic_`	Italic
`**Bold**`	`__Bold__`	Bold
`# Heading 1`		Heading 1
`## Heading 2`		Heading 2
`### Heading 3`		Heading 3
(Go up to heading level 6 with `######`)
`[Link text](http://www.example.com)`		Link text
`![Image caption](/path/to/image.png)`
`Inline code` with backticks		`Inline code` with backticks
`> Blockquote`		Blockquote
- Things in - an unordered - list	* Things in * an unordered * list	Things in an unordered list
1. Things in 2. an ordered 3. list	1) Things in 2) an ordered 3) list	Things in an ordered list
Horizontal line ---	Horizontal line ***	Horizontal line

Mathematical formulas

Markdown uses LaTeX to create fancy mathematical equations. There are tons of little options and features available for math equations—you can find helpful examples of the most common basic commands here.

You can use math in two different ways: inline or in a display block. To use math inline, wrap it in single dollar signs, like $y = mx + b$ :

Type…	…to get
Based on our regression model for estimating the effect of education on wages is $\hat{y} = \beta_0 + \beta_1 x_1 + \epsilon$, or $\text{Wages} = \beta_0 + \beta_1 \text{Education} + \epsilon$.	Based on our regression model for estimating the effect of education on wages is $\hat{y} = \beta_0 + \beta_1 x_1 + \epsilon$, or $\text{Wages} = \beta_0 + \beta_1 \text{Education} + \epsilon$.

To put an equation on its own line in a display block, wrap it in double dollar signs, like this:

Type…

The quadratic equation was an important part of high school math:

$$
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
$$

But now we just use computers to solve for $x$.

…to get…

The quadratic equation was an important part of high school math:

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

But now we just use computers to solve for $x$.

Because dollar signs are used to indicate math equations, you can’t just use dollar signs like normal if you’re writing about actual dollars. For instance, if you write This book costs $5.75 and this other costs $40, Markdown will treat everything that comes between the dollar signs as math, like so: “This book costs $5.75 and this other costs $40”.

To get around that, put a backslash (\) in front of the dollar signs, so that This book costs \$5.75 and this other costs \$40 becomes “This book costs $5.75 and this other costs $40”.

Tables

There are 4 different ways to hand-create tables in Markdown—I say “hand-create” because it’s normally way easier to use R to generate these things with packages like pander (use pandoc.table()) or knitr (use kable()). The two most common are simple tables and pipe tables. You can find the full documentation here.

For simple tables, type…

  Right     Left     Center     Default
-------     ------ ----------   -------
     12     12        12            12
    123     123       123          123
      1     1          1             1

Table: Caption goes here

…to get…

Caption goes here
Right	Left	Center	Default
12	12	12	12
123	123	123	123
1	1	1	1

For pipe tables, type…

| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|   12  |  12  |    12   |    12  |
|  123  |  123 |   123   |   123  |
|    1  |    1 |     1   |     1  |

Table: Caption goes here

…to get…

Caption goes here
Right	Left	Default	Center
12	12	12	12
123	123	123	123
1	1	1	1
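As mentioned at the start of this section, it is normally easier to let R write the table for you. The sketch below uses knitr::kable() to produce a pipe table like the one above; the data frame example_table and its values are made up to mirror the hand-written example, and the align letters are an assumption about how you would want each column aligned.

```r
library(knitr)

# A small made-up data frame that mirrors the hand-written tables above
example_table <- data.frame(
  Right   = c(12, 123, 1),
  Left    = c(12, 123, 1),
  Default = c(12, 123, 1),
  Center  = c(12, 123, 1)
)

# align takes one letter per column: r = right, l = left, c = center
pipe_table <- kable(
  example_table,
  format  = "pipe",
  align   = c("r", "l", "l", "c"),
  caption = "Caption goes here"
)

# Printing the result shows the Markdown source of the table
print(pipe_table)
```

Inside an R Markdown document you would normally call kable() directly in a code chunk and let knitr insert the table into the output.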

Front matter

You can include a special section at the top of a Markdown document that contains metadata (or data about your document) like the title, date, author, etc. This section uses a special simple syntax named YAML (or “YAML Ain’t Markup Language”) that follows this basic outline: setting: value for setting. Here’s an example YAML metadata section. Note that it must start and end with three dashes (---).

---
title: Title of your document
date: "January 13, 2020"
author: "Your name"
---

You can put the values inside quotes (like the date and name in the example above), or you can leave them outside of quotes (like the title in the example above). I typically use quotes just to be safe—if the value you’re using has a colon (:) in it, it’ll confuse Markdown since it’ll be something like title: My cool title: a subtitle, which has two colons. It’s better to do this:

---
title: "My cool title: a subtitle"
---

If you want to use quotes inside one of the values (e.g. your document is An evaluation of "scare quotes"), you can use single quotes instead:

---
title: 'An evaluation of "scare quotes"'
---
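If you want to check how a metadata line will be read before knitting a document, you can parse it yourself. This is a minimal sketch that assumes the yaml package is installed; it feeds the three title lines discussed above to yaml.load() and wraps the unquoted one in tryCatch(), because a second unquoted colon is expected to make the parser fail.

```r
library(yaml)

# Unquoted value containing a second colon: the parser is expected to reject it,
# so tryCatch() returns a placeholder string instead of stopping the script
unquoted <- tryCatch(
  yaml.load("title: My cool title: a subtitle"),
  error = function(e) "could not be parsed"
)

# Double quotes protect the colon inside the value
quoted <- yaml.load('title: "My cool title: a subtitle"')

# Single quotes allow double quotes inside the value
single <- yaml.load("title: 'An evaluation of \"scare quotes\"'")

print(unquoted)
print(quoted$title)
print(single$title)
```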

Other references

These websites have additional details and examples and practice tools:

CommonMark’s Markdown tutorial: A quick interactive Markdown tutorial.
Markdown tutorial: Another interactive tutorial to practice using Markdown.
Markdown cheatsheet: Useful one-page reminder of Markdown syntax.
The Plain Person’s Guide to Plain Text Social Science: A comprehensive explanation and tutorial about why you should write data-based reports in Markdown.

World Bank Data

Fri, 31 Jul 2020 00:00:00 +0000

The World Bank is one of the world’s largest producers of development data and research. It is a great source of global socio-economic data, spanning several decades and many topics. For example, you can read their 2018 Atlas of Sustainable Development Goals or a blog post on their all-new visual guide to data and development.

The wbstats package allows you to search for and download any open World Bank dataset. To identify the actual indicator you want, you have to find its code either in the World Bank datacatalog or, even better, through wbstats.

Population Growth 1970-2017

Suppose we wanted to get data on population growth. Manually, we would navigate to the World Bank datacatalog website, and search for population growth.

We get various results, but the more important ones are usually at the top with data on Population Growth (Annual %) with code SP.POP.GROW, on Rural Population Growth (Annual %) with code SP.RUR.TOTL.ZG, and on Urban Population Growth (Annual %) with code SP.URB.GROW.

Alternatively, we would load the wbstats package, and use pop_growth_codes <- wb_search(pattern = "population growth") to get a dataframe with the codes that the search function returns.

library(wbstats)

pop_growth_codes <- wb_search(pattern = "population growth")
head(pop_growth_codes)

## # A tibble: 4 x 3
##   indicator_id    indicator            indicator_desc                           
##   <chr>           <chr>                <chr>                                    
## 1 IN.EC.POP.GRWT~ Decadal Growth of P~ Population growth rate over the 10 year ~
## 2 SP.POP.GROW     Population growth (~ Annual population growth rate for year t~
## 3 SP.RUR.TOTL.ZG  Rural population gr~ Rural population refers to people living~
## 4 SP.URB.GROW     Urban population gr~ Urban population refers to people living~

Either way, the indicator we are interested in is Population Growth Annual and its code = SP.POP.GROW. The next step is to download the data with the wbstats::wb_data() function.

The first argument the wb_data function takes is a list of countries; if left empty, it will download all data for individual countries and aggregate regions like Arab World, Euro area, etc. In our example, let us download data for individual countries only, starting in 1970 and ending in 2017.

# Download data for Population Growth Annual% SP.POP.GROW
pop_growth_data <- wb_data(country = "countries_only", 
                      indicator = "SP.POP.GROW", 
                      start_date = 1970, 
                      end_date = 2017,
                      return_wide=FALSE)

glimpse(pop_growth_data)

## Rows: 10,416
## Columns: 11
## $ indicator_id <chr> "SP.POP.GROW", "SP.POP.GROW", "SP.POP.GROW", "SP.POP.G...
## $ indicator    <chr> "Population growth (annual %)", "Population growth (an...
## $ iso2c        <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", ...
## $ iso3c        <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG"...
## $ country      <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanis...
## $ date         <dbl> 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, ...
## $ value        <dbl> 2.55, 2.78, 3.08, 3.36, 3.49, 3.41, 3.14, 2.75, 2.40, ...
## $ unit         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ obs_status   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ footnote     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ last_updated <date> 2020-08-18, 2020-08-18, 2020-08-18, 2020-08-18, 2020-...

The wb_cachelist is a cached version of useful information from the World Bank API and provides a snapshot of available countries, indicators, and other relevant information. The structure of wb_cachelist is as follows

glimpse(wb_cachelist, max.level = 1)

## List of 8
##  $ countries    : tibble [304 x 18] (S3: tbl_df/tbl/data.frame)
##  $ indicators   : tibble [16,607 x 8] (S3: tbl_df/tbl/data.frame)
##  $ sources      : tibble [61 x 9] (S3: tbl_df/tbl/data.frame)
##  $ topics       : tibble [21 x 3] (S3: tbl_df/tbl/data.frame)
##  $ regions      : tibble [48 x 4] (S3: tbl_df/tbl/data.frame)
##  $ income_levels: tibble [7 x 3] (S3: tbl_df/tbl/data.frame)
##  $ lending_types: tibble [4 x 3] (S3: tbl_df/tbl/data.frame)
##  $ languages    : tibble [23 x 3] (S3: tbl_df/tbl/data.frame)

and as we can see it contains data on countries and aggregate regions, well over 16,000 indicators, etc. If we wanted to see the data on countries, let us create a dataframe `countries` and glimpse its contents.

countries <-  wb_cachelist$countries
glimpse(countries)

## Rows: 304
## Columns: 18
## $ iso3c              <chr> "ABW", "AFG", "AFR", "AGO", "ALB", "AND", "ANR",...
## $ iso2c              <chr> "AW", "AF", "A9", "AO", "AL", "AD", "L5", "1A", ...
## $ country            <chr> "Aruba", "Afghanistan", "Africa", "Angola", "Alb...
## $ capital_city       <chr> "Oranjestad", "Kabul", NA, "Luanda", "Tirane", "...
## $ longitude          <dbl> -70.02, 69.18, NA, 13.24, 19.82, 1.52, NA, NA, 5...
## $ latitude           <dbl> 12.52, 34.52, NA, -8.81, 41.33, 42.51, NA, NA, 2...
## $ region_iso3c       <chr> "LCN", "SAS", NA, "SSF", "ECS", "ECS", NA, NA, "...
## $ region_iso2c       <chr> "ZJ", "8S", NA, "ZG", "Z7", "Z7", NA, NA, "ZQ", ...
## $ region             <chr> "Latin America & Caribbean", "South Asia", "Aggr...
## $ admin_region_iso3c <chr> NA, "SAS", NA, "SSA", "ECA", NA, NA, NA, NA, "LA...
## $ admin_region_iso2c <chr> NA, "8S", NA, "ZF", "7E", NA, NA, NA, NA, "XJ", ...
## $ admin_region       <chr> NA, "South Asia", NA, "Sub-Saharan Africa (exclu...
## $ income_level_iso3c <chr> "HIC", "LIC", NA, "LMC", "UMC", "HIC", NA, NA, "...
## $ income_level_iso2c <chr> "XD", "XM", NA, "XN", "XT", "XD", NA, NA, "XD", ...
## $ income_level       <chr> "High income", "Low income", "Aggregates", "Lowe...
## $ lending_type_iso3c <chr> "LNX", "IDX", NA, "IBD", "IBD", "LNX", NA, NA, "...
## $ lending_type_iso2c <chr> "XX", "XI", NA, "XF", "XF", "XX", NA, NA, "XX", ...
## $ lending_type       <chr> "Not classified", "IDA", "Aggregates", "IBRD", "...

The dataframe contains the ISO country codes, the country name, its capital with its longitude and latitude, the region the country is in and the region’s ISO codes, as well as the country’s classification by income level and by lending type.

We can merge the dataframes pop_growth_data and countries with a left join, so we have a dataframe that contains data from both of them

countries <-  wb_cachelist$countries


# Merge with a left_join (a) country data with (b) population growth data
pop_growth <- 
  left_join(countries, pop_growth_data, by="iso3c") %>% 
              mutate(year = as.integer(date)) %>%  # store the year as an integer rather than a double
              select(iso3c, country.x, region, income_level, value, year) %>% 
              na.omit()
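The column name country.x in the code above is a by-product of the join: both dataframes have a column called country, and since we only join by iso3c, dplyr keeps both copies and adds the suffixes .x and .y to tell them apart. A toy sketch with made-up tables (the names country_info and growth, and all values, are hypothetical) shows the effect:

```r
library(dplyr)

# Two made-up tables that share the key 'iso3c' and both have a 'country' column
country_info <- tibble(
  iso3c   = c("GRC", "GBR", "USA"),
  country = c("Greece", "United Kingdom", "United States"),
  region  = c("Europe & Central Asia", "Europe & Central Asia", "North America")
)

growth <- tibble(
  iso3c   = c("GRC", "GBR"),
  country = c("Greece", "United Kingdom"),
  value   = c(-0.2, 0.6)   # invented numbers, for illustration only
)

# A left join keeps every row of the first table; unmatched rows get NA
joined <- left_join(country_info, growth, by = "iso3c")

names(joined)     # 'country' appears twice, as country.x and country.y
na.omit(joined)   # drops the row with no matching growth value
```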

Let us calculate the average population growth for every country between 1970 and 2017 and plot it, coloured by region and faceted by income level.

average_pop_growth <- pop_growth %>% 
              dplyr::group_by(region, income_level, country.x, iso3c) %>% 
              summarise(average_growth = mean(value)) %>% 
              arrange(average_growth) %>% 
              ungroup()

ggplot(data = average_pop_growth, 
       aes(x = reorder(country.x, average_growth), 
           y = average_growth, 
           fill = region))+
  geom_col()+
  coord_flip()+
  theme_minimal(7)+
  expand_limits(y=c(-1,8))+
  facet_wrap(~income_level, nrow=3, scales="free")+
  labs(title = 'Average annual population growth (%), 1970-2017',
       x = "",
       y = "Average Annual Population Growth (in %)",
       caption = 'Source: Worldbank') +
  # theme(legend.position="none")+
  NULL

World Happiness: how does it correlate with various indicators

Data from the UN’s World Happiness Report is available at Kaggle. We have downloaded the 2015 report in a CSV file, and have a quick glimpse at its structure.

world_happiness_2015 <- read_csv(here::here("data", "world_happiness_2015.csv"))
glimpse(world_happiness_2015)

As you notice, some of the variable names include a space, like Happiness Rank, all start with a capital letter, etc. We will use janitor::clean_names() to clean the variable names, so they are easier to deal with.

library(janitor)

world_happiness_2015 <- world_happiness_2015 %>%
  clean_names()

glimpse(world_happiness_2015)

## Rows: 158
## Columns: 12
## $ country                     <chr> "Switzerland", "Iceland", "Denmark", "N...
## $ region                      <chr> "Western Europe", "Western Europe", "We...
## $ happiness_rank              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
## $ happiness_score             <dbl> 7.59, 7.56, 7.53, 7.52, 7.43, 7.41, 7.3...
## $ standard_error              <dbl> 0.0341, 0.0488, 0.0333, 0.0388, 0.0355,...
## $ economy_gdp_per_capita      <dbl> 1.397, 1.302, 1.325, 1.459, 1.326, 1.29...
## $ family                      <dbl> 1.350, 1.402, 1.361, 1.331, 1.323, 1.31...
## $ health_life_expectancy      <dbl> 0.941, 0.948, 0.875, 0.885, 0.906, 0.88...
## $ freedom                     <dbl> 0.666, 0.629, 0.649, 0.670, 0.633, 0.64...
## $ trust_government_corruption <dbl> 0.4198, 0.1414, 0.4836, 0.3650, 0.3296,...
## $ generosity                  <dbl> 0.2968, 0.4363, 0.3414, 0.3470, 0.4581,...
## $ dystopia_residual           <dbl> 2.52, 2.70, 2.49, 2.47, 2.45, 2.62, 2.4...

First, we can look at how happiness_score correlates with each of the variables the UN uses. We will use GGally::ggpairs() to get a combined correlation and scatterplot matrix. We do not want to include in our analysis the country name, its region, the happiness_rank, or the standard error associated with the estimate of the happiness score.

world_happiness_2015 %>% 
  select(-country, -region, -happiness_rank, -standard_error) %>% 
  GGally::ggpairs()
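If you only need the numbers rather than the plots, the correlation coefficients can also be computed with base R’s cor(). The sketch below uses a tiny made-up dataframe (toy, with invented values for five hypothetical countries) rather than the real report data:

```r
# Made-up scores for five hypothetical countries, for illustration only
toy <- data.frame(
  happiness_score = c(7.5, 6.8, 5.9, 4.6, 3.9),
  freedom         = c(0.66, 0.60, 0.45, 0.38, 0.25),
  generosity      = c(0.30, 0.45, 0.20, 0.25, 0.35)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Correlation between two specific variables;
# use = "complete.obs" ignores rows with missing values
cor(toy$happiness_score, toy$freedom, use = "complete.obs")
```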

We will now choose six (6) indicators from the World Bank data, download their values for 2015, and see how these correlate with the overall happiness score.

# Download data for the following indicators

indicators <- c("SE.PRM.NENR",     # School enrollment, primary (% net)
                "SP.DYN.LE00.IN",  # Life expectancy
                "SI.POV.DDAY",     # Extreme poverty (% earning less than $2/day)
                "EG.ELC.ACCS.ZS",  # Access to electricity
                "SI.POV.GINI",     # GINI Index
                "NY.GDP.PCAP.KD")  # GDP per capita


happiness_data_WB_long <- wb_data(country = "countries_only", 
                             indicator = indicators, 
                             start_date = 2015, 
                             end_date = 2015,
                             #since we have many indicators, we should get the data in long format
                             return_wide=FALSE) 

# look at the long dataframe
glimpse(happiness_data_WB_long)

## Rows: 1,302
## Columns: 11
## $ indicator_id <chr> "SE.PRM.NENR", "SE.PRM.NENR", "SE.PRM.NENR", "SE.PRM.N...
## $ indicator    <chr> "School enrollment, primary (% net)", "School enrollme...
## $ iso2c        <chr> "AF", "AL", "DZ", "AS", "AD", "AO", "AG", "AR", "AM", ...
## $ iso3c        <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", "ATG", "ARG"...
## $ country      <chr> "Afghanistan", "Albania", "Algeria", "American Samoa",...
## $ date         <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, ...
## $ value        <dbl> NA, 94.2, 97.5, NA, NA, NA, 94.2, 99.5, 92.7, NA, 97.0...
## $ unit         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ obs_status   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ footnote     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Natio...
## $ last_updated <date> 2020-08-18, 2020-08-18, 2020-08-18, 2020-08-18, 2020-...
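In the long format, each row holds one value for one country and one indicator. If you later need one column per indicator, tidyr::pivot_wider() can reshape the data. This is a sketch on a made-up long table (the country labels and values are invented; only the indicator codes come from the list above):

```r
library(dplyr)
library(tidyr)

# A made-up long table: one row per country and indicator
long <- tibble(
  country      = c("A", "A", "B", "B"),
  indicator_id = c("SP.DYN.LE00.IN", "SI.POV.GINI", "SP.DYN.LE00.IN", "SI.POV.GINI"),
  value        = c(80, 35, 70, NA)
)

# Spread the indicators into separate columns: one row per country
wide <- long %>%
  pivot_wider(names_from = indicator_id, values_from = value)

print(wide)
```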

In order to get the two dataframes to combine into one, they have to have a shared column/ variable. We will merge the two datasets with a left_join() by “country”, and glimpse the structure of the resulting dataframe.

# Merge with a left_join (a) happiness data with all indicators and (b) the 2015 World Happiness index 

happiness <- 
  left_join(happiness_data_WB_long, world_happiness_2015, by="country") 

glimpse(happiness)

## Rows: 1,302
## Columns: 22
## $ indicator_id                <chr> "SE.PRM.NENR", "SE.PRM.NENR", "SE.PRM.N...
## $ indicator                   <chr> "School enrollment, primary (% net)", "...
## $ iso2c                       <chr> "AF", "AL", "DZ", "AS", "AD", "AO", "AG...
## $ iso3c                       <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO...
## $ country                     <chr> "Afghanistan", "Albania", "Algeria", "A...
## $ date                        <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201...
## $ value                       <dbl> NA, 94.2, 97.5, NA, NA, NA, 94.2, 99.5,...
## $ unit                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ obs_status                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ footnote                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ last_updated                <date> 2020-08-18, 2020-08-18, 2020-08-18, 20...
## $ region                      <chr> "Southern Asia", "Central and Eastern E...
## $ happiness_rank              <dbl> 153, 95, 68, NA, NA, 137, NA, 30, 127, ...
## $ happiness_score             <dbl> 3.58, 4.96, 5.61, NA, NA, 4.03, NA, 6.5...
## $ standard_error              <dbl> 0.0308, 0.0501, 0.0510, NA, NA, 0.0476,...
## $ economy_gdp_per_capita      <dbl> 0.320, 0.879, 0.939, NA, NA, 0.758, NA,...
## $ family                      <dbl> 0.303, 0.804, 1.078, NA, NA, 0.860, NA,...
## $ health_life_expectancy      <dbl> 0.3034, 0.8133, 0.6177, NA, NA, 0.1668,...
## $ freedom                     <dbl> 0.2341, 0.3573, 0.2858, NA, NA, 0.1038,...
## $ trust_government_corruption <dbl> 0.09719, 0.06413, 0.17383, NA, NA, 0.07...
## $ generosity                  <dbl> 0.3651, 0.1427, 0.0782, NA, NA, 0.1234,...
## $ dystopia_residual           <dbl> 1.95, 1.90, 2.43, NA, NA, 1.95, NA, 2.8...
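Joining by country name only works when both sources spell the name in exactly the same way, which is why some rows above have NA for all the happiness variables. One way to check which names failed to match is an anti_join(), which returns the rows of the first table that have no partner in the second. The sketch below uses two tiny hand-made tables (wb_names and whr_names, with invented scores) to illustrate the idea:

```r
library(dplyr)

# Hand-made examples of how the two sources may label the same countries
wb_names <- tibble(
  country = c("Greece", "Russian Federation", "Korea, Rep.")
)

whr_names <- tibble(
  country         = c("Greece", "Russia", "South Korea"),
  happiness_score = c(4.9, 5.7, 6.0)   # invented values, for illustration only
)

# Rows of wb_names whose country name does not appear in whr_names
unmatched <- anti_join(wb_names, whr_names, by = "country")

print(unmatched)
```

Any names listed in unmatched would need to be recoded in one of the two tables before the join, or the join could be done on an ISO code if both tables have one.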

We can create a histogram of happiness_score by region

ggplot(data = happiness, aes(x = happiness_score , fill=region))+
  geom_histogram()+
  theme_minimal()+
  facet_wrap(~region,nrow=5) +
  labs(title = '2015 World Happiness',
       x = "Happiness Score",
       y = "Count",
       caption = 'Source: Worldbank') +
  theme(legend.position="none")

We can also create a scatterplot of happiness_score against all the indicators we have downloaded.

ggplot(data = happiness, aes(x = value, y = happiness_score , colour=indicator))+
  geom_point()+
  geom_smooth(se=FALSE)+
  theme_minimal()+
  facet_wrap(~indicator,scales="free") +
  labs(title = '2015 World Happiness',
       x = "",
       y = "Total Happiness Score",
       caption = 'Source: Worldbank') +
  theme(legend.position="none")

Acknowledgments

This page is derived in part from Introduction to the wbstats R-package by Jesse Piburn.

Installing the tidyverse

Sat, 25 Jul 2020 00:00:00 +0000

Installing the `tidyverse`

R packages are easy to install with RStudio. Select the packages panel, click on “Install,” type the name of the package you want to install, and press enter.

This can sometimes be tedious when you’re installing lots of packages, though. The tidyverse, for instance, consists of dozens of packages that all work together. Rather than install each package individually, you can install tidyverse, a meta-package if you wish, and get them all at the same time.

Go to the packages panel in RStudio, click on “Install,” type “tidyverse”, and press enter. You’ll see a bunch of output in the RStudio console as all the tidyverse packages are installed.

RStudio generates a line of code for you and runs it: install.packages("tidyverse"). You can also just paste and run this instead of using the packages panel.

# install the major packages from the tidyverse
install.packages("tidyverse")

This will take a while as tidyverse is a collection of packages and R will have to install all dependencies.
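Because installation can take a while, it can be worth checking first whether a package is already there. The following is a minimal sketch using base R only; the vector wanted is just an example, and the install line is left commented out so that nothing is installed by accident.

```r
# Packages we would like to have; edit this vector as needed
wanted <- c("tidyverse", "stats")

# installed.packages() returns a matrix with one row per installed package
already_installed <- rownames(installed.packages())

# Keep only the packages that are not installed yet
to_install <- setdiff(wanted, already_installed)

print(to_install)

# Uncomment the next line to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```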

Installing the `tidyverse` if you have a Mac

Unfortunately, installing the tidyverse isn’t always a straightforward task on macOS 10.14 Mojave, which was released on September 24, 2018.

To solve issues that may arise with missing xml2 library, please do the following:

Open Terminal (the tab right next to Console)
Type

xcode-select --install

Be careful as you do need two (2) dashes before the install. A software update popup window should appear that will ask if you want to install command line developer tools. Click on “Install” (you don’t need to click on “Get Xcode”)

Go to https://brew.sh and copy the long command under “Install Homebrew” (starts with /usr/bin/ruby -e "$(curl -fsSL.), paste it into Terminal, and press enter.

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

This installs Homebrew, which is special software that lets you install Unix-y programs from the terminal.

Type the following command line in Terminal to install libxml2

brew install libxml2

Then, within RStudio, type

install.packages("xml2")

Finally, you can now proceed with the installation of the tidyverse

install.packages("tidyverse")

Installing further packages

Once the tidyverse collection of packages installs and you get back to the R prompt >, you can install a series of packages that will be useful later in the course. You can copy/paste the code below; please note that this will take quite a while, so grab a coffee.

# install these packages as well
list_of_packages <- c(
  "moderndive",   # https://www.moderndive.com/
  "DT",           # Interactive HTML tables (R interface to the DataTables JavaScript library)
  "unvotes",      # How countries have voted in UN resolutions
  "gridExtra",    # Miscellaneous Functions for "Grid" Graphics
  "GGally",       # Allows us to create a correlations/scatterplots matrix 
  "tidyquant",    # Download and manipulate financial data
  "wbstats",      # Download World Bank Data
  "eurostat",     # Download data from Eurostat
  "fpp2",         # Time Series and Forecasting functions, with data too 
  "car",          # Applied Regression- allows to calculate VIF, Variance Inflation Factor
  "gapminder",    # Data on life expectancy, GDP/capita, and population by country and year
  "nycflights13", # Data on all domestic flights through NYC's 3 airports (JFK, EWR, LGA) in 2013
  "fivethirtyeight", #Data used in articles that appeared in the fivethirtyeight.com website
  "corrr",        # correlation in R
  "plotly",       # interactive visualizations
  "sf",           # tidy geo-computing
  "cowplot",      # ggplot multiple figures addon
  "coefplot",     # plot coefficients from fitted models
  "interplot",    # plot effects of variables in interaction terms
  "scales",       # scale functions for visualisations 
  "ggridges",     # ridgeline plots in ggplot2
  "skimr",        # nice dataframe summaries
  "leaflet",      # interactive maps
  "ggrepel",      # geoms for ggplot2 to repel overlapping text labels
  "viridis",      # Colour Maps
  "rvest",        # scrape webpages
  "usethis",      # automation of package and project setup
  "remotes",      # installing packages from Github
  "tidytext",     # text mining
  "here",         # finding your files 
  "mosaic"        # summary stats, using mosaic::favstats()
)

install.packages(list_of_packages, dependencies=TRUE, repos = "https://cran.rstudio.com/")
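If some of these packages are already on your machine, you may prefer to install only the missing ones. The sketch below is one way to do that; missing_packages() is a hypothetical helper written for this illustration (it is not part of any package), and the names in wanted are placeholders.

```r
# Hypothetical helper: which of `pkgs` are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# "stats" and "utils" ship with R; the third name is a deliberate placeholder
wanted <- c("stats", "utils", "aPackageThatDoesNotExist")
to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to install only what is missing:
# install.packages(to_install, dependencies = TRUE)
```

In practice you would pass list_of_packages instead of wanted.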

Install from Github

Most of the time the packages that you’ll want to install have been made available on CRAN, the Comprehensive R Archive Network, so you use the install.packages("package_name") function. Sometimes people write packages that are not submitted to CRAN, and sometimes you might want to try out a package that is currently under development. In these situations, people who write packages will often make them available on GitHub. We can install packages directly from Github, using the remotes package.

The first thing you need to do is install remotes, which is easy because that package is available on CRAN and hopefully you installed it with all packages listed earlier. If not,

install.packages("remotes")

Once you install remotes, you can either load it by typing library(remotes) or, as in the example below, call its functions with the remotes:: prefix. Then, you can use the install_github command to install a package directly from a GitHub repository. For example, there’s an R data package featuring every Lego set from 1970 to 2015 put together by Sean Kross.

remotes::install_github("seankross/lego") #install the lego package directly from Github

R fetches and installs the package from Github, and we now have the new lego package to play with. To verify that everything worked properly, let’s load the lego package and look at its legosets dataframe:

library(lego)     #load the lego package into the computer's memory

legosets          #view the legosets dataframe

## # A tibble: 6,172 x 14
##    Item_Number Name    Year Theme Subtheme Pieces Minifigures Image_URL GBP_MSRP
##    <chr>       <chr>  <int> <chr> <chr>     <int>       <int> <chr>        <dbl>
##  1 10246       Detec~  2015 Adva~ "Modula~   2262           6 http://i~   133.  
##  2 10247       Ferri~  2015 Adva~ "Fairgr~   2464          10 http://i~   150.  
##  3 10248       Ferra~  2015 Adva~ "Vehicl~   1158          NA http://i~    70.0 
##  4 10249       Toy S~  2015 Adva~ "Winter~    898          NA http://i~    60.0 
##  5 10581       Ducks   2015 Duplo "Forest~     13           1 http://i~     9.99
##  6 10582       Anima~  2015 Duplo "Forest~     39           2 http://i~    17.0 
##  7 10583       Fishi~  2015 Duplo "Forest~     32           2 http://i~    20.0 
##  8 10584       Forest  2015 Duplo "Forest~    105           3 http://i~    50.0 
##  9 10585       Mom a~  2015 Duplo ""           13           2 http://i~     8.99
## 10 10586       Ice C~  2015 Duplo ""           11           2 http://i~    13.0 
## # ... with 6,162 more rows, and 5 more variables: USD_MSRP <dbl>,
## #   CAD_MSRP <dbl>, EUR_MSRP <dbl>, Packaging <chr>, Availability <chr>

glimpse(legosets) #examine the structure of the dataframe- variables, observations, type of variables, etc.

## Rows: 6,172
## Columns: 14
## $ Item_Number  <chr> "10246", "10247", "10248", "10249", "10581", "10582", "10~
## $ Name         <chr> "Detective's Office", "Ferris Wheel", "Ferrari F40", "Toy~
## $ Year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ Theme        <chr> "Advanced Models", "Advanced Models", "Advanced Models", ~
## $ Subtheme     <chr> "Modular Buildings", "Fairground", "Vehicles", "Winter Vi~
## $ Pieces       <int> 2262, 2464, 1158, 898, 13, 39, 32, 105, 13, 11, 52, 13, 2~
## $ Minifigures  <int> 6, 10, NA, NA, 1, 2, 2, 3, 2, 2, 3, 1, NA, NA, NA, NA, 1,~
## $ Image_URL    <chr> "http://images.brickset.com/sets/images/10246-1.jpg", "ht~
## $ GBP_MSRP     <dbl> 132.99, 149.99, 69.99, 59.99, 9.99, 16.99, 19.99, 49.99, ~
## $ USD_MSRP     <dbl> 159.99, 199.99, 99.99, 79.99, 9.99, 19.99, 24.99, 59.99, ~
## $ CAD_MSRP     <dbl> 199.99, 229.99, 119.99, NA, 12.99, 24.99, 29.99, 69.99, 1~
## $ EUR_MSRP     <dbl> 149.99, 179.99, 89.99, 69.99, 9.99, 19.99, 24.99, 59.99, ~
## $ Packaging    <chr> "Box", "Box", "Box", "Box", "Box", "Box", "Box", "Box", "~
## $ Availability <chr> "Retail - limited", "Retail - limited", "LEGO exclusive",~

The dataframe has 14 variables (or columns) and 6,172 observations (rows). Besides the item number, year, theme/subtheme and the number of pieces and minifigures contained in each Lego box, we also have the recommended retail prices in GBP, USD, CAD, and EUR. While we are at it, let us have a quick look at how Lego prices (in GBP) have evolved over the years.

library(tidyverse) # provides %>%, filter(), group_by(), summarise(), and ggplot()

avg_price_per_year <- legosets %>% # create "avg_price_per_year" by taking legosets, and then
  filter(!is.na(GBP_MSRP)) %>%    # filter out entries with no GBP prices, GBP_MSRP, and then
  group_by(Year) %>%              # group prices by year
  summarise(Price = mean(GBP_MSRP)) # create variable "Price" = yearly average of GBP_MSRP

ggplot(avg_price_per_year, 
       mapping = aes(x = Year, y = Price)) +  # time series plot: x=Year, y=Price
  geom_point(size = 0.5) +                    # simple scatterplot Y vs. X
  geom_line(size = 0.5) +                     # add the black line between points
  geom_smooth(se = FALSE) +                   # fit trend line,no error band around it "se = FALSE" 
  labs(x = "Year",   
       y = "Price (GBP)", 
       title = "Average price of LEGO sets",
       subtitle = "Amounts are reported in current GBP",
       caption = "Source: LEGO") +
  theme_bw()

There is a clear upward trend in average GBP prices.

And since we are talking about LEGOs, here is a fun application of creating LEGO mosaics from photos using R & the tidyverse

Updating packages

Every now and then the authors of packages release updated versions. The updated versions often add new functionality, fix bugs, and so on. It’s a good idea to update your packages periodically.

There’s an update.packages function, but it’s probably easier to stick with the RStudio tool. In the packages tab, click on the Update Packages button. This will bring up a window that looks like the one shown below:

In this window, each row refers to a package that needs to be updated. You can select which updates to install by checking the boxes on the left. If you feel lazy, click the Select All button, and then Install Updates. This might take a while to complete depending on how fast your internet connection is.

Updating R

About twice a year, a new version of R is released, and packages are updated to be compatible with it. A side effect is that when you update to the newest version of R, you lose all the packages that you have downloaded and installed. Unfortunately, you need to install the new versions of the packages, even though they will typically behave just like the old ones.
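One way to soften this is to save the names of your add-on packages before upgrading and reinstall them afterwards. A minimal sketch, assuming a file location of your choosing (the temporary file below is only a placeholder; use a permanent path in practice):

```r
# Before upgrading R: record the names of add-on packages.
# Packages that ship with R itself have a non-missing "Priority" entry.
ip <- installed.packages()
user_pkgs <- rownames(ip)[is.na(ip[, "Priority"])]

pkg_file <- tempfile(fileext = ".rds")  # placeholder path; pick a permanent location
saveRDS(user_pkgs, pkg_file)

# After upgrading R: read the list back and reinstall
restored <- readRDS(pkg_file)
# install.packages(restored)   # uncomment to reinstall everything on the list
```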

Install `tinytex`

When you knit to PDF, R uses a special scientific typesetting program named LaTeX (pronounced “lay-tek” or “lah-tex”; for goofy nerdy reasons, the x is technically the “ch” sound in “Bach”, but most people just say it as “k”—saying “layteks” is frowned on for whatever reason).

LaTeX makes pretty documents, but it’s a huge program—the macOS version, for instance, is nearly 4 GB! To make life easier, there’s an R package named tinytex that installs a minimal LaTeX program and that automatically deals with differences between macOS and Windows.

Here’s how to install tinytex so you can knit to pretty PDFs:

Use the Packages panel in RStudio to install tinytex like you did above with tidyverse. Alternatively, run install.packages("tinytex") in the console.
Run tinytex::install_tinytex() in the console.
Wait for a bit while R downloads and installs everything you need.
The end! You should now be able to knit to PDF.

¹ A universe of packages centered around tidy data, including ggplot2.

Using R Markdown

Sat, 25 Jul 2020 00:00:00 +0000

Reproducibility in scientific research

Reproducibility is the idea that data analyses, and more generally scientific claims, are published with their data and software code so that others may try to replicate the same work, get similar results, and build upon the works of others.

While this sounds obvious, it actually happens far less frequently than what it should.

For instance, scientists at the biotechnology company Amgen were unable to replicate the majority of published pre-clinical cancer research studies; in fact, only 6 out of 53 landmark results could be reproduced. Similarly, it has been argued that the great majority of preclinical results cannot be reproduced, with the annual cost of irreproducible preclinical research estimated at 28 billion USD.

“You are always working with at least one collaborator: Future you.”
– Hadley Wickham

Suppose that your colleague sends you an Excel file with an analysis she has undertaken. The Excel file is likely to contain the raw data, but also graphs, results, etc. that were generated from the data. If you have ever received such an Excel analysis file, it takes a long time to navigate around it and try to understand the logic used to arrive at the results.

Data analysts who implement reproducibility in their projects can quickly and easily reproduce the original results and trace back to determine how they were derived. Literate programming, an idea from Donald Knuth, is a technique for mixing written text, where you write notes explaining what you did and why, and chunks of code that produce your graphs, analyses, etc.

This makes documentation of code easier, enables verification and replication, and allows the analyst to precisely replicate her analysis. This is extremely important when revisiting work done months later, because it’s highly likely you won’t remember how all the code/analysis works together when completing your work.

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.
– Donald E. Knuth (1984), Literate Programming

Reproducibility is also key for communicating findings to others and to decision makers; it allows them to follow your logic and verify your results, assess your assumptions, and understand how your answers were formed rather than solely relying on your claimed results. In the data science framework employed in R for Data Science, reproducibility is infused throughout the entire workflow.

Your reproducibility goals should be:

Are the results (tables and figures) reproducible from the code and data?
Does the code actually do what you think it does?
Is the code well documented so someone else can follow your work?
In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
Can the code be used for other, or newer, data?
Can you generalise the code to do other things?

R Markdown = Markdown + R Code

R Markdown is regular Markdown with R code and output sprinkled in. You can do everything you can with regular Markdown, but you can incorporate graphs, tables, and other R output directly in your document. You can create HTML, PDF, and Word documents, PowerPoint and HTML presentations, websites, books, and even interactive dashboards with R Markdown. This whole course website is created with R Markdown (and a package named blogdown).

rmarkdown and knitr is a powerful combination of packages for literate programming, reproducible analysis, and document generation, which can:

Combine R code and Markdown syntax
Produce documents in PDF, Microsoft Word, and various types of HTML documents
In HTML format, it can incorporate “extras” like interactive graphics

An R Markdown file is a plain text file that uses the extension .Rmd and contains three (3) major components:

A YAML header surrounded by ---s. This is the metadata of the document and it tells you how it is formed - what the title is, the author, date, output, and other control information.
Chunks of R code surrounded by ```
Text mixed with simple text formatting using the Markdown syntax

Code chunks are interspersed with text throughout the document. To complete the document, you “Knit” or “render” the document. Most of you probably knit the document by clicking the “Knit” button in the script editor panel. You can also do this programmatically from the console by running the command rmarkdown::render("example.Rmd").
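As an illustration of knitting from the console, the sketch below writes a tiny .Rmd file with the three components described above and renders it. The file name, its contents, and the temporary folder are all placeholders; rendering is attempted only if rmarkdown and pandoc are available on your machine.

```r
fence <- strrep("`", 3)  # three backticks, built here to keep this example readable

# Write a minimal R Markdown file: YAML header, Markdown text, one code chunk
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(c(
  "---",
  'title: "A minimal example"',
  "output: html_document",
  "---",
  "",
  "Some text written in Markdown.",
  "",
  paste0(fence, "{r cars-summary}"),
  "summary(cars)",
  fence
), rmd_path)

# Knit it programmatically; render() returns the path of the output file
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
  print(out)
}
```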

When you knit the document you send your .Rmd file to knitr, a package for R that executes all the code chunks and creates a second markdown document (.md). That markdown document is then passed onto pandoc, a document rendering software program independent from R. Pandoc allows users to convert back and forth between many different document formats such as HTML, $\LaTeX$, Microsoft Word, etc. By splitting the workflow up, you can convert your R Markdown document into a wide range of output formats.

The documentation for R Markdown is extremely comprehensive, and their tutorials and cheatsheets are excellent—rely on those.

Here are the most important things you’ll need to know about R Markdown in this class:

Key terms

Document: A Markdown file where you type stuff
Chunk: A piece of R code that is included in your document. It looks like this:
```{r chunk_name}
# Code goes here
```
There must be an empty line before and after the chunk. The final three backticks must be the only thing on the line—if you add more text, or if you forget to add the backticks, or accidentally delete the backticks, your document will not knit correctly.
Knit: When you “knit” a document, R runs each of the chunks sequentially and converts the output of each chunk into Markdown. R then runs the knitted document through pandoc to convert it to HTML or PDF or Word (or whatever output you’ve selected).

You can knit by clicking on the “Knit” button at the top of the editor window, or by pressing ⌘⇧K on macOS or control + shift + K on Windows.

Add chunks

There are three ways to insert chunks:

Press ⌘⌥I on macOS or control + alt + I on Windows
Click on the “Insert” button at the top of the editor window
Manually type all the backticks and curly braces (don’t do this)

Chunk names

You can add names to chunks to make it easier to navigate your document. If you click on the little dropdown menu at the bottom of your editor in RStudio, you can see a table of contents that shows all the headings and chunks. If you name chunks, they’ll appear in the list. If you don’t include a name, the chunk will still show up, but you won’t know what it does.

To add a name, include it immediately after the {r in the first line of the chunk. Names cannot contain spaces, but they can contain underscores and dashes. All chunk names in your document must be unique.

A word of caution: If you use the same chunk name more than once, knitr will give you an error message and refuse to knit your Rmd document. So if you copy/paste a named chunk, make sure you give the copy a new, unique name.

```{r name-of-this-chunk}
# Code goes here
```
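If you want to check for repeated names before knitting, a rough sketch along these lines may help. chunk_labels() is a hypothetical helper written for this illustration (it is not a knitr function), it only understands the simple header forms shown in this section, and the example lines stand in for the contents of a real .Rmd file.

```r
# Hypothetical helper: pull chunk names out of the lines of an .Rmd file
chunk_labels <- function(lines) {
  fence <- strrep("`", 3)
  headers <- grep(paste0("^", fence, "\\{r[ ,}]"), lines, value = TRUE)
  inside <- sub(paste0("^", fence, "\\{r[ ,]*"), "", headers)  # drop the opening part
  inside <- sub("\\}.*$", "", inside)                          # drop the closing brace
  first <- trimws(sub(",.*$", "", inside))                     # first comma-separated field
  first[first != "" & !grepl("=", first)]                      # "x=y" is an option, not a name
}

fence <- strrep("`", 3)
example_lines <- c(
  paste0(fence, "{r load-data}"),
  "x <- 1",
  fence,
  paste0(fence, "{r load-data, echo=FALSE}"),  # same name again: knitr would stop here
  "y <- 2",
  fence,
  paste0(fence, "{r, message=FALSE}"),         # unnamed chunk, ignored
  "z <- 3",
  fence
)

labels <- chunk_labels(example_lines)
duplicates <- unique(labels[duplicated(labels)])
print(duplicates)
```

For a real document you would pass readLines("your_file.Rmd") instead of example_lines.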

Chunk options

There are a bunch of different options you can set for each chunk. You can see a complete list in the RMarkdown Reference Guide or at knitr’s website.

Options go inside the {r} section of the chunk:

```{r name-of-this-chunk, warning=FALSE, message=FALSE}
# Code goes here
```

The most common chunk options are these:

fig.width=5 and fig.height=3 (or whatever number you want): Set the dimensions for figures
echo=FALSE: The code is not shown in the final document, but the results are.
include=FALSE: The chunk still runs, but the code and results are not included in the final document
message=FALSE: Any messages that R generates (like all the notes that appear after you load a package) are omitted
warning=FALSE: Any warnings that R generates are omitted
eval=FALSE: The code is not evaluated. I use this in my notes for class when I want to show how to write a specific function but don’t need to actually use it.
error=TRUE: The document continues knitting and rendering even if the code generates a fatal error. If you’re debugging your code, you might want to use this option. However, for the final version of your work, you do not want to allow errors to pass through unnoticed.
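Rather than repeating the same options in every chunk, you can set defaults once near the top of the document with knitr::opts_chunk$set(); individual chunks can still override them. A minimal sketch (the particular values are only an example):

```r
# Typically placed in a first "setup" chunk that itself uses include=FALSE.
# The requireNamespace() guard only lets this snippet run where knitr is absent.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = FALSE,      # hide code by default
    message = FALSE,   # hide package start-up messages
    warning = FALSE,   # hide warnings
    fig.width = 5,
    fig.height = 3
  )
  # Inspect one of the defaults that is now in force
  print(knitr::opts_chunk$get("fig.width"))
}
```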

You can also set chunk options by clicking on the little gear icon in the top right corner of any chunk.

Inline chunks

You can also include R output directly in your text, which is really helpful if you want to report numbers from your analysis. To do this, use `r r_code_here`.

It’s generally easiest to calculate numbers in a regular chunk beforehand and then use an inline chunk to display the value in your text. For instance, this document…

```{r find-avg-mpg, echo=FALSE}
avg_mpg <- mean(mtcars$mpg)
```

The average fuel efficiency for cars from 1974 was `r round(avg_mpg, 1)` miles per gallon.

… would knit into this:

The average fuel efficiency for cars from 1974 was 20.1 miles per gallon.
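An inline chunk simply evaluates its expression and places the result in the sentence, so you can check the value in the console before knitting. A small sketch using the same mtcars example:

```r
avg_mpg <- mean(mtcars$mpg)   # mtcars ships with R
print(round(avg_mpg, 1))      # the value the inline chunk inserts

# The same sentence assembled by hand
sentence <- paste(
  "The average fuel efficiency for cars from 1974 was",
  round(avg_mpg, 1),
  "miles per gallon."
)
print(sentence)
```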

Caching

By default, every time you knit a document R starts anew and no previous results are saved.

If you have code chunks that run computationally intensive tasks, like running a ggpairs() correlation/scatterplot matrix on a large dataset, you might want to store these results to save time. If you use cache = TRUE, R will do exactly this. The output of the chunk will be saved to a specially named file on disk. From then on, every time you knit the document the cached results will be used instead of running the code afresh, as long as the chunk's code and options have not changed. Note that the cache does not notice changes elsewhere, for example in the data the chunk reads.
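To make the idea concrete, here is a hand-rolled analogue in plain R. This is not how knitr implements caching internally; it only illustrates the principle of "compute once, save to disk, reuse". cached() and slow_summary() are hypothetical names, and the temporary file is a placeholder.

```r
# Hypothetical helper: compute once, then reuse the saved result
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))      # reuse the stored result
  }
  result <- compute()          # run the slow code only when no cache exists
  saveRDS(result, path)
  result
}

cache_file <- tempfile(fileext = ".rds")  # placeholder location
runs <- 0
slow_summary <- function() {
  runs <<- runs + 1            # count how often the slow code actually runs
  colMeans(mtcars)
}

first  <- cached(cache_file, slow_summary)
second <- cached(cache_file, slow_summary)
print(runs)                    # the slow code ran only for the first call
```

Deleting the file forces a fresh computation, much like clearing the cache folder that knitr creates next to your document.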

Output formats

You can specify what kind of document you create when you knit in the YAML front matter.

title: "My document"
output:
  html_document: default
  pdf_document: default
  word_document: default

You can also click on the down arrow on the “Knit” button to choose the output and generate the appropriate YAML. If you click on the gear icon next to the “Knit” button and choose “Output options”, you change settings for each specific output type, like default figure dimensions or whether or not a table of contents is included.

The first output type listed under output: will be what is generated when you click on the “Knit” button or press the keyboard shortcut (⌘⇧K on macOS; control + shift + K on Windows). If you choose a different output with the “Knit” button menu, that output will be moved to the top of the output section.

The indentation of the YAML section matters, especially when you have settings nested under each output type. Here’s what a typical output section might look like:

---
title: "My document"
author: "My name"
date: "January 13, 2020"
output: 
  html_document: 
    toc: yes
    fig_caption: yes
    fig_height: 8
    fig_width: 10
  pdf_document: 
    latex_engine: xelatex  # More modern PDF typesetting engine
    toc: yes
  word_document: 
    toc: yes
    fig_caption: yes
    fig_height: 4
    fig_width: 5
---

Each output format has various options to customize the appearance of the final document. One option for HTML documents is to add a table of contents through the toc option. To add any option for an output format, just add it in a hierarchical format like this:

---
title: "My report"
author: "My Name"
date: 2020-07-26
output:  
  html_document:
    toc: true
    toc_depth: 2
---

You can explicitly set the number of levels included in the table of contents with toc_depth (the default is 3).

Appearance and style

There are several options that control the visual appearance of HTML documents.

theme specifies the Bootstrap theme to use for the page (themes are drawn from the Bootswatch theme library). Valid themes include default, cerulean, journal, flatly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, and yeti.
highlight specifies the syntax highlighting style for code chunks. Supported styles include default, tango, pygments, kate, monochrome, espresso, zenburn, haddock, and textmate.
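Both options are nested under html_document, like the table of contents settings earlier. A short sketch, with flatly and tango picked arbitrarily from the lists above:

```yaml
---
title: "My report"
output:
  html_document:
    theme: flatly      # Bootswatch theme for the page
    highlight: tango   # syntax highlighting style for code chunks
---
```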

Other references

Eurostat Data

Fri, 02 Oct 2020 00:00:00 +0000

Eurostat Data with the `eurostat` package

The eurostat package provides access to well over 9000 datasets from Eurostat. Finding the correct dataset may seem a challenging task, but you are essentially looking for the code that identifies the dataset. We can get a table of contents, namely all of the codes contained in the Eurostat database.

library(tidyverse) # for %>%, filter(), and ggplot()
library(knitr)     # for kable()
library(eurostat)
library(fpp2) # for time series decomposition
library(seasonal)
library(tmap) #mapping eurostat data

# Get Eurostat data listing
# Function get_eurostat_toc() downloads a table of contents of eurostat datasets. 
# The values in column ‘code’ should be used to download a selected dataset.
toc <- get_eurostat_toc()

# Check the first 20 rows 
head(toc, 20) %>% 
  kable()

title	code	type	last update of data	last table structure change	data start	data end	values
Database by themes	data	folder	NA	NA	NA	NA	NA
General and regional statistics	general	folder	NA	NA	NA	NA	NA
European and national indicators for short-term analysis	euroind	folder	NA	NA	NA	NA	NA
Business and consumer surveys (source: DG ECFIN)	ei_bcs	folder	NA	NA	NA	NA	NA
Consumer surveys (source: DG ECFIN)	ei_bcs_cs	folder	NA	NA	NA	NA	NA
Consumers - monthly data	ei_bsco_m	dataset	29.09.2020	29.09.2020	1980M01	2020M09	NA
Consumers - quarterly data	ei_bsco_q	dataset	29.09.2020	30.07.2020	1990Q1	2020Q3	NA
Business surveys - NACE Rev. 2 activity (source: DG ECFIN)	ei_bcs_bs	folder	NA	NA	NA	NA	NA
Industry - monthly data	ei_bsin_m_r2	dataset	29.09.2020	29.09.2020	1980M01	2020M09	NA
Industry - quarterly data	ei_bsin_q_r2	dataset	29.09.2020	30.07.2020	1980Q1	2020Q3	NA
Construction - monthly data	ei_bsbu_m_r2	dataset	29.09.2020	29.09.2020	1980M01	2020M09	NA
Construction - quarterly data	ei_bsbu_q_r2	dataset	29.09.2020	30.07.2020	1981Q1	2020Q3	NA
Retail sale - monthly data	ei_bsrt_m_r2	dataset	29.09.2020	29.09.2020	1984M01	2020M09	NA
Sentiment indicators - monthly data	ei_bssi_m_r2	dataset	29.09.2020	29.09.2020	1980M01	2020M09	NA
Services - monthly data	ei_bsse_m_r2	dataset	29.09.2020	29.09.2020	1988M01	2020M09	NA
Services - quarterly data	ei_bsse_q_r2	dataset	29.09.2020	30.07.2020	2001Q2	2020Q3	NA
Euro-zone Business Climate Indicator - monthly data	ei_bsci_m_r2	dataset	29.09.2020	29.09.2020	1985M01	2020M09	NA
Financial services - monthly data	ei_bsfs_m	dataset	29.09.2020	29.09.2020	2006M04	2020M09	NA
Financial services - quarterly data	ei_bsfs_q	dataset	29.09.2020	30.07.2020	2007Q3	2020Q3	NA
Employment expectations indicator	ei_bsee_m_r2	dataset	29.09.2020	29.09.2020	1980M01	2020M09	NA
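With thousands of rows in the table of contents, you will usually filter it by title to find the code you need. The sketch below shows the idea on a hand-typed excerpt of four rows from the table above, so that it runs without a download; with the real table you would filter toc in the same way. The eurostat package also provides search_eurostat() for this purpose.

```r
# A few rows copied from the table of contents shown above
toc_excerpt <- data.frame(
  title = c("Consumers - monthly data",
            "Industry - monthly data",
            "Construction - quarterly data",
            "Services - monthly data"),
  code  = c("ei_bsco_m", "ei_bsin_m_r2", "ei_bsbu_q_r2", "ei_bsse_m_r2"),
  type  = "dataset",
  stringsAsFactors = FALSE
)

# Keep the rows whose title mentions "monthly" (case-insensitive)
monthly <- toc_excerpt[grepl("monthly", toc_excerpt$title, ignore.case = TRUE), ]
print(monthly$code)
```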

House Price Index (HPI)

The Eurostat House Price Index (HPI) measures price changes of all residential properties purchased by households (flats, detached houses, terraced houses, etc.), both new and existing, independently of their final use and their previous owners. First, we note that the code id for this dataset is teicp270. Once we know the relevant code id, we can download Eurostat data using the get_eurostat(id) function.

hpi <- get_eurostat(id="teicp270")
glimpse(hpi)

## Rows: 1,257
## Columns: 5
## $ indic  <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL...
## $ unit   <chr> "I15_NSA", "I15_NSA", "I15_NSA", "I15_NSA", "I15_NSA", "I15_...
## $ geo    <chr> "AT", "BE", "BG", "CY", "CZ", "DE", "DK", "EA", "EA19", "EE"...
## $ time   <date> 2017-04-01, 2017-04-01, 2017-04-01, 2017-04-01, 2017-04-01,...
## $ values <dbl> 114.2, 104.7, 115.4, 102.7, 119.1, 113.1, 110.5, 107.8, 107....

head(hpi,40) %>% 
  kable()

indic	unit	geo	time	values
TOTAL	I15_NSA	AT	2017-04-01	114.2
TOTAL	I15_NSA	BE	2017-04-01	104.7
TOTAL	I15_NSA	BG	2017-04-01	115.4
TOTAL	I15_NSA	CY	2017-04-01	102.7
TOTAL	I15_NSA	CZ	2017-04-01	119.1
TOTAL	I15_NSA	DE	2017-04-01	113.1
TOTAL	I15_NSA	DK	2017-04-01	110.5
TOTAL	I15_NSA	EA	2017-04-01	107.8
TOTAL	I15_NSA	EA19	2017-04-01	107.8
TOTAL	I15_NSA	EE	2017-04-01	108.4
TOTAL	I15_NSA	ES	2017-04-01	110.4
TOTAL	I15_NSA	EU	2017-04-01	109.0
TOTAL	I15_NSA	EU27_2020	2017-04-01	108.7
TOTAL	I15_NSA	EU28	2017-04-01	109.0
TOTAL	I15_NSA	FI	2017-04-01	103.0
TOTAL	I15_NSA	FR	2017-04-01	103.4
TOTAL	I15_NSA	HR	2017-04-01	104.5
TOTAL	I15_NSA	HU	2017-04-01	125.5
TOTAL	I15_NSA	IE	2017-04-01	115.9
TOTAL	I15_NSA	IS	2017-04-01	130.1
TOTAL	I15_NSA	IT	2017-04-01	99.6
TOTAL	I15_NSA	LT	2017-04-01	114.5
TOTAL	I15_NSA	LU	2017-04-01	112.2
TOTAL	I15_NSA	LV	2017-04-01	119.5
TOTAL	I15_NSA	MT	2017-04-01	108.9
TOTAL	I15_NSA	NL	2017-04-01	111.4
TOTAL	I15_NSA	NO	2017-04-01	115.5
TOTAL	I15_NSA	PL	2017-04-01	105.4
TOTAL	I15_NSA	PT	2017-04-01	115.5
TOTAL	I15_NSA	RO	2017-04-01	114.3
TOTAL	I15_NSA	SE	2017-04-01	116.0
TOTAL	I15_NSA	SI	2017-04-01	111.4
TOTAL	I15_NSA	SK	2017-04-01	113.1
TOTAL	I15_NSA	TR	2017-04-01	124.3
TOTAL	I15_NSA	UK	2017-04-01	111.2
TOTAL	PCH_Q1_NSA	AT	2017-04-01	2.4
TOTAL	PCH_Q1_NSA	BE	2017-04-01	-0.3
TOTAL	PCH_Q1_NSA	BG	2017-04-01	2.4
TOTAL	PCH_Q1_NSA	CY	2017-04-01	3.1
TOTAL	PCH_Q1_NSA	CZ	2017-04-01	2.5

Typically, the downloaded data has codes and abbreviations for all of the variables, but we can use label_eurostat to get a more verbose description.

house_price_index_data <-  hpi %>% 
  label_eurostat()

head(house_price_index_data,40) %>% 
  kable()

indic	unit	geo	time	values
Total	Index, 2015=100 (NSA)	Austria	2017-04-01	114.2
Total	Index, 2015=100 (NSA)	Belgium	2017-04-01	104.7
Total	Index, 2015=100 (NSA)	Bulgaria	2017-04-01	115.4
Total	Index, 2015=100 (NSA)	Cyprus	2017-04-01	102.7
Total	Index, 2015=100 (NSA)	Czechia	2017-04-01	119.1
Total	Index, 2015=100 (NSA)	Germany (until 1990 former territory of the FRG)	2017-04-01	113.1
Total	Index, 2015=100 (NSA)	Denmark	2017-04-01	110.5
Total	Index, 2015=100 (NSA)	Euro area (EA11-1999, EA12-2001, EA13-2007, EA15-2008, EA16-2009, EA17-2011, EA18-2014, EA19-2015)	2017-04-01	107.8
Total	Index, 2015=100 (NSA)	Euro area - 19 countries (from 2015)	2017-04-01	107.8
Total	Index, 2015=100 (NSA)	Estonia	2017-04-01	108.4
Total	Index, 2015=100 (NSA)	Spain	2017-04-01	110.4
Total	Index, 2015=100 (NSA)	European Union (EU6-1958, EU9-1973, EU10-1981, EU12-1986, EU15-1995, EU25-2004, EU27-2007, EU28-2013, EU27-2020)	2017-04-01	109.0
Total	Index, 2015=100 (NSA)	European Union - 27 countries (from 2020)	2017-04-01	108.7
Total	Index, 2015=100 (NSA)	European Union - 28 countries (2013-2020)	2017-04-01	109.0
Total	Index, 2015=100 (NSA)	Finland	2017-04-01	103.0
Total	Index, 2015=100 (NSA)	France	2017-04-01	103.4
Total	Index, 2015=100 (NSA)	Croatia	2017-04-01	104.5
Total	Index, 2015=100 (NSA)	Hungary	2017-04-01	125.5
Total	Index, 2015=100 (NSA)	Ireland	2017-04-01	115.9
Total	Index, 2015=100 (NSA)	Iceland	2017-04-01	130.1
Total	Index, 2015=100 (NSA)	Italy	2017-04-01	99.6
Total	Index, 2015=100 (NSA)	Lithuania	2017-04-01	114.5
Total	Index, 2015=100 (NSA)	Luxembourg	2017-04-01	112.2
Total	Index, 2015=100 (NSA)	Latvia	2017-04-01	119.5
Total	Index, 2015=100 (NSA)	Malta	2017-04-01	108.9
Total	Index, 2015=100 (NSA)	Netherlands	2017-04-01	111.4
Total	Index, 2015=100 (NSA)	Norway	2017-04-01	115.5
Total	Index, 2015=100 (NSA)	Poland	2017-04-01	105.4
Total	Index, 2015=100 (NSA)	Portugal	2017-04-01	115.5
Total	Index, 2015=100 (NSA)	Romania	2017-04-01	114.3
Total	Index, 2015=100 (NSA)	Sweden	2017-04-01	116.0
Total	Index, 2015=100 (NSA)	Slovenia	2017-04-01	111.4
Total	Index, 2015=100 (NSA)	Slovakia	2017-04-01	113.1
Total	Index, 2015=100 (NSA)	Turkey	2017-04-01	124.3
Total	Index, 2015=100 (NSA)	United Kingdom	2017-04-01	111.2
Total	Percentage change q/q-1 (NSA)	Austria	2017-04-01	2.4
Total	Percentage change q/q-1 (NSA)	Belgium	2017-04-01	-0.3
Total	Percentage change q/q-1 (NSA)	Bulgaria	2017-04-01	2.4
Total	Percentage change q/q-1 (NSA)	Cyprus	2017-04-01	3.1
Total	Percentage change q/q-1 (NSA)	Czechia	2017-04-01	2.5

We note that our dataframe contains both the value of the index (unit = I15_NSA), as well as the percentage change (unit = PCH_Q1_NSA). We will select the I15_NSA index, a few countries and the EU-28 index, and plot the evolution of house prices over time.

hpi_data <- hpi %>% 
  
  # choose the UK, France, Poland, Spain, Portugal, Germany, Italy, and the EU28
  filter(geo %in%  c("UK", "FR", "PL", "ES","PT", "DE","IT","EU28") ) %>%  
  
  # choose value of the index (unit = `I15_NSA`)
  filter(unit == "I15_NSA")

ggplot(hpi_data, aes(x=time, y=values, group=geo, colour=geo))+
  geom_point()+
  geom_line()+
  theme_bw()+
  labs(
    title= "House price index in the EU (2015 = 100)",
    x = "Time",
    y = "Housing Price Index", 
    caption = "Source: Eurostat, code id = teicp270"
  )
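The two units in this dataset are linked: the label for PCH_Q1_NSA reads “Percentage change q/q-1”, that is, the percentage change of the index relative to the previous quarter. A small sketch of that arithmetic, using made-up index values (these are not Eurostat figures):

```r
# Hypothetical quarterly index values, for illustration only
index <- c(100, 102, 105.06)

# Percentage change of each quarter relative to the one before it
pct_change <- 100 * (index[-1] / index[-length(index)] - 1)
print(round(pct_change, 2))
```

Note that the first quarter has no predecessor, so the result has one element fewer than the index.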

Tourism Seasonality in the Mediterranean

The Eurostat database has a dedicated tourism section. I wanted to check monthly nights spent at hotels (the relevant code id is tour_occ_nim) in four Mediterranean countries, namely Portugal, Spain, Italy, and Greece, since 2000.

The code below downloads the data and plots time series plots for all countries.

# create a dataframe tourism_data that contains the eurostat data for
# code id = "tour_occ_nim", namely value of monthly nights spent at hotels
tourism_data <- get_eurostat(id="tour_occ_nim")

med_tourism <-  tourism_data %>%   
  
  # choose Portugal, Spain, Italy, and Greece
  filter(geo %in%  c("PT", "ES", "IT", "EL" ) ) %>%
  
  #use label_eurostat to get verbose descriptions of codes
  label_eurostat() %>% 
  
  # choose number of total hotel accommodations since Jan 1, 2000
  filter (c_resid == "Total", 
          nace_r2 == "Hotels and similar accommodation", 
          unit == "Number",
          time >= "2000-01-01") %>% 
  
  # express values in million of nights
  mutate(values = values/1000000) 

ggplot(med_tourism, aes(x=time, y=values, group=geo, colour=geo))+
  geom_point()+
  geom_line()+
  geom_smooth(se=FALSE)+
  facet_wrap(~geo)+
  theme_bw()+
  labs(title="Hotel stays in the Mediterranean, 2000-present", 
       y= "Millions of nights spent in hotels",
       x = "Year",
       caption = "Source: Eurostat, code = tour_occ_nim")+
  theme(legend.position="none")

All countries exhibit the same seasonal pattern: there is a peak in July-August, and the minimum number is around December-January.

Look at the impact of Covid-19 on all countries!

#first define **ts** (time series ) objects; one for each country  

portugal_tourism <- med_tourism %>% 
  
  #select the country you are interested in, in this case Portugal
  filter (geo == "Portugal") %>% 
  
  #sort by time in ascending order, so  earliest observation is first
  arrange(time) %>%
  
  #we just want to keep the values 
  select(values) %>% 
  
  #time series (ts) starts Jan 2000 and has monthly frequency (12 months/yr)
  ts(start=2000, frequency = 12) 



spain_tourism <- med_tourism %>% 
  filter (geo == "Spain") %>% 
  arrange(time) %>% 
  select(values) %>% 
  ts(start=2000, frequency = 12)

italy_tourism <- med_tourism %>% 
  filter (geo == "Italy") %>% 
  arrange(time) %>% 
  select(values) %>%   
  ts(start=2000, frequency = 12)

greece_tourism <- med_tourism %>% 
  filter (geo == "Greece") %>% 
  arrange(time) %>% 
  select(values) %>%   
  ts(start=2000, frequency = 12)


#Season plot for Spain and Greece: the seasonal pattern is consistent since 2000
ggseasonplot(spain_tourism, year.labels=TRUE, year.labels.left=TRUE) +
  labs(
    title = "Seasonal plot: Hotel stays in Spain",
    y = "Millions of nights spent in hotels"
  )+
    theme_bw()

ggseasonplot(greece_tourism, year.labels=TRUE, year.labels.left=TRUE) +
  labs(
    title = "Seasonal plot: Hotel stays in Greece",
    y = "Millions of nights spent in hotels"
  )+
  theme_bw()
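The four pipelines above differ only in the country name, so the same steps can be wrapped in a small helper. The sketch below is in base R and runs on made-up data; the function name country_ts and the toy data frame are assumptions for illustration, not part of the eurostat or forecast packages.

```r
# Hypothetical helper: build a monthly ts object for one country.
# Assumes `data` has columns geo, time and values, as med_tourism does.
country_ts <- function(data, country, start_year = 2000) {
  rows <- data[data$geo == country, ]   # keep one country
  rows <- rows[order(rows$time), ]      # earliest observation first
  ts(rows$values, start = c(start_year, 1), frequency = 12)
}

# Made-up data (NOT Eurostat figures): two years of monthly values per country
times <- seq(as.Date("2000-01-01"), by = "month", length.out = 24)
toy <- data.frame(
  geo    = rep(c("Spain", "Greece"), each = 24),
  time   = rep(times, 2),
  values = c(1:24, 101:124)
)
toy <- toy[rev(seq_len(nrow(toy))), ]   # shuffle rows to show that sorting matters

spain_toy <- country_ts(toy, "Spain")
frequency(spain_toy)  # 12 observations per year
start(spain_toy)      # the series starts in period 1 (January) of 2000
```

With the real data, a call such as country_ts(med_tourism, "Spain") should give a series comparable to spain_tourism, although it returns a plain vector series rather than a one-column matrix.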

An interesting question is which country has the greatest seasonal swing, namely, how much bigger the summer peak is than the winter trough. For this we produce a subseries plot, which emphasises the seasonal patterns by collecting the data for each month together in separate mini time plots. The horizontal lines indicate the means for each month. This form of plot enables the underlying seasonal pattern to be seen clearly, and also shows the changes in seasonality over time. It is especially useful in identifying changes within particular seasons.

ggsubseriesplot(portugal_tourism)+
  labs(
    title = "Seasonal subseries plot: Hotel stays in Portugal 2000-present",
    subtitle = "Horizontal lines indicate monthly averages",
    y = "Millions of nights spent in hotels", 
    caption = "Source:Eurostat"
  )+
  theme_bw()

ggsubseriesplot(spain_tourism)+
  labs(
    title = "Seasonal subseries plot: Hotel stays in Spain 2000-present",
    subtitle = "Horizontal lines indicate monthly averages",
    y = "Millions of nights spent in hotels", 
    caption = "Source:Eurostat"
  )+
  theme_bw()

ggsubseriesplot(italy_tourism)+
  labs(
    title = "Seasonal subseries plot: Hotel stays in Italy 2000-present",
    subtitle = "Horizontal lines indicate monthly averages",
    y = "Millions of nights spent in hotels", 
        caption = "Source:Eurostat"
  )+
  theme_bw()

ggsubseriesplot(greece_tourism)+
  labs(
    title = "Seasonal subseries plot: Hotel stays in Greece 2000-present",
    subtitle = "Horizontal lines indicate monthly averages",
    y = "Millions of nights spent in hotels", 
    caption = "Source:Eurostat"
  )+
  theme_bw()

Visually, the approximate ratio of max:min averages for each of the four Mediterranean countries is as follows:

Portugal 6:2 = 3
Spain 39:13 = 3
Italy 43:10 = 4.3
Greece 13.5:1= 13.5
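Rather than reading the ratios off the plots, they could also be computed from the monthly averages. The sketch below shows one way to do this in base R; the function name peak_to_trough and the toy series are assumptions for illustration, and the toy numbers are not Eurostat data.

```r
# Hypothetical helper: ratio of the highest to the lowest monthly average
peak_to_trough <- function(x) {
  # cycle(x) gives the month (1-12) of each observation in a monthly ts
  monthly_means <- tapply(as.numeric(x), cycle(x), mean, na.rm = TRUE)
  max(monthly_means) / min(monthly_means)
}

# Made-up monthly series covering two identical years
toy <- ts(rep(c(1, 1, 2, 3, 4, 5, 6, 6, 4, 3, 2, 1), times = 2),
          start = c(2000, 1), frequency = 12)

ratio <- peak_to_trough(toy)
ratio  # 6: the peak months average 6, the trough months average 1
```

Applied to the series defined earlier, for example peak_to_trough(greece_tourism), this gives a numerical counterpart to the visual estimates listed above.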

Disposable income of private households by NUTS 2 regions

Using the eurostat data, we can create maps of, e.g., disposable income at a regional level. NUTS, or Nomenclature of Territorial Units for Statistics, is a geocode standard for referencing subdivisions (regions, counties, districts, etc.) within a country.

We will work with the Disposable income of private households by NUTS 2 regions database (code id = tgs00026).

income_data <- get_eurostat(id="tgs00026") %>% 
  select(geo,time,values) %>% 
  dplyr::mutate(cat = cut_to_classes(values))


income_2016 <- income_data %>% 
  filter(time == "2016-01-01")

# Download geospatial data from GISCO
geodata <- get_eurostat_geospatial(output_class = "sf",
                                   resolution = "60",
                                   nuts_level = 2,
                                   year = 2016) 


map_data <- inner_join(geodata, income_2016)


ggplot(data=map_data) + geom_sf(aes(fill=cat),color="dim grey", size=.1) + 
  scale_fill_brewer(palette = "Accent") +
  guides(fill = guide_legend(reverse=T, title = "euro")) +
  labs(title="Disposable household income in 2016",
       caption="(C) EuroGeographics for the administrative boundaries 
                Map produced in R with help from the eurostat package <github.com/ropengov/eurostat/>") +
  theme_light() + theme(legend.position=c(.8,.8)) +
  coord_sf(xlim=c(-12,44), ylim=c(35,70))

Acknowledgments

This page is derived in part from the tutorial (vignette) for the eurostat R package.

Using Github

Tue, 25 Aug 2020 00:00:00 +0000

Introduction to Git/Github

When you engage in any kind of data science or programming, there comes a (frustrating) point that you need to understand how Git and GitHub work. Learning how to use Git and GitHub is especially important for keeping versions of your work (think something like Dropbox + MS Word’s Track Changes) and collaborating with others.

Git is essentially a boring time machine. Remember when you worked on a Word file and saved it by adding the date, or calling it “mywork-version1”, “mywork-final”, “mywork-final-final”, etc?

Git is organised around repositories; repos are folders where you keep a project with all necessary files (code, data, images, etc). So you first need to tell git which files/folders to keep track of for any changes you will be making.

As you keep adding code to your project/assignment/etc, you commit changes into your repository and you add an explanatory comment, or message to yourself briefly describing the changes/additions/new work you have done.

When you commit changes, it’s as though you take a snapshot of your work and write a short comment to yourself; it would be the same as saving your Word document adding today’s date in the filename, or v1, v2, final, final-final, etc.

After committing your changes, you need to pull first, so that you get the latest copy from GitHub, and then push your changes to GitHub; this is when you actually upload your work.

Git workflow

The following lists the main steps to create a repository and keep it updated:

Create a repo on GitHub and initialize with a README.
Clone the repo to your local machine. You can either do it as an RStudio Project, or using a shell command: $ git clone REPOSITORY-URL
Add or Stage any changes you make: $ git add -A
Commit your changes: $ git commit -m "Helpful message to yourself/collaborators"
Pull from GitHub: $ git pull
Push your changes to GitHub: $ git push

Repeat the add, commit, pull, and push steps often, especially add and commit.

Git keeps track of all the changes you have made in your repo, just in case you made a mistake and need to go back to an earlier version where things actually worked. GitHub is a website built on top of Git that allows you to collaborate on code with others, whether that means helping with code fixes, documentation, or more.

Further resources

For R users, Jenny Bryan et al have created Happy Git with R, a brilliant resource that shows you how to use Git and GitHub in RStudio effectively.

One final thing: git can be confusing and frustrating as hell (ask me for details). Add git to the challenges of coding and you sometimes end up with people asking themselves interesting questions.

When things do go wrong (they will), have a look at https://ohshitgit.com/ and http://happygitwithr.com/burn.html

H-E-L-P!

Sat, 25 Jul 2020 00:00:00 +0000

Overview

I recently got an ad on my phone trying to sell me a service to become a data scientist in a month. This may make you think that doing data science with R is an easy, straightforward process.

It is not.

Any time I have been struggling to learn a new tool/technology/software … I go back to this short clip I cut out of @hadleywickham’s 2014 @user2014_ucla tutorial on Dplyr to motivate myself to keep pushing through … learning new tools is hard for everyone at the beginning!
— Aliakbar Akbaritabar (Ali) (@Akbaritabar) July 25, 2018

You will stumble, get frustrated, lost, and confused, make (silly) mistakes even when you think you know stuff, not understand how to perform a task, not understand why your code is generating an error, etc. It still happens to me all the time, and I am still googling really basic stuff about R, even after quite a few years using it. But as Alfred so helpfully points out to Bruce Wayne in Batman Begins, do not fall to pieces when you fail. Instead, learn to pick yourself up, learn from experience, practice more, and get better.

Back in 2018, Hadley Wickham, one of the major driving forces behind the tidyverse, recorded a video of his live analysis of a dataset with the goal of demonstrating his approach.

It’s great to see Hadley, a true expert, undertake data analysis; he makes quite a few mistakes in this video and he even forgets the arguments for routines/packages he has written! But it’s even more powerful that he shrugs off the mistake, corrects it, and moves forward.

Even more interesting is to see Hadley, the author of the ggplot2 package, admit to using Google to look things up in… ggplot2!

Error messages in R

Error messages are a normal part of working in R, not a sign that you are bad at it. To make matters worse, R will alert you with red letters not just for errors, but for warnings, too. It helps to learn relatively early on how to decipher these messages and what the common ones mean.

First, if you see red letters after typing a command, don’t panic: it may just be a warning, and most of the time you can ignore warnings or worry about them later.

But you will get errors too (also in red letters!). As an example, let me try to read a CSV file using the read_csv function.

There are three main parts to an error:

The declaration that it is an Error, and not a Warning
The location of the error: it is in the read_csv("myfile.csv") line of my code
The issue my code caused: could not find function "read_csv", as I asked R to use a function from the readr package but forgot to load it.
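The three parts listed above can also be inspected programmatically. The sketch below uses base R's tryCatch() to capture the error instead of stopping; it assumes a fresh session in which readr has not been loaded, and the file name myfile.csv is just the placeholder used in the text.

```r
# readr is deliberately NOT loaded here, so R cannot find read_csv()
err <- tryCatch(
  read_csv("myfile.csv"),
  error = function(e) e   # return the error object instead of stopping
)

class(err)             # includes "error", so this is an error and not a warning
conditionCall(err)     # the location: the call that failed
conditionMessage(err)  # the issue: could not find function "read_csv"
```

Running library(readr) first would remove this particular error; the same call would then fail only if the file is missing from the working directory.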

Let me try again.

The error given now is again produced by the same read_csv function, but the error is that the CSV file does not exist in the working directory.

Failure, and the 15 minute rule

It’s good practice to follow the 15 minute rule. If you encounter a problem in your work, spend 15 minutes troubleshooting it on your own; Google, RStudio Support, and StackOverflow are good places to look for answers. If you google your error message, you will find that 99% of the time someone has already had the same error and the solution is on StackOverflow.

However, if after 15 minutes you still cannot solve the problem, ask for help.

15 min rule: when stuck, you HAVE to try on your own for 15 min; after 15 min, you HAVE to ask for help.
— Rachel Thomas (@math_rachel) August 15, 2016

The `reprex` package

How should you ask for help? You must provide enough information so that others can understand what the issue with your code is and try to reproduce it on their own computer. StackOverflow provides advice not only on technical questions but also on how to ask good questions! A very popular post addresses how to make a great R reproducible example.

The reprex package, written by Jenny Bryan, was developed to help create reproducible examples, so others can reproduce your code, run it, and see where the issue is.

`reprex` with copy-paste

Reprex works with whatever is currently saved on your clipboard. The easiest way to use reprex is to highlight with your mouse the part of the code that gives you an error and copy it to your clipboard using Command+c (Mac) or Control+c (Windows).

Now that the code is on your clipboard, just type reprex() in the console; the rendered reprex will then be on the clipboard, which means you can paste it directly into a new Rmd file.

x <- 1:10
mean(x)
#> [1] 5.5

`reprex` directly with the reprex() command

Besides copy-and-paste, which is the easiest way to use reprex, you can pass the code you want to share or debug directly to the reprex() command. Let us look at a few examples.

reprex(gapminder %>% summarise(lifeExp))
#> Error in gapminder %>% summarise(lifeExp): could not find function "%>%"

Created on 2019-07-16 by the reprex package (v0.3.0)

The error message tells us that R cannot find the pipe operator %>%, as we haven’t given the library(dplyr) command. reprex runs your code in a fresh R session, so it exposes any packages or data that your example needs but does not load itself. The output above is now automatically stored on your clipboard, and you can paste it directly (with Ctrl/Cmd+v) wherever you need it.

Let us load the library and try again.

reprex({library(dplyr); gapminder %>% summarise(lifeExp)})
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Error in eval(lhs, parent, parent): object 'gapminder' not found

dplyr is ok now, and the pipe operator works, but we now realise that the gapminder package has not been loaded; let’s try again.

reprex({library(dplyr); library(gapminder); gapminder %>% summarise(lifeExp)})
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Error: Column `lifeExp` must be length 1 (a summary value), not 1704

The error we get relates to the use of the summarise function; this function summarises many values into a single summary, like mean, min, median, etc. R tells us that lifeExp must be of length 1 (a single summary value) rather than 1704 values which is how many values lifeExp has.
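The fix is to wrap the column in a function that returns a single value. The sketch below shows this on a tiny made-up tibble that stands in for gapminder; the object names toy, overall and by_country, and the numbers, are assumptions for illustration only.

```r
library(dplyr)

# Made-up data standing in for gapminder (values are NOT real)
toy <- tibble(
  country = c("A", "A", "B"),
  lifeExp = c(70, 72, 80)
)

# One summary value for the whole data frame
overall <- toy %>%
  summarise(mean_lifeExp = mean(lifeExp))

# One summary value per group
by_country <- toy %>%
  group_by(country) %>%
  summarise(mean_lifeExp = mean(lifeExp))

overall     # a single row: mean_lifeExp = 74
by_country  # two rows: A = 71, B = 80
```

With the real data, the corresponding call would be gapminder %>% summarise(mean_lifeExp = mean(lifeExp)).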

Further Resources

Using SQL within R

Tue, 11 Aug 2020 00:00:00 +0000

SQL and `dbplyr`

This short note will teach you the basics of using SQL databases with R. Sometimes you have a massive dataset, made up of many different dataframes (or tables in SQL jargon), that would exhaust your computer’s memory if you tried to load it. To interact with any database you typically use SQL, the Structured Query Language.

Rather than writing SQL commands, the dbplyr package automatically generates SQL commands from dplyr sequences. However, please keep in mind that SQL is a very large language, and dbplyr doesn’t do everything, but you can still get a lot out of it.

SQL commands vs dplyr verbs

One of the advantages of learning about tidy data and dplyr is that with dplyr verbs you can replicate a lot of what SQL does.

`SQL` command…	… translates to `dplyr` verb
SELECT	`select()` for columns, `mutate()` for expressions, `summarise()` for aggregates
FROM	which dataframe to use
WHERE	`filter()`
GROUP BY	`group_by()`
ORDER BY	`arrange()`
LIMIT	`head()`
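To see the mapping in the table at work, the sketch below builds a throwaway in-memory SQLite database and runs one dplyr pipeline against it. The table toy_matches and its values are made up for illustration; the sketch assumes the DBI, RSQLite, dplyr, and dbplyr packages are installed.

```r
library(dplyr)

# Throwaway in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up match results (NOT real data)
toy_matches <- data.frame(
  league     = c("A", "A", "B", "B", "C"),
  home_goals = c(2, 1, 0, 3, 1),
  away_goals = c(1, 1, 0, 2, 0)
)
DBI::dbWriteTable(con, "toy_matches", toy_matches)

query <- tbl(con, "toy_matches") %>%          # FROM
  filter(home_goals + away_goals > 0) %>%     # WHERE
  group_by(league) %>%                        # GROUP BY
  summarise(avg_goals = mean(home_goals + away_goals, na.rm = TRUE)) %>%  # SELECT ... AVG()
  arrange(desc(avg_goals)) %>%                # ORDER BY
  head(2)                                     # LIMIT

show_query(query)          # prints the SQL that dbplyr generated
result <- collect(query)   # runs the query and returns a tibble
DBI::dbDisconnect(con)

result  # league B (average 5) first, then league A (average 2.5)
```

The exact SQL text printed by show_query() can differ between dbplyr versions, so it is not reproduced here.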

Establish a connection with the SQLite database

We will use the European Soccer Database that has more than 25,000 matches and more than 11,000 players. We first need to establish a connection with the SQL database. Unlike dataframes that we just load into memory, the size of some SQL databases prohibits loading the entire database into memory. Although this soccer database is a locally saved file, we would use a similar connection to any SQL database over the internet.

# set up a connection to sqlite database
football_db <- DBI::dbConnect(
  drv = RSQLite::SQLite(),
  dbname = "database.sqlite"
)

The general code for connecting to a remote database is:

# template only: fill in your own details; exact argument names can vary by driver
connection_to_db <- DBI::dbConnect(
  drv = odbc::odbc(),        # the database driver, e.g. odbc::odbc()
  dbname = "database_name",
  user = "User_ID",
  password = "Password",
  host = "host name",        # default = local connection
  port = "port number"       # default = local connection
)

That’s pretty much it - R now has a direct connection to the database and you can start making queries.
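A connection stays open until it is closed, so it is good practice to disconnect once you are done. The sketch below shows the full open, use, close cycle on a throwaway in-memory SQLite database; the table name toy is an assumption for illustration, and the sketch assumes the DBI and RSQLite packages are installed.

```r
# Open a connection to a throwaway in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Write a tiny made-up table and list what the database contains
DBI::dbWriteTable(con, "toy", data.frame(id = 1:3))
tables <- DBI::dbListTables(con)
tables                                # "toy"

still_open <- DBI::dbIsValid(con)     # TRUE while the connection is open

# Close the connection when finished
DBI::dbDisconnect(con)
closed <- !DBI::dbIsValid(con)        # TRUE once the connection is closed
```

For the soccer database used in this note, the equivalent final step would be DBI::dbDisconnect(football_db).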

Database objects or tibbles?

Now, an SQL database will typically contain multiple tables. You can think of these tables as R data frames (or tibbles). What are the tables in the soccer database? We can browse the tables in the database using DBI::dbListTables().

DBI::dbListTables(football_db)

## [1] "Country"           "League"            "Match"            
## [4] "Player"            "Player_Attributes" "Team"             
## [7] "Team_Attributes"   "sqlite_sequence"

We can easily set these tables up as database objects using dplyr:

countries <- dplyr::tbl(football_db, "Country")
leagues <- dplyr::tbl(football_db, "League")
matches <- dplyr::tbl(football_db, "Match")
teams <- dplyr::tbl(football_db, "Team")
team_attributes <- dplyr::tbl(football_db, "Team_Attributes")
players <- dplyr::tbl(football_db, "Player")
player_attributes <- dplyr::tbl(football_db, "Player_Attributes")

Each of these tables is an SQL database object in your R session, which you can manipulate in the same way as a dataframe.

class(countries)

## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"

When you define these tables, you are not physically downloading them, just creating a bare minimum extract to work with.

If you want to handle these as normal dataframes or tibbles, you can simply pipe the database objects to as_tibble().

player_attributes_df <- player_attributes %>% as_tibble()
class(player_attributes)

## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"

class(player_attributes_df)

## [1] "tbl_df"     "tbl"        "data.frame"

Notice the difference between player_attributes, a database object, and player_attributes_df, a ‘regular’ dataframe/tibble.

Now that we have player attributes as a dataframe, we can handle it the usual way and, e.g., build a scatterplot/correlation matrix with ggpairs() from the GGally package.

player_attributes_df %>% 
  filter(!is.na(preferred_foot)) %>% 
  select(preferred_foot, ball_control, overall_rating) %>% 
  ggpairs(aes(colour=preferred_foot, alpha = 0.3))+
  scale_colour_manual(values = c("#67a9cf","#ef8a62"))+
  scale_fill_manual(values = c("#67a9cf","#ef8a62"))+
  theme_bw()

Querying the database with `dbplyr`

To create the ggpairs() plot we had to convert a database table to a dataframe, load it all in the computer’s memory, and then use ggplot. The beauty of working with databases is that we do NOT have to load everything into memory. Instead, all dplyr calls are evaluated lazily, generating SQL code that is only sent to the database when you request the data.

Let us look at an example. What if we wanted to calculate the average number of goals per league (country) per season and then plot those averages? This seems like a data wrangling exercise we would do with dplyr.

# write dplyr code that will calculate average number of goals per country per season
goals_per_match <-  matches %>%
  group_by(country_id, season) %>%
  summarise(avg_goals = mean(home_team_goal + away_team_goal)) %>%
 
  #do a left_join, so we know the country's name rather than the country's ID
  left_join(countries, by = c("country_id"="id")) %>%
  arrange(desc(avg_goals)) %>% 
  ungroup()

What kind of an object is goals_per_match?

#what kind of an object is goals_per_match?
class(goals_per_match)

## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"

goals_per_match is not a dataframe (tibble), but rather a lazy query against the SQLite database; no results have been retrieved yet.

Generate the actual SQL commands

We are familiar with all the dplyr verbs (filter, select, group_by, summarise, arrange, etc.), but SQL has its own commands, all of which are written in capital letters (is SQL constantly angry and shouting? Who knew…). We can generate the actual SQL commands using dbplyr::sql_render() or dplyr::show_query().

# Generate actual SQL commands: We can either use dbplyr::sql_render() or dplyr::show_query()
dbplyr::sql_render(goals_per_match)

## <SQL> SELECT *
## FROM (SELECT `LHS`.`country_id` AS `country_id`, `LHS`.`season` AS `season`, `LHS`.`avg_goals` AS `avg_goals`, `RHS`.`name` AS `name`
## FROM (SELECT `country_id`, `season`, AVG(`home_team_goal` + `away_team_goal`) AS `avg_goals`
## FROM `Match`
## GROUP BY `country_id`, `season`) AS `LHS`
## LEFT JOIN `Country` AS `RHS`
## ON (`LHS`.`country_id` = `RHS`.`id`)
## )
## ORDER BY `avg_goals` DESC

goals_per_match %>% show_query()

## <SQL>
## SELECT *
## FROM (SELECT `LHS`.`country_id` AS `country_id`, `LHS`.`season` AS `season`, `LHS`.`avg_goals` AS `avg_goals`, `RHS`.`name` AS `name`
## FROM (SELECT `country_id`, `season`, AVG(`home_team_goal` + `away_team_goal`) AS `avg_goals`
## FROM `Match`
## GROUP BY `country_id`, `season`) AS `LHS`
## LEFT JOIN `Country` AS `RHS`
## ON (`LHS`.`country_id` = `RHS`.`id`)
## )
## ORDER BY `avg_goals` DESC

Run SQL query and retrieve results

Now that we have the SQL query, we can retrieve the results into a local dataframe (tibble) using collect(). The main difference is that, rather than loading the entire database into memory, goals_per_match goes to the SQL database, collects the necessary data, and only returns the results of the query.

# execute query and retrieve results in a tibble (dataframe). 
goals_match_tibble <- goals_per_match %>% 
  collect()

#have a look at the resulting dataframe with glimpse() and skim()
glimpse(goals_match_tibble)

## Rows: 88
## Columns: 4
## $ country_id <int> 24558, 13274, 13274, 13274, 7809, 13274, 24558, 13274, 2...
## $ season     <chr> "2009/2010", "2011/2012", "2010/2011", "2013/2014", "201...
## $ avg_goals  <dbl> 3.33, 3.26, 3.23, 3.20, 3.16, 3.15, 3.14, 3.08, 3.00, 2....
## $ name       <chr> "Switzerland", "Netherlands", "Netherlands", "Netherland...

skimr::skim(goals_match_tibble)

Table 1: Data summary
Name	goals_match_tibble
Number of rows	88
Number of columns	4
_______________________
Column type frequency:
character	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
season	0	1	9	9	0	8	0
name	0	1	5	11	0	11	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
country_id	0	1	12452.09	7877.88	1.00	4769.00	13274.00	19694.00	24558.00	▇▂▅▅▇
avg_goals	0	1	2.71	0.24	2.18	2.56	2.71	2.86	3.33	▂▆▇▃▂

The resulting dataframe only has the information we want: 4 variables (columns) and 88 rows (cases); for each country and season, we have the average number of goals scored per match. We can now use this smaller dataframe and plot the results.

# plot results, using goals_match_tibble
ggplot(goals_match_tibble) + 
  geom_point(aes(x=reorder(name, avg_goals),y=avg_goals, colour=name))+
  theme_bw(8)+ 
  facet_wrap(~season, nrow=4)+
  labs(
    title = "Which football leagues had the highest number of goals per game?",
    y = "Average Number of Goals per Match",
    x = "National League"
    ) + 
  coord_flip() +
  theme(legend.position = "none")

ggplot(goals_match_tibble, aes(x=reorder(name, avg_goals),y=avg_goals, colour=name)) + 
  geom_violin()+
  # geom_boxplot()+
  geom_jitter()+
  theme_bw()+
  labs(
    title = "Which football leagues had the highest number of goals per game?",
    subtitle="2008/2009 to 2015/2016",
    y = "Average Number of Goals per Match",
    x = "National League"
  ) + 
  coord_flip() +
  theme(legend.position = "none")

We can now run queries on the database, collect the results in a local dataframe, and show the results of, e.g., the highest overall rating of all players in the database.

# Which are the top 20 players by overall rating (`overall_rating`)?
top_players <-  player_attributes %>%
  group_by(player_api_id) %>%
  summarise(max_rating = max(overall_rating)) %>% 
  arrange(desc(max_rating)) %>% 
  left_join(players, by = c("player_api_id"="player_api_id")) %>%
  collect()

top_players %>% 
  head(20) %>% 
  kableExtra::kable()

player_api_id	max_rating	id	player_name	player_fifa_api_id	birthday	height	weight
30981	94	6176	Lionel Messi	158023	1987-06-24 00:00:00	170	159
30893	93	1995	Cristiano Ronaldo	20801	1985-02-05 00:00:00	185	176
30829	93	10749	Wayne Rooney	54050	1985-10-24 00:00:00	175	183
30717	93	3826	Gianluigi Buffon	1179	1978-01-28 00:00:00	193	201
39989	92	3994	Gregory Coupet	1747	1972-12-31 00:00:00	180	176
39854	92	10861	Xavi Hernandez	10535	1980-01-25 00:00:00	170	148
34520	91	3183	Fabio Cannavaro	1183	1973-09-13 00:00:00	175	165
30955	91	742	Andres Iniesta	41	1984-05-11 00:00:00	170	150
30743	91	9216	Ronaldinho	28130	1980-03-21 00:00:00	183	168
30723	91	388	Alessandro Nesta	1088	1976-03-19 00:00:00	188	174
30657	91	4366	Iker Casillas	5479	1981-05-20 00:00:00	185	185
30627	91	5120	John Terry	13732	1980-12-07 00:00:00	188	198
30626	91	10203	Thierry Henry	1625	1977-08-17 00:00:00	188	183
41044	90	5592	Kaka	138449	1982-04-22 00:00:00	185	183
40636	90	6377	Luis Suarez	176580	1987-01-24 00:00:00	183	187
38843	90	11039	Ze Roberto	28765	1974-07-06 00:00:00	173	159
35724	90	11057	Zlatan Ibrahimovic	41236	1981-10-03 00:00:00	196	209
30924	90	3514	Franck Ribery	156616	1983-04-07 00:00:00	170	159
30834	90	951	Arjen Robben	9014	1984-01-23 00:00:00	180	176
30728	90	2426	David Trezeguet	5984	1977-10-15 00:00:00	190	176

# Which are the top 20 goalkeepers by sum of all gk attributes (`gk_diving`,`gk_handling`, `gk_kicking`, etc)?
top_goalies <-  player_attributes %>%
  mutate(goalie_rating = gk_diving + gk_handling + gk_kicking + gk_positioning + gk_reflexes) %>% 
  group_by(player_api_id) %>%
  summarise(max_goalie_rating = max(goalie_rating)) %>% 
  arrange(desc(max_goalie_rating)) %>% 
  left_join(players, by = c("player_api_id"="player_api_id")) %>%
  collect()
  

top_goalies %>% 
  head(20) %>% 
  kableExtra::kable()

player_api_id	max_goalie_rating	id	player_name	player_fifa_api_id	birthday	height	weight
30717	449	3826	Gianluigi Buffon	1179	1978-01-28 00:00:00	193	201
39989	447	3994	Gregory Coupet	1747	1972-12-31 00:00:00	180	176
30859	445	8580	Petr Cech	48940	1982-05-20 00:00:00	196	198
30657	442	4366	Iker Casillas	5479	1981-05-20 00:00:00	185	185
27299	440	6556	Manuel Neuer	167495	1986-03-27 00:00:00	193	203
30989	438	5536	Julio Cesar	48717	1979-09-03 00:00:00	185	174
24503	437	9579	Sebastian Frey	1289	1980-03-18 00:00:00	190	198
30726	436	2900	Edwin van der Sar	51539	1970-10-29 00:00:00	198	196
182917	429	2340	David De Gea	193080	1990-11-07 00:00:00	193	181
30660	428	8541	Pepe Reina	24630	1982-08-31 00:00:00	188	203
30622	426	8413	Paul Robinson	13914	1979-10-15 00:00:00	193	198
32657	425	10625	Victor Valdes	106573	1982-01-14 00:00:00	183	172
30742	425	7470	Mickael Landreau	3813	1979-05-14 00:00:00	183	185
26295	425	4272	Hugo Lloris	167948	1986-12-26 00:00:00	188	172
30841	424	6446	Maarten Stekelenburg	2147	1982-09-22 00:00:00	198	203
27341	424	9028	Robert Enke,30	158400	1977-08-24 00:00:00	185	172
30648	423	4832	Jens Lehmann	805	1969-11-10 00:00:00	190	192
33986	421	746	Andres Palop	8247	1973-10-22 00:00:00	183	165
31293	421	10009	Steve Mandanda	163705	1985-03-28 00:00:00	185	181
30380	420	1345	Brad Friedel	11983	1971-05-18 00:00:00	188	203

Other references

Other Data Sources

Fri, 31 Jul 2020 00:00:00 +0000

The web is a vast source of datasets on almost any subject, such as demographics, disease, economics, finance, geography, entertainment, science, etc. You can always start with Google’s Dataset Search that indexes thousands of public datasets.

Here are some more suggestions:

Kaggle: Kaggle hosts machine learning competitions and contains a large number of datasets that are generally free and open to the public.
Awesome Public Datasets: Collection of public datasets, arranged by area
Reddit
UK data and UK Office for National Statistics
U.S. Government’s open data with many datasets on a range of issues
Data is plural: a weekly newsletter that has collected over a thousand useful/curious datasets. This may well be one of my favourite dataset collections!
Our World in Data contains time series of demographic and global development data. Their collection of Covid-19 data is among the best.
TidyTuesday: A weekly data project in R where they release a new dataset every week and emphasis is placed on understanding how to summarise and arrange data to make meaningful charts with ggplot2, tidyr, dplyr, and other tools in the tidyverse ecosystem.
fivethirtyeight.com is a data-driven journalism site that shares the data on most of its stories
In terms of investigative journalism, The Markup and ProPublica are both data-driven and share their data; All Markup data is freely available and ProPublica provides many of their datasets for free
Erik Gahner’s list of political science datasets: Datasets divided by topic (governance, elections, policy, political elites, etc.), geography (country, region), etc.
BigQuery public datasets: Google has set up BigQuery, a data warehouse for some large datasets that you really need to access with SQL. There is even an R package, bigrquery, that allows you to easily talk with BigQuery’s database.
