Final Group Project: AirBnB analytics

Due by 11:55 PM on Sunday, November 14, 2021

In your final group assignment, you have to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in a city. You can download AirBnB data from insideairbnb.com; it was originally scraped from airbnb.com.

The listings come as a GZ file, i.e., a file compressed with the standard GNU zip (gzip) algorithm. You can download, save and extract the file if you want, but vroom::vroom() or readr::read_csv() can read and decompress this kind of file directly. Prefer vroom() as it is faster, but if vroom() is blocked by a firewall, please use read_csv() instead.

library(vroom)
listings <- vroom("http://data.insideairbnb.com/germany/bv/munich/2020-06-20/data/listings.csv.gz")

vroom will download the *.gz zipped file, unzip, and provide you with the dataframe.
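To see the direct-read behaviour without touching the network, here is a minimal sketch that writes a tiny gzip-compressed CSV to a temporary file and reads it back with read_csv(). The toy data frame and the temporary path are made up for illustration; the sketch only assumes that readr is installed.

```r
library(readr)

# Toy stand-in for a listings file (values are made up for illustration)
toy <- data.frame(id = 1:3, price = c("$50.00", "$176.00", "$1,250.00"))

# write_csv() compresses automatically when the path ends in .gz
gz_path <- tempfile(fileext = ".csv.gz")
write_csv(toy, gz_path)

# read_csv() decompresses the .gz file on the fly; col_types = cols() silences the type message
toy_back <- read_csv(gz_path, col_types = cols())
toy_back
```

The same call pattern should work with a URL in place of gz_path, which is what the vroom() line above relies on.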

Even though there are many variables in the dataframe, here is a quick description of some of them, with cost data typically expressed in US$:

  • price: cost per night

  • cleaning_fee: cleaning fee

  • extra_people: charge for each additional guest beyond the number in guests_included

  • property_type: type of accommodation (House, Apartment, etc.)

  • room_type:

    • Entire home/apt (guests have entire place to themselves)
    • Private room (guests have private room to sleep, all other rooms shared)
    • Shared room (guests sleep in room shared with others)
  • number_of_reviews: Total number of reviews for the listing

  • review_scores_rating: Average review score (0 - 100)

  • longitude, latitude: geographical coordinates to help us locate the listing

  • neighbourhood*: three variables on a few major neighbourhoods in each city

Exploratory Data Analysis (EDA)

In the R4DS Exploratory Data Analysis chapter, the authors state:

“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation… EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”

Conduct a thorough EDA. Recall that an EDA involves three things:

  • Looking at the raw values.
    • dplyr::glimpse()
  • Computing summary statistics of the variables of interest, or finding NAs
    • mosaic::favstats()
    • skimr::skim()
  • Creating informative visualizations.
    • ggplot2::ggplot()
      • geom_histogram() or geom_density() for numeric continuous variables
      • geom_bar() or geom_col() for categorical variables
    • GGally::ggpairs() for a scatterplot/correlation matrix
      • Note that you can add transparency to points/density plots in the aes call, for example: aes(colour = gender, alpha = 0.4)
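The three EDA steps above can be sketched on a small simulated data frame. All values below are made up, and base R's summary() and colSums(is.na()) stand in for skimr::skim() and mosaic::favstats() so that the sketch only needs dplyr and ggplot2.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the listings data (values are made up)
set.seed(42)
toy <- data.frame(
  price = round(rlnorm(200, meanlog = 4.5, sdlog = 0.5)),
  room_type = sample(c("Entire home/apt", "Private room", "Shared room"),
                     200, replace = TRUE, prob = c(0.6, 0.35, 0.05)),
  review_scores_rating = c(sample(60:100, 180, replace = TRUE), rep(NA, 20))
)

# 1. Look at the raw values
glimpse(toy)

# 2. Summary statistics and missing values per column
summary(toy$price)
na_counts <- colSums(is.na(toy))
na_counts

# 3. Visualisations: histogram for a numeric variable, bar chart for a categorical one
p_hist <- ggplot(toy, aes(x = price)) +
  geom_histogram(bins = 30) +
  labs(x = "Price per night (US$)", y = "Number of listings")

p_bar <- ggplot(toy, aes(x = room_type)) +
  geom_bar() +
  labs(x = "Room type", y = "Number of listings")

# print(p_hist) or print(p_bar) displays a plot
```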

You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:

  • How many variables/columns? How many rows/observations?
  • Which variables are numbers?
  • Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
  • What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?
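As a starting point for the questions above, here is a base R sketch on simulated data. The column names mirror the listings data, but every value and the relationship between price and cleaning_fee are invented for illustration.

```r
# Simulated stand-in for the listings data (values are made up)
set.seed(1)
toy <- data.frame(
  price = round(runif(100, 30, 300)),
  bedrooms = sample(1:4, 100, replace = TRUE),
  room_type = factor(sample(c("Entire home/apt", "Private room"), 100, replace = TRUE))
)
toy$cleaning_fee <- round(0.2 * toy$price + rnorm(100, sd = 5))

# How many rows/observations and variables/columns?
dim(toy)

# Which variables are numeric, and which are categorical?
is_num <- sapply(toy, is.numeric)
names(toy)[is_num]
names(toy)[!is_num]

# Pairwise correlations between the numeric variables
cor_mat <- cor(toy[, is_num], use = "pairwise.complete.obs")
round(cor_mat, 2)

# Is a correlation conditional on a categorical variable? Compute it within each room type
by(toy, toy$room_type, function(d) cor(d$price, d$cleaning_fee))
```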

At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!

In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather explain the “so what” of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.

Data wrangling

Once you load the data, it’s always a good idea to use glimpse to see what kind of variables you have and what data type (chr, num, logical, date, etc) they are.

Notice that some of the price data (price, cleaning_fee, extra_people) is given as a character string, e.g., “$176.00”.

Since price is a quantitative variable, we need to make sure it is stored as numeric data (num) in the dataframe. To do so, we will first use readr::parse_number(), which drops any non-numeric characters before or after the first number.

listings <- listings %>% 
  mutate(price = parse_number(price))

Use typeof(listings$price) to confirm that price is now stored as a number.
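A quick way to convince yourself of what parse_number() does is to try it on a few strings in the same format as the price column. The strings below are illustrative examples, not values taken from the data.

```r
library(readr)

# Price strings in the format found in the listings data (illustrative values)
raw_price <- c("$176.00", "$1,250.00", "$50.00")

# parse_number() drops the currency symbol and the thousands separator
price_num <- parse_number(raw_price)
price_num

# The result is stored as a double, i.e., a number
typeof(price_num)
```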

Handling missing values (NAs)

Use the skimr::skim() function to view a summary of the cleaning_fee data. This is also stored as a character, so you have to turn it into a number, as discussed earlier.

  • How many observations have missing values for cleaning_fee?
  • What do you think is the most likely reason for the missing observations of cleaning_fee? In other words, what does a missing value of cleaning_fee indicate?
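If you want a quick cross-check of the number of missing values that skim() reports, a sketch like the following works on any numeric vector. The toy strings are made up and only mimic how cleaning_fee arrives in the raw data.

```r
library(readr)

# Toy cleaning_fee column as it arrives: character strings with some NAs (made-up values)
cleaning_fee_raw <- c("$30.00", NA, "$45.00", NA, "$10.00")

# Convert to numeric; missing strings stay NA
cleaning_fee <- parse_number(cleaning_fee_raw)

# Number and share of missing values
n_missing <- sum(is.na(cleaning_fee))
share_missing <- mean(is.na(cleaning_fee))
```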

cleaning_fee is an example of data that is missing not at random, since there is a specific pattern/explanation to the missing data.

Fill in the code below to impute the missing values of cleaning_fee with an appropriate numeric value. Then use the skimr::skim() function to confirm that there are no longer any missing values of cleaning_fee.

listings <- listings %>%
  mutate(cleaning_fee = case_when(
    is.na(cleaning_fee) ~ ______, 
    TRUE ~ cleaning_fee
  ))

Next, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?
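One possible way to get both the frequencies and the proportions in a single table is sketched below. The category labels other than "Apartment" are hypothetical placeholders and the frequencies are invented, so the output says nothing about the real data.

```r
library(dplyr)

# Toy property types; "Type B" to "Type E" are hypothetical labels with made-up frequencies
toy <- data.frame(
  property_type = rep(c("Apartment", "Type B", "Type C", "Type D", "Type E"),
                      times = c(50, 20, 15, 10, 5))
)

# Frequency of each category, most common first, plus its share of all listings
type_counts <- toy %>%
  count(property_type, sort = TRUE) %>%
  mutate(prop = n / sum(n))
type_counts

# Share of listings covered by the four most common types
top4_share <- sum(head(type_counts$prop, 4))
top4_share
```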

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of the property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.

listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Apartment","______", "______","______") ~ property_type, 
    TRUE ~ "Other"
  ))
  

Use the code below to check that prop_type_simplified was correctly made.

listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))        

Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:

  • What are the most common values for the variable minimum_nights?
  • Is there any value among the common values that stands out?
  • What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?

Filter the Airbnb data so that it only includes observations with minimum_nights <= 4.
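The count-then-filter pattern can be sketched as follows. The data frame is a toy with invented values, so the frequencies it shows are not a hint about the real listings.

```r
library(dplyr)

# Toy data (made-up values)
toy <- data.frame(
  id = 1:10,
  minimum_nights = c(1, 1, 2, 2, 2, 3, 4, 5, 7, 14)
)

# Most common values of minimum_nights, most frequent first
toy %>% count(minimum_nights, sort = TRUE)

# Keep only listings with minimum_nights <= 4
toy_travel <- toy %>% filter(minimum_nights <= 4)
nrow(toy_travel)
```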

Mapping

Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.

The following code assumes you have already created a dataframe listings with all Airbnb listings for your city. It plots on the map all listings where minimum_nights is less than or equal to four (4). You can learn more about leaflet by following the relevant Datacamp course on mapping with leaflet.

library(dplyr)    # filter() and %>%
library(leaflet)

leaflet(data = filter(listings, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

Create a new variable called price_4_nights that uses price, cleaning_fee, guests_included, and extra_people to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
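One way to think about the calculation is sketched below on two toy listings. It rests on an assumption that you should check against the data documentation: that extra_people is charged per additional guest, per night, for every guest beyond guests_included. The helper column extra_guests is introduced here for illustration only.

```r
library(dplyr)

# Toy listings (made-up values); all money columns are already numeric
toy <- data.frame(
  price = c(100, 80),
  cleaning_fee = c(30, 0),
  guests_included = c(1, 2),
  extra_people = c(10, 15)
)

toy <- toy %>%
  mutate(
    # Guests (out of two) not covered by the base price
    extra_guests = pmax(2 - guests_included, 0),
    # ASSUMPTION: extra_people is charged per extra guest, per night
    price_4_nights = 4 * price + cleaning_fee + 4 * extra_people * extra_guests
  )

toy$price_4_nights
```

For the first listing this gives 4 * 100 + 30 + 4 * 10 * 1 = 470, and for the second, where both guests are included, 4 * 80 + 0 = 320.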

Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?
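The comparison can be sketched on simulated data. The prices below are drawn from a log-normal distribution purely for illustration, so the sketch shows the mechanics of the comparison and not the answer for your city. The helper skew_gap() is a hypothetical, rough symmetry check, not a standard function.

```r
library(ggplot2)

# Simulated right-skewed prices (made-up data)
set.seed(1)
toy <- data.frame(price_4_nights = rlnorm(1000, meanlog = 6, sdlog = 0.6))

# Density plots on the raw and on the log scale
p_raw <- ggplot(toy, aes(x = price_4_nights)) + geom_density()
p_log <- ggplot(toy, aes(x = log(price_4_nights))) + geom_density()
# print(p_raw) and print(p_log) display the plots

# Rough symmetry check: gap between mean and median, in standard-deviation units
skew_gap <- function(x) abs(mean(x) - median(x)) / sd(x)
gap_raw <- skew_gap(toy$price_4_nights)
gap_log <- skew_gap(log(toy$price_4_nights))
c(raw = gap_raw, log = gap_log)
```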

Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

  • Interpret the coefficient of review_scores_rating in terms of price_4_nights.
  • Interpret the coefficient of prop_type_simplified in terms of price_4_nights.

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
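The two-model comparison might look like the sketch below. The data are simulated with an invented data-generating process, and the sketch assumes you chose log(price_4_nights) as the response; adapt the formula if you decided otherwise.

```r
# Simulated stand-in for the cleaned listings data (all values are made up)
set.seed(123)
n <- 300
toy <- data.frame(
  prop_type_simplified = sample(c("Apartment", "House", "Other"), n, replace = TRUE),
  room_type = sample(c("Entire home/apt", "Private room"), n, replace = TRUE),
  number_of_reviews = rpois(n, 20),
  review_scores_rating = round(runif(n, 70, 100))
)
# Made-up data-generating process, for illustration only
toy$price_4_nights <- exp(
  5.5 + 0.005 * toy$review_scores_rating -
    0.5 * (toy$room_type == "Private room") + rnorm(n, sd = 0.3)
)

model1 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
               review_scores_rating, data = toy)

# model2 = model1 plus room_type
model2 <- update(model1, . ~ . + room_type)

# Partial F-test: does room_type add explanatory power beyond model1?
model_comparison <- anova(model1, model2)
model_comparison

# With a log response, 100 * (exp(b) - 1) is the % difference in predicted price_4_nights
b_room <- coef(model2)["room_typePrivate room"]
pct_room <- 100 * (exp(b_room) - 1)
pct_room
```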

Further variables/questions to explore on your own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis:

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights?
  2. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
  3. Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?
  4. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights.
  5. What is the effect of cancellation_policy on price_4_nights, after we control for other variables?
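For the neighbourhood grouping in idea 4, the same case_when() pattern used for prop_type_simplified applies. In the sketch below the neighbourhood names ("Hood A" and so on) and the area labels are hypothetical placeholders; the real grouping has to come from your knowledge of the city.

```r
library(dplyr)

# Hypothetical neighbourhood names and groupings, for illustration only
toy <- data.frame(
  neighbourhood_cleansed = c("Hood A", "Hood B", "Hood C", "Hood D", "Hood E")
)

toy <- toy %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c("Hood A", "Hood B") ~ "Centre",
    neighbourhood_cleansed %in% c("Hood C") ~ "North",
    TRUE ~ "Other"
  ))

# Check how many listings fall in each simplified area
area_counts <- toy %>% count(neighbourhood_simplified)
area_counts
```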

Diagnostics, collinearity, summary tables

As you keep building your models, it makes sense to:

  1. Check the residuals, using autoplot(model_x)

  2. As you start building models with more explanatory variables, make sure you use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have collinear variables. A general guideline is that a VIF larger than 5 or 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.

  3. Create a summary table, using huxtable (https://bit-2021.netlify.app/example/modelling_side_by_side_tables/) that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.

  4. Finally, you must use the best model you came up with for prediction. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnbs that are apartments with a private room, have at least 10 reviews, and have an average rating of at least 90. Use your best model to predict the total cost to stay at such an Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.

  5. If you used a log(price_4_nights) model, make sure you anti-log to convert the value back to $. To interpret variables that are log-transformed, please have a look at the FAQ “How do I interpret a regression model when some variables are log transformed?”
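Steps 2, 4 and 5 can be sketched in base R on simulated data. Everything below is illustrative: the data and the model are made up, the VIF is computed by hand (for a numeric predictor it equals 1 / (1 - R²) from regressing that predictor on the others) rather than with car::vif(), and the sketch assumes that the "appropriate" interval for a single listing is a prediction interval.

```r
# Simulated stand-in for the cleaned listings data (all values are made up)
set.seed(7)
n <- 200
toy <- data.frame(
  number_of_reviews = rpois(n, 25),
  review_scores_rating = round(runif(n, 80, 100))
)
toy$price_4_nights <- exp(5 + 0.01 * toy$review_scores_rating + rnorm(n, sd = 0.25))

fit <- lm(log(price_4_nights) ~ number_of_reviews + review_scores_rating, data = toy)

# VIF by hand for review_scores_rating: regress it on the other predictor(s)
r2 <- summary(lm(review_scores_rating ~ number_of_reviews, data = toy))$r.squared
vif_rating <- 1 / (1 - r2)
vif_rating

# 95% prediction interval for one hypothetical listing, on the log scale
new_listing <- data.frame(number_of_reviews = 10, review_scores_rating = 90)
pred_log <- predict(fit, newdata = new_listing, interval = "prediction", level = 0.95)

# Anti-log to report the point prediction and the interval in $
pred_dollars <- exp(pred_log)
pred_dollars
```

Note that exponentiating a prediction made on the log scale gives a prediction of the median rather than the mean price, which is worth mentioning when you report the result.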

Deliverables

  • By midnight on Monday 19 Oct 2020, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown, that incorporates code and text to introduce, frame, and describe your story and findings.

Remember to follow R Markdown etiquette rules and style; don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates.

Rubric

We will use a basic checklist (adapted from the book Elements of Data Analytic Style) when reviewing data analyses. It can be used as a guide during the process of a data analysis, as a rubric for grading data analysis projects, or as a way to evaluate the quality of a reported data analysis. You don’t have to answer every one of these questions for every data analysis, but they are a useful set of ideas to keep in the back of your mind when reviewing a data analysis.

Answering the question

  • Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?
  • Did you define the metric for success before beginning?
  • Did you understand the context for the question and the business application?
  • Did you consider whether the question could be answered with the available data?

Checking the data

  • Did you plot univariate and multivariate summaries of the data?
  • Did you check for outliers? How did you handle outliers?
  • Did you identify the missing data code?

Tidying the data

  • Is each variable one column?
  • Is each observation one row?
  • Did you record the steps for moving from raw to tidy data?
  • Did you record all parameters, units, and functions applied to the data?

Exploratory analysis

  • Did you identify missing values?
  • Did you make univariate plots (histograms, density plots, boxplots)?
  • Did you consider correlations between variables (scatterplots)?
  • Did you check the units of all data points to make sure they are in the right range?
  • Did you try to identify any errors or miscoding of variables?
  • Did you consider plotting on a log scale?
  • Would a scatterplot be more informative?

Inference

  • Did you identify what large population you are trying to describe?
  • Did you clearly identify the quantities of interest in your model?
  • Did you consider potential confounders?
  • Did you identify and model potential sources of correlation such as measurements over time or space?
  • Did you calculate a measure of uncertainty for each estimate?

Prediction

  • Did you identify in advance your error measure?
  • Did you check for collinearity in your model(s)?

Written analyses

  • Did you describe the question of interest?
  • Did you describe the data set, experimental design, and question you are answering?
  • Did you specify the type of data analytic question you are answering?
  • Did you specify in clear notation the exact model you are fitting?
  • Did you explain on the scale of interest what each estimate and measure of uncertainty means?
  • Did you report a measure of uncertainty for each estimate?

Figures

  • Does each figure communicate an important piece of information or address a question of interest?
  • Do all your figures include plain language axis labels?
  • Is the font size large enough to read?
  • Does every figure have a detailed caption that explains all axes, legends, and trends in the figure?

Presentations

  • Did you lead with a brief statement of your problem that everyone can understand?
  • Did you explain the data, measurement technology, and experimental design before you explained your model?
  • Did you explain the features you will use to model data before you explain the model?
  • Did you make sure all legends and axes were legible from the back of the room?

Reproducibility

  • Did you avoid doing calculations manually?
  • Did you create a script/Rmd that reproduces all your analyses?
  • Did you save the raw and processed versions of your data?
  • Did you record all versions of the software you used to process the data?
  • Did you try to have someone else run your analysis code to confirm they got the same answers?

Acknowledgements