Introduction

First, import the data:

airbnb <- read.csv("airbnb.csv")

#Convert price data from character to string
airbnb <- airbnb %>% 
  mutate(price = str_remove(price, "\\$"),
         price = str_remove(price, ","),
         price = as.integer(price))

Data integrity checks

Let’s start by performing basic data integrity checks. First, lets check for duplicates.

airbnb %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 23397
airbnb %>% distinct() %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1 23186
airbnb <- airbnb %>% distinct()

There are roughly 200 duplicate rows in the dataset. While it is possible for someone to own two similar properties and list them both, having identical review scores is less plausible. We will remove the duplicates to be safe.

Next, we can search for missing values.

While summary is a little messy (not displayed here), it provides plenty of valuable information for us to digest. First, let’s discuss the missing values. The most striking column is square_feet. Of the 23186, 23043 observations are missing values. Of the actual values, over half of them are set to 0. While the housing market in Toronto is outrageous, 0 square feet is beyond absurd! This is just more missing values. Given that the vast majority of the column is missing, we cannot expect to draw much information from the square footage of a listing. However, we have other columns such as bathrooms, accommodates, and bedrooms that offer similar information. We would expect the true square footage values to be highly correlated with these three columns. Given that they have very few missing values (10, 0, and 25 respectively), this will suffice for providing listing size information. The second set of missing values of interest are those regarding the hosts. Aside from the host_id column, all of the host specific columns are missing exactly 283 values. Note that this excludes the “N/A” entry into the host_response_time column which would appear to represent an actual “N/A” input. Perhaps this is the case where the hosts did not respond at all to guests. Before looking at review scores, let’s confirm whether the 283 NA’s are common between all of the host columns.

airbnb %>% 
  select(starts_with("host")) %>%
  select(-host_id) %>%
  filter_all(is.na) %>% 
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1   283

As we suspected, the 283 NA values are common between host columns. This indicates that there are 283 listings where we know nothing about the hosts other than their host_id which is an identifier as opposed to a characteristic.

Finally, there are NA values in the review scores columns. There seems to be nearly 4500 rows with missing review information. Once again, lets see if these NA values are common between columns.

airbnb %>% 
  select(starts_with("review")) %>%
  filter_all(is.na) %>% 
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1  4485
#airbnb %>%
#  select(contains("review")) %>%
#  filter(!is.na(review_scores_rating)) %>%
#  filter_at(vars(-review_scores_rating, -number_of_reviews), any_vars(is.na(.))) %>%
#  arrange(number_of_reviews)

4485 is the number of rows missing the review_scores_rating. It is a good sign that there are no instances where the rating is missing but the review subcategories are present. In most of the cases where there only certain categories missing, there is only one review. It seems likely that the single reviewer didn’t bother to fill out a complete review. More strange is that there is an instance with 0 reviews but a score of 100. That seems hardly reliable.

airbnb %>% filter(number_of_reviews == 0 & !is.na(review_scores_rating)) %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1     1

From the above we do see that this is an isolated incident.

A final strange observation is that the minimum price is 0. It seems odd that an airbnb listing would be free.

airbnb %>% filter(price == 0) %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1     4

These 4 observations are clearly rather fishy. An entire home/apartment for free seems like an instance where a 0 was entered as opposed to a NA value. We should remove these observations once we start the analysis phase.

The purpose of these integrity checks was to examine where we may want to consider removing variables or observations due to lack of information. The square_feet columns is clearly not helpful in that far too much of the column is missing. Similarly, the has_availability column is exclusively set to TRUE and thus offers no information to distinguish between listings. Regarding missing values at the observation level, if we plan to rely on the review scores, we must consider eliminating rows that do not contain any review information. The 20 observations with an overall score are more up for debate. They don’t necessarily provide reliable information if there is one review and most of the subcategories are missing. However, if most of the categories are filled, we can consider interpolating the remaining ones. Finally, if we plan to rely on host data, the 283 observations with no host information should be removed. It is also worth mentioning that perhaps two of the most important determinants of price, neighbourhood_cleansed and room_type are not missing any data. In real estate, location is everything and thus we expect the neighbourhood to be a key factor in our analysis. The room_type is even more obvious: an entire house should generally be more expensive than a shared room.

Preprocessing

Before starting EDA, there are a couple of simple filtering and subsetting tasks we can complete such as removing the square_feet and has_availability columns as well as shorting some column names. Also, note that host_listings_count and host_total_listings_count are identical columns. We can remove host_total_listings_count with no loss of information.

airbnb <- airbnb %>%
  select(-square_feet, -has_availability, -host_total_listings_count) %>%
  rename(neighbourhood = neighbourhood_cleansed) %>%
  filter(price != 0) 

For now, we have just removed the observations with bad price data. Given that the columns lacking review and host information do still have a price value, they are useful in assessing the distribution of listing prices and can be removed before considering the relationships between price and the missing covariates.

Exploratory Data Analysis

Price

Recalling our listing price modelling objective, let’s begin by considering the distribution of price and log(price).

airbnb %>% 
  select(price) %>%
  mutate(log.price = log(price)) %>%
  pivot_longer(cols=price:log.price, names_to = "Type", values_to = "Price") %>%
  ggplot(aes(x=Price, group=Type, colour=Type)) +
  geom_density() +
  facet_wrap(~ Type, scales="free") +
  ggtitle("Price Density Plots") +
  theme(legend.position = "none")

The above plot tells us that the price data is heavily concentrated in the under $1000 price range and even moreso in the lower end of that bucket. Noting how dramatically right-skewed the distribution is, it will be better to model the log price instead as it offers a more normal shape.

Neighbourhoods

Earlier on, we conjectured that neighbourhoods would be a defining characteristic of price as certain neighbourhoods are more desirable than others. Now, we can investigate the relationship between neighbourhood and price.

airbnb %>% 
  select(neighbourhood, price) %>%
  mutate(price = log(price)) %>%
  group_by(neighbourhood) %>%
  summarise(min.log.price = min(price), max.log.price = max(price), 
            mean.log.price=mean(price)) %>%
  ggplot(aes(x=mean.log.price, y=neighbourhood)) +
  geom_point() + 
  geom_vline(xintercept=mean(log(airbnb$price))) +
  geom_errorbarh(aes(xmin=min.log.price, xmax=max.log.price)) + 
  scale_y_discrete(limits = rev(levels(airbnb$neighbourhood))) +
  ggtitle("Neighbourhood Log Price Ranges")