In this project, I used R to analyze and visualize factors that may affect a listing's chance of receiving reviews on Airbnb, including location, price, response status, host experience, and availability, with the aim of helping hosts optimize their listings.
- Defined the problem and scope
- Collected and manipulated the data
- Analyzed and visualized the results using R
- Wrote the final report
Motivation: Little discussion of unpopular listings
Although there are many analyses of overall Airbnb usage and of how Airbnb affects local communities, these explorations mostly focus on popular activity rather than on listings with few or no reviews. I came across a discussion about whether an Airbnb property with zero prior reviews is trustworthy. In that post, people shared strategies for evaluating the trustworthiness of a place in the absence of reviews. I was curious whether the factors they mentioned are reflected in actual review status.
I analyzed and visualized the factors that influence a listing's chance of getting reviews, with the aim of helping hosts optimize their listings.
How I got the insights
I was particularly interested in learning why some hosts don't receive any reviews. The four questions I originally set out to explore with this dataset were:
- Location & Price: How is review status distributed across the New York City boroughs? What is the average price by borough and review status?
- Response status: What are the response time and response rate for each review status?
- Host experience: How does host experience influence the number of reviews received?
- Availability: How does a listing's availability differ by review status?
For data collection, I gathered free data from community sources and government open data. Then I used R to manipulate and analyze the data and to visualize the results.
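As an illustration of how one of these questions maps onto R, the Location & Price question comes down to a grouped summary. The sketch below uses base R on a toy data frame; the column names (borough, price, number_of_reviews) and the values are assumptions for illustration, not the project's actual data or code.

```r
# Toy listings; column names and values are illustrative assumptions.
listings <- data.frame(
  borough = c("Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"),
  price = c(200, 150, 100, 120, 80),
  number_of_reviews = c(0, 12, 0, 3, 0)
)

# Derive a two-level review status from the review count.
listings$review_status <- ifelse(listings$number_of_reviews > 0,
                                 "has reviews", "no reviews")

# Mean price for each borough / review-status combination present in the data.
avg_price <- aggregate(price ~ borough + review_status,
                       data = listings, FUN = mean)
print(avg_price)
```

The resulting table has one row per borough and review status, which is a convenient shape for a grouped bar chart.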
Challenge #1: Translated qualitative meaning into quantitative criteria
I needed to define qualitative concepts in a reasonable quantitative way. For example, I wanted to exclude listings that were no longer maintained.
💪 What did I do?
I defined an “active listing” as one with more than 0 available calendar days in the upcoming 90 days.
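In R, this definition becomes a one-line filter. The following is a minimal sketch, assuming the number of available days in the next 90 days is stored in a column called availability_90; that column name and the toy values are hypothetical.

```r
# Toy listings; availability_90 is an assumed column name holding the
# number of available calendar days in the upcoming 90 days.
listings <- data.frame(
  id = 1:4,
  availability_90 = c(0, 15, 90, 0)
)

# Keep only "active" listings: more than 0 available days in the next 90.
active <- listings[listings$availability_90 > 0, ]
print(active)
```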
Challenge #2: Identified patterns from overwhelming and overlapping points
My initial plan for the location question was to plot the location of every individual listing. However, when I drew the map, there were too many overlapping points, which made the distribution hard to read.
💪 What did I do?
I reduced the number of points by grouping listings by zip code. I used the R package zipcode to get the geographic coordinates of New York zip codes and merged them with the aggregated dataset by zip code.
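The grouping and merge can be sketched in base R as follows. To keep the example self-contained, a small hand-made lookup table stands in for the zip code coordinates; the column names, counts, and coordinates are illustrative placeholders rather than the project's data.

```r
# Toy listings keyed by zip code (stored as character to keep leading zeros).
listings <- data.frame(
  zip = c("10001", "10001", "11201", "11201", "11201"),
  number_of_reviews = c(0, 5, 0, 0, 2),
  stringsAsFactors = FALSE
)

# Per-listing counters that can simply be summed within each zip code.
listings$n_listings <- 1
listings$n_no_review <- as.integer(listings$number_of_reviews == 0)

# One row per zip code instead of one point per listing.
by_zip <- aggregate(cbind(n_listings, n_no_review) ~ zip,
                    data = listings, FUN = sum)
by_zip$share_no_review <- by_zip$n_no_review / by_zip$n_listings

# Stand-in lookup table; coordinates are approximate placeholders.
zip_lookup <- data.frame(
  zip = c("10001", "11201"),
  latitude = c(40.75, 40.69),
  longitude = c(-74.00, -73.99),
  stringsAsFactors = FALSE
)

# Attach one coordinate pair to each aggregated zip code.
by_zip_geo <- merge(by_zip, zip_lookup, by = "zip")
print(by_zip_geo)
```

Plotting one point per zip code, sized or colored by share_no_review, avoids the overplotting that comes from drawing every listing.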
Challenge #3: Handled missing, incomplete, noisy and misaligned data
💪 What did I do?
For example, some "host_since" dates were later than the date of the host's first review. These inconsistent records produced negative waiting times, which are invalid, so I removed the 35 affected rows. I also had to work out how to transform the geographic coordinates into the correct map projection.
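The waiting-time check can be sketched like this. It assumes the first review date lives in a column called first_review (a hypothetical name; only "host_since" appears above) and uses toy dates. Rows without a first review are kept in this sketch, since listings with no reviews are the subject of the analysis.

```r
# Toy data; first_review is an assumed column name, dates are illustrative.
listings <- data.frame(
  id = 1:4,
  host_since = as.Date(c("2015-01-10", "2016-06-01", "2017-03-15", "2018-01-01")),
  first_review = as.Date(c("2015-02-09", "2016-05-20", "2017-03-15", NA))
)

# Days between becoming a host and receiving the first review.
listings$waiting_days <- as.numeric(listings$first_review - listings$host_since)

# Drop only rows where the waiting time is negative (first review dated
# before host_since); keep rows with no first review (NA).
keep <- is.na(listings$waiting_days) | listings$waiting_days >= 0
valid <- listings[keep, ]
print(valid)
```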
Challenge #4: Choice of visualization or analysis
What’s the best data structure and visualization to answer a specific question?
This project is valuable because first-time and less experienced hosts who haven't received reviews may wonder which key variables influence their chance of getting them. Knowing these potential factors, hosts with no reviews can work out a better strategy to optimize their listings.