Airbnb Barcelona Price Prediction
This post documents the steps involved training a simple model to predict the price in Airbnb Barcelona listing, covering following section:
- Data Exploring
- Transformation
- Training the model
- Encode model and upload
The repository is here if you would like to check the code.
Data Exploring
Airbnb data can be obtained from http://insideairbnb.com/get-the-data.html
Here we used the data from Dec 2019 for Barcelona, Spain. You can obtain a copy here
The original data in listings.csv.gz has over 100 columns, the final data frame used for price prediction only had 43 columns including ‘Price’ target. Many of them were dropped because they were mostly no value, or same value, or has high correlation with other columns.
Object Columns
Column | Transformation |
---|---|
host_id | none |
host_response_time | drop_rows_occurs_less_than(1) ->fillna_with_lowest_occurance ->dic_host_response_time |
host_is_superhost | encode_boolean_to_float ->fillna False |
host_verifications | extract_num_of_items_for_column |
host_has_profile_pic | encode_boolean_to_float ->fillna False |
host_identity_verified | encode_boolean_to_float ->fillna False |
neighbourhood_group_cleansed | drop_rows_occurs_less_than(1) ->encode_category_dic |
is_location_exact | encode_boolean_to_float |
property_type | encode_category_dic |
room_type | encode_category_dic |
bed_type | encode_category_dic |
amenities | extract_num_of_items_for_column |
has_availability | encode_boolean_to_float |
instant_bookable | encode_boolean_to_float |
cancellation_policy | drop_rows_occurs_less_than(2) -> encode_category_dic |
require_guest_profile_picture | encode_boolean_to_float |
require_guest_phone_verification | encode_boolean_to_float |
Numeric Columns
All na were filled with mean
Column |
---|
host_response_rate |
latitude |
longitude |
accommodates |
bathrooms |
bedrooms |
beds |
price |
security_deposit |
cleaning_fee |
guests_included |
extra_people |
availability_365 |
number_of_reviews |
review_scores_rating |
review_scores_accuracy |
review_scores_cleanliness |
review_scores_checkin |
review_scores_communication |
review_scores_location |
review_scores_value |
calculated_host_listings_count |
calculated_host_listings_count_entire_homes |
calculated_host_listings_count_private_rooms |
calculated_host_listings_count_shared_rooms |
review_per_month |
Folder structure
airbnb-ml
|-- notebooks # experimental
|-- models # exported models
|-- src
|-- transform.py
|-- const.py
|-- pipeline.py
|-- app.py # flask app for price prediction
|-- __init__.py # makes everything in src folder an package
|-- tests
|-- test_transform.py
|-- Dockerfile # to dockerize prediction app
|-- requirements.txt
|-- requirements-dev.txt
Testing Framework
# requirements-dev.txt
nose==1.3.7
nose-watch==0.9.2
rednose==1.3.0
Transformation
def drop_rows_occurs_less_than(df, column_name, low_bound):
counts = df[column_name].value_counts()
to_remove = counts[counts <= low_bound].index
new_df = df[~df[column_name].isin(to_remove)]
return new_df
def fillna_with_lowest_occurance(df, column_name):
df = df.fillna(
value={column_name: df[column_name].value_counts().index[-1]})
return df
def encode_boolean_to_float(df, column_name):
df[column_name] = df[column_name].astype(float)
return df
def extract_num_of_items_for_column(df, column_name):
df[column_name] = df[column_name].apply(extract_num_of_items)
return df
def extract_num_of_items(input):
input = input.split(",")
return len(input)
Transformation: Categorical Column
Some of the columns contain categorical data, instead of one-hot encoding, I wrote a function to label them
def encode_category_dic(dataframe):
def h(dica, columnName):
labels = dataframe[columnName].astype(
'category').cat.categories.tolist()
encode_dic = {columnName: {k: v for k, v in zip(
labels, list(range(1, len(labels)+1)))}}
return dict(dica, **encode_dic)
return h
def foldleft(func, acc, xs):
return functools.reduce(func, xs, acc)
# usage
category_columns = [
"neighbourhood_group_cleansed",
"property_type",
"room_type",
"bed_type",
"cancellation_policy"
]
category_encoder = trans.encode_category_dic(airbnb)
category_dic = trans.foldleft(category_encoder, {}, category_columns)
airbnb = airbnb.replace(category_dic)
Training the model
Models like SVCRegressor, LinearRegression and DecisionTree produced very bad results. R2 score ranged from 0.09-0.12
In the end, RandomForestRegressor produced the best model, R2 score ~0.8
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
y = airbnb["price"]
X = airbnb.drop(columns=['price'])
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rnd_regr = RandomForestRegressor(
min_samples_leaf=4,
random_state=0,
n_estimators=100)
rnd_regr.fit(x_train, y_train)
Encode model and upload to S3
import s3fs
with s3.open('airbnb-barcelona/models/rnd_reg.price', 'wb') as f:
joblib.dump(rnd_regr, f)