Unveiling Hotel Pricing in Yerevan: A Data Science Journey
Introduction
Have you ever wondered what factors influence the price of hotel rooms? As a data enthusiast, I embarked on a journey to uncover the secrets behind hotel pricing in Yerevan, Armenia. By diving deep into a dataset of hotel attributes and customer ratings, I aimed to discover the key drivers of hotel prices and provide insights that can benefit both hotel managers and travelers.
About Dataset
The dataset used in this analysis includes various attributes of hotels in Yerevan, such as star ratings, customer reviews (staff, facilities, location, comfort, cleanliness), and the presence of amenities like free parking, fitness centers, spas, and airport shuttles. The primary goal was to analyze how these factors correlate with the price per day of a hotel room.
Key Questions
To structure the analysis, I focused on the following key questions:
- What is the distribution of hotel prices in Yerevan?
- How do customer ratings correlate with hotel prices?
- Which amenities influence hotel prices the most?
- How does the star rating of hotels affect their pricing?
- What are the most significant factors affecting hotel prices?
Let’s dive into the Analysis
Step 1. Importing required libraries
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2. Reading Data
# Location of Data
path = r'Yerevan-Hotels.csv'
YH_Data = pd.read_csv(path)

YH_Data.info() # Information about data
YH_Data.head() # Shows only first five rows of the data
YH_Data.tail() # Shows only last five rows of the data
YH_Data # Shows full data
YH_Data.columns # State the column names
YH_Data.shape # Gives the number of rows and columns in the data
YH_Data.dtypes # Gives the type of data entries in each column
Step 3. Cleaning and Preprocessing Data
YH_Data.isnull().sum() # To check for null values
# Converting the required data types
YH_Data['Star Rating'] = pd.to_numeric(YH_Data['Star Rating'], errors = 'coerce')
YH_Data['Rating'] = pd.to_numeric(YH_Data['Rating'], errors = 'coerce')
YH_Data['Staff'] = pd.to_numeric(YH_Data['Staff'], errors = 'coerce')
YH_Data['Facilities'] = pd.to_numeric(YH_Data['Facilities'], errors = 'coerce')
YH_Data['Location'] = pd.to_numeric(YH_Data['Location'], errors = 'coerce')
YH_Data['Comfort'] = pd.to_numeric(YH_Data['Comfort'], errors = 'coerce')
YH_Data['Cleanliness'] = pd.to_numeric(YH_Data['Cleanliness'], errors = 'coerce')

YH_Data.info() # After Cleaning
YH_Data.describe() # Statistical Analysis of data
YH_Data.corr(numeric_only = True) # Correlation (numeric columns only)
YH_Data.cov(numeric_only = True) # Covariance (numeric columns only)

# Convert Categorical columns with binary values to numerical
Cat_Col = ['Free Parking', 'Fitness Centre', 'Spa and Wellness Centre', 'Airport Shuttle']
for col in Cat_Col:
    YH_Data[col] = YH_Data[col].apply(lambda x : 1 if x == 'Yes' else 0)

YH_Data # Full data after processing
YH_Data.dropna(inplace = True) # Dropping the missing values
YH_Data.shape # The number of rows and columns in the data after dropping the missing values
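One thing to watch in the encoding step: the `apply` call maps anything that is not 'Yes' to 0, so a missing amenity value is treated as "No" and is never caught by `dropna`. A minimal sketch of an alternative, assuming the amenity columns contain only 'Yes', 'No', or missing values (the `sample` frame below is hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample column; the real data is assumed to hold 'Yes'/'No' strings
sample = pd.DataFrame({'Free Parking': ['Yes', 'No', np.nan, 'Yes']})

# Approach used in the article: every value other than 'Yes' becomes 0
encoded_apply = sample['Free Parking'].apply(lambda x: 1 if x == 'Yes' else 0)

# Alternative: map only the known labels, so missing values stay missing
encoded_map = sample['Free Parking'].map({'Yes': 1, 'No': 0})

print(encoded_apply.tolist())
print(encoded_map.tolist())
```

With `map`, a later `dropna` can still remove rows where the amenity was never recorded, instead of counting them as hotels without that amenity.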
Step 4. Training a Model to Predict Hotel Prices
To predict hotel prices from the available data, I trained a regression model.
Data Preprocessing
# Feature Selection
Features = ['Star Rating', 'Rating', 'Staff', 'Facilities', 'Location', 'Comfort', 'Cleanliness', 'Free Parking', 'Fitness Centre', 'Spa and Wellness Centre', 'Airport Shuttle']

# Split data into Training and Testing sets
X = YH_Data[Features]
Y = YH_Data['Price Per Day($)']

# Train Test Split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

# Standardize Data
Scaler = StandardScaler()
X_train = Scaler.fit_transform(X_train)
X_test = Scaler.transform(X_test)
Model Training
# Create Model
Model = RandomForestRegressor(n_estimators = 100, random_state = 42)

# Train Model
Model.fit(X_train, Y_train)
Evaluation
# Predictions
Predict = Model.predict(X_test)

# Evaluate Model
MAE = mean_absolute_error(Y_test, Predict)
MSE = mean_squared_error(Y_test, Predict)
RMSE = np.sqrt(MSE)
R2 = r2_score(Y_test, Predict)

# Printing Results
print(f'Mean Absolute Error: {MAE}')
print(f'Mean Squared Error: {MSE}')
print(f'Root Mean Squared Error: {RMSE}')
print(f'R-Squared Score: {R2}')
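A single 80/20 split can give noisy metrics, especially if the dataset is small. One way to get a steadier estimate is k-fold cross-validation. The sketch below is not part of the original analysis: it runs on synthetic stand-in data (`X_demo` and `Y_demo` are hypothetical), and the choice of five folds is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical); swap in X and Y from the steps above
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'Star Rating': rng.integers(2, 6, 100),
    'Rating': rng.uniform(6, 10, 100),
})
Y_demo = 20 * X_demo['Star Rating'] + rng.normal(0, 15, 100)

# Five shuffled folds; each fold serves as the test set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42)

r2_scores = cross_val_score(model, X_demo, Y_demo, cv=cv, scoring='r2')
# scikit-learn reports MAE as a negative score, so flip the sign
mae_scores = -cross_val_score(model, X_demo, Y_demo, cv=cv,
                              scoring='neg_mean_absolute_error')

print(f'R-Squared per fold: {np.round(r2_scores, 3)}')
print(f'Mean R-Squared: {r2_scores.mean():.3f} (std {r2_scores.std():.3f})')
print(f'Mean MAE: {mae_scores.mean():.3f}')
```

A large spread between folds would suggest that the score from any single split depends heavily on which hotels happen to land in the test set.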
Feature Importance
# Feature Importance
Importance = Model.feature_importances_
Feature_Names = X.columns
print(Feature_Names)

Feature_Importance = pd.DataFrame({'Feature' : Feature_Names, 'Importance' : Importance})
Feature_Importance = Feature_Importance.sort_values(by = 'Importance', ascending = False)

# Printing Values
print(Feature_Importance)
Step 5. Exploratory Data Analysis {EDA}
a). Feature Importance: To understand the importance of each feature in the prediction, we can visualize the feature importances.
b). Distribution of Hotel Prices: Understanding the distribution of hotel prices provides a foundation for the analysis. Using a histogram with a kernel density estimate (kde) plot, we can visualize the spread of hotel prices, identify common price ranges, and spot any outliers.
c). Customer Ratings Correlation: To understand how customer ratings correlate with hotel prices, a heatmap of the correlation matrix shows the relationships between the individual ratings and the price per day.
d). Influence of Amenities on Prices: Amenities can play a significant role in determining hotel prices. By comparing the price distributions for hotels with and without specific amenities, we can identify which amenities are associated with higher prices.
e). Star Rating Vs. Price: Star ratings are a direct indicator of hotel quality and luxury. By visualizing the distribution of prices across different star ratings using box plots, we can see how prices vary with the quality level.
f). Significant Factors Affecting Prices: A heatmap of each factor's correlation coefficient with the price highlights which factors are most strongly associated with it. Correlation alone does not show that a factor drives pricing decisions.
Insights and Findings
From the model training and visualization, we gained several valuable insights:
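- Model Performance: The Random Forest Regressor did not generalize well to the held-out test set. The negative R-squared score below means its predictions were less accurate than always predicting the mean test price, so the feature importance ranking should be read as exploratory rather than conclusive.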
- Mean Absolute Error: 28.165624999999995
- Mean Squared Error: 1701.803128125
- Root Mean Squared Error: 41.25291660143559
- R-Squared Score: -0.47330801784222687
- Distribution of Hotel Prices: The histogram shows the distribution of hotel prices in Yerevan, providing insights into the common price ranges and the spread of prices.
- Customer Ratings Correlation: The heatmap shows the correlation between customer ratings and hotel prices. Higher correlation values indicate stronger relationships.
- Influence of Amenities on Prices: The KDE plots show how the presence of different amenities (free parking, fitness center, spa, and airport shuttle) influences hotel prices.
- Effect of Star Rating on Pricing: The box plot shows how the star rating of hotels affects their pricing, with higher star ratings typically corresponding to higher prices.
- Significant Factors Affecting Prices: The heatmap highlights the most significant factors affecting hotel prices, as indicated by higher correlation coefficients with the price.
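To put the reported metrics in context, recall that R-squared compares the model's squared error with that of a baseline that always predicts the mean of the test prices: R² = 1 - MSE_model / MSE_baseline. The short calculation below uses only the two figures reported above to recover the implied baseline error. The variable names are illustrative.

```python
import math

# Metrics as reported in the findings above
mse = 1701.803128125
r2 = -0.47330801784222687

# Rearranging R^2 = 1 - MSE_model / MSE_baseline gives the baseline's error,
# where the baseline always predicts the mean of the test prices
baseline_mse = mse / (1 - r2)
baseline_rmse = math.sqrt(baseline_mse)
model_rmse = math.sqrt(mse)

print(f'Model RMSE: {model_rmse:.2f}')
print(f'Mean-baseline RMSE: {baseline_rmse:.2f}')
print(f'Model MSE / baseline MSE: {mse / baseline_mse:.2f}')
```

The implied baseline RMSE comes out at roughly 34 dollars, against about 41 dollars for the model, which is what a negative R-squared score expresses: on this test split, the model's squared error is about 1.47 times that of the constant prediction.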
Conclusion
By leveraging data science techniques, we can explore the factors associated with hotel prices. These insights can help hotel managers optimize their pricing strategies and provide better value to their customers. Travelers can also benefit by understanding which factors to consider when booking a hotel.
If you’re interested in the detailed analysis or want to explore the code, check out the project on GitHub!
Link: https://github.com/thesupriyanagpal/Yerevan_Hotels_Reservation_Prices_Analysis
Author: Supriya Nagpal — {Data Scientist and Mathematics Enthusiast}