Case Study: Housing Price Prediction on Zillow’s Data

Sep 25, 2023

Introduction

Welcome to the final article in our three-part series on leveraging Zillow’s housing data for various analytical and predictive tasks. In the first article, we covered the importance and applications of web scraping, along with some tools such as Bright Data’s scraping browser. The second article walked you through various preprocessing steps to clean, filter, and prepare your data for machine learning.

In this final post, we will explore, build, and evaluate a predictive model using the preprocessed housing data from Zillow. By the end of this article, you will gain an end-to-end understanding of how to approach a real-world data science project.

Introduction to Machine Learning

Machine learning is a branch of artificial intelligence that enables computers to learn from data. Unlike traditional rule-based systems, machine learning systems can improve their performance as they are trained on more data. There are many types of machine learning, but for this article, we will focus on supervised learning, which involves learning a function that maps inputs to outputs from the input-output pairs shown during training.

We will use Python’s scikit-learn library, which offers a wide range of algorithms and is widely used for machine learning tasks.
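
To make the idea of learning a function from input-output pairs concrete, here is a minimal sketch using scikit-learn. The data is made up for illustration (it follows y = 2x + 1) and has nothing to do with the Zillow dataset; the name `toy_model` is our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy input-output pairs (invented for illustration): y = 2x + 1
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

# "Training" means estimating the mapping from inputs to outputs
toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# The fitted model can then predict the output for an unseen input
prediction = toy_model.predict(np.array([[10.0]]))[0]
print(round(prediction, 2))
```

The same fit and predict pattern is what we apply to the housing data later in this article.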

Exploring the Pre-processed Dataset

Univariate Analysis

Univariate analysis gives us a summary and insights into each variable in the dataset. It is the simplest form of data analysis, focusing on only one variable at a time.

Histograms, box plots, and summary statistics are commonly used for numerical variables, while frequency tables and bar plots are employed for categorical variables.
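
Before plotting, it can help to look at the numbers directly. The sketch below shows summary statistics and a frequency table with pandas. It uses a small made-up DataFrame (`toy_df`) so that it runs on its own; with the real data you would call the same methods on `df`.

```python
import pandas as pd

# Small made-up sample standing in for the real Zillow DataFrame
toy_df = pd.DataFrame({
    "price": [450000, 780000, 0, 1200000, 610000],
    "homeType": ["SINGLE_FAMILY", "CONDO", "SINGLE_FAMILY",
                 "SINGLE_FAMILY", "TOWNHOUSE"],
})

# Summary statistics for a numerical variable (count, mean, std, quartiles)
price_summary = toy_df["price"].describe()
print(price_summary)

# Frequency table for a categorical variable
home_type_counts = toy_df["homeType"].value_counts()
print(home_type_counts)
```

A minimum of zero in the summary is the kind of signal that prompts a closer look at the target column.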

import matplotlib.pyplot as plt
import seaborn as sns
# For numerical variables - "price" (variable of interest or target variable)
# df is the preprocessed DataFrame from the previous article
sns.histplot(df['price'], kde=True)
plt.title('Distribution of Price')
plt.show()

The distribution of house prices shows quite a few houses with a price of zero, which are probably missing values that were replaced by zero.

Let’s remove them before we proceed.

df = df[df['price']!=0.0]

Let’s plot the distribution once again.

Most houses are priced between half a million and a million dollars, while the most expensive houses are on the order of two million dollars or more.

Let’s understand the distribution of home types using a bar chart.

sns.countplot(data=df, x='homeType')
plt.title('Distribution of Home Types')
plt.xlabel('Home Type')
plt.ylabel('Count')
plt.xticks(rotation = 45)
plt.show()

Most homes are single-family homes, which is consistent with the distribution of the number of bedrooms.

Bivariate Analysis

Bivariate analysis examines the relationship between two variables and can be helpful in hypothesis testing.

Scatter plots are often used to understand the association between two variables when both are numerical. In contrast, box plots or violin plots are frequently used to analyze the relationship between a categorical and a numerical variable.

Let’s first understand how the price of a house varies with its living area; we would expect a bigger house to be more expensive. This relationship can be examined using a scatter plot, as both quantities are numeric.

# Numerical vs. Numerical
sns.scatterplot(x='livingArea', y='price', data=df)
plt.title('Living Area vs. Price')
plt.show()
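
A scatter plot shows the shape of the relationship; a single number can complement it. As a rough sketch, the Pearson correlation coefficient between two numeric columns can be computed with pandas. The values below are invented for illustration; on the real data you would use `df['livingArea']` and `df['price']`.

```python
import pandas as pd

# Made-up living areas (sq ft) and prices, for illustration only
toy_df = pd.DataFrame({
    "livingArea": [800, 1200, 1500, 2000, 2600],
    "price": [300000, 420000, 510000, 690000, 880000],
})

# Pearson correlation: +1 is a perfect increasing linear relationship,
# -1 a perfect decreasing one, 0 means no linear relationship
r = toy_df["livingArea"].corr(toy_df["price"])
print(f"Pearson r between livingArea and price: {r:.3f}")
```

Keep in mind that this coefficient only captures linear association, so it is best read alongside the plot rather than instead of it.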

We can also observe the relationship between “Home Type” and “Price” using a boxplot below. Here, “Home Type” is categorical, and “Price” is of numerical type.

# Categorical vs. Numerical
sns.boxplot(x='homeType', y='price', data=df)
plt.title('Home Type vs. Price')
plt.show()
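
The box plot can be backed up with a simple table of the median price per home type. This is a minimal sketch on made-up data; the category labels and prices are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd

# Made-up sample for illustration only
toy_df = pd.DataFrame({
    "homeType": ["SINGLE_FAMILY", "SINGLE_FAMILY", "CONDO",
                 "CONDO", "TOWNHOUSE"],
    "price": [900000, 700000, 400000, 500000, 600000],
})

# Median price per category, highest first
median_price_by_type = (
    toy_df.groupby("homeType")["price"]
    .median()
    .sort_values(ascending=False)
)
print(median_price_by_type)
```

The median is used here instead of the mean because house prices tend to be skewed by a few very expensive properties.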

Multivariate Analysis

Multivariate analysis aims to understand how variables interact. This type of analysis is beneficial when working with complex datasets where multiple variables might interact in unpredictable ways.

For numerical variables, a correlation plot is a good starting point. The correlation coefficient summarizes the direction and strength of the linear relationship between two variables.

# Heatmap for numerical variables
# numeric_only=True skips text columns such as homeType (needed in pandas 2.0+)
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

We observe a very high correlation between “price” and both “zestimate” and “zestimateMinus30”. There is also a mild correlation between “price” and the other “estimate” columns. These columns are probably the output of a model that tries to estimate the home’s value.

Let’s remove them, as they could introduce data leakage: a feature that is itself an estimate of the price would let the model look accurate without learning anything useful.

We also remove ‘livingAreaValue’ and ‘yearBuilt’ as redundant columns.

# Drop highly price-correlated columns (mostly added by Zillow models)
cols_to_drop = ['zestimate', 'rentZestimate', 'zestimateMinus30',
 'restimateMinus30', 'zestimateLowPercent',
 'zestimateHighPercent', 'restimateLowPercent', 'zestimate_history',
 'hideZestimate', 'restimateHighPercent',
 'livingAreaValue', 'yearBuilt']
df.drop(columns=cols_to_drop, errors='ignore', inplace=True)
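
One way to spot leakage candidates programmatically is to flag every numeric column whose absolute correlation with the target exceeds some threshold. The sketch below is only an illustration: the data is made up, and the 0.95 threshold is an arbitrary assumption that you would tune for your own dataset.

```python
import pandas as pd

# Made-up sample: "zestimate" is almost a copy of "price"
toy_df = pd.DataFrame({
    "price": [300000, 420000, 510000, 690000, 880000],
    "zestimate": [305000, 415000, 515000, 680000, 890000],
    "bedrooms": [2, 3, 2, 4, 3],
})

threshold = 0.95  # arbitrary cut-off, chosen for illustration

# Absolute correlation of every numeric column with the target
corr_with_price = (
    toy_df.corr(numeric_only=True)["price"].drop("price").abs()
)

# Columns that track the target suspiciously closely
suspect_columns = corr_with_price[corr_with_price > threshold].index.tolist()
print(suspect_columns)
```

A high correlation alone does not prove leakage, so flagged columns still need a manual check of where their values come from.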

The correlation matrix now looks like the one below.

To understand this relationship in detail, you can use a pair plot, which provides a visual representation of the nature of this relationship.

# Pairplot for selected variables
selected_vars = ['price', 'bedrooms', 'bathrooms', 'livingArea']
sns.pairplot(df[selected_vars])
plt.show()

These are some basic EDA techniques that one can apply based on the dataset’s nature and the business problem. You can also use more advanced techniques and plots to understand the data better.

EDA provides valuable insights that help us make more informed decisions during model building and validation!

Constructing a Predictive Model

We’ll use a Linear Regression algorithm to predict housing prices (note that ‘price’ is the target variable).

Here is how to do it:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# One-hot encode the categorical columns
categorical_columns = ['homeStatus', 'homeType'] 
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
X = df.drop(['price'], axis=1) # Features
y = df['price'] # Target variable
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Initialize the model
model = LinearRegression()
# Fit the model
model.fit(X_train, y_train)
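
If one-hot encoding with `drop_first=True` is new to you, the following sketch shows what `pd.get_dummies` does on a tiny made-up frame. The category labels are illustrative assumptions.

```python
import pandas as pd

# Made-up sample for illustration only
toy_df = pd.DataFrame({
    "price": [400000, 800000, 600000],
    "homeType": ["CONDO", "SINGLE_FAMILY", "TOWNHOUSE"],
})

# Each category becomes its own indicator column; drop_first=True drops
# the first category (here CONDO), which becomes the implicit baseline
encoded = pd.get_dummies(toy_df, columns=["homeType"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one category avoids perfectly collinear indicator columns, which matters for linear models such as the one used here.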

Model Evaluation

We evaluate the model using MAPE (Mean Absolute Percentage Error). A lower MAPE usually signifies a better fit, though it is crucial to consider the model’s complexity and the problem’s specifics.

The code below computes the MAPE on the test set:

import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Make predictions
y_pred = model.predict(X_test)
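# Evaluate the model
# Note: scikit-learn returns MAPE as a fraction (e.g. 0.05 for 5%),
# so we multiply by 100 before printing it as a percentage
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"The model's MAPE is: {mape * 100:.2f}%")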

The model’s MAPE is: 3.18%

A MAPE value of 3.18% is relatively low, i.e., on average the model’s predictions are within about 3% of the actual house prices.
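
To see what the metric measures, here is a small worked sketch that computes MAPE by hand, as the mean of |actual − predicted| / |actual|, and compares it with scikit-learn’s result. The prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up actual and predicted prices, for illustration only
y_true = np.array([500000.0, 750000.0, 1000000.0])
y_hat = np.array([520000.0, 720000.0, 1000000.0])

# MAPE by hand: mean of the absolute relative errors
manual_mape = np.mean(np.abs((y_true - y_hat) / y_true))

# scikit-learn returns the same quantity as a fraction, not a percentage
sklearn_mape = mean_absolute_percentage_error(y_true, y_hat)

print(f"Manual MAPE:  {manual_mape * 100:.2f}%")
print(f"sklearn MAPE: {sklearn_mape * 100:.2f}%")
```

Because the errors are divided by the actual values, MAPE is undefined or unstable when actual prices are zero or close to zero, which is one more reason the zero-priced rows were removed earlier.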

Bright Data’s datasets are useful for this kind of project, as they provide data in a structured and easy-to-use form while retaining the relevant details from the original source.

Understanding the Benefits of Real-Time Analysis

Before we wrap up, it’s worth reflecting on the broader impact of the methods and tools discussed in this article. Beyond mere theory, one might wonder how mastering these web scraping techniques, and understanding their challenges, reshapes our approach to data-driven decisions.

Real Estate Agents: In today’s digital age, staying updated with real-time property market trends is no longer a luxury — it’s a necessity. Real estate agents can be empowered by employing the strategies and tools highlighted in this article. They’re equipped to spot emerging property markets, offering their clients the best deals and insights that are miles ahead of competitors who still rely on dated methods.
Investors: Investors thrive on accurate, timely information. Traditional methods often lag, missing out on golden investment windows. With the techniques learned from this article, investors can ensure they’re always a step ahead. They can make more informed and lucrative investment choices by identifying undervalued regions or understanding the potential of specific property types.

Whether you’re a seasoned professional in the property market or an enthusiast, mastering these methods ensures you’re better positioned to navigate the ever-evolving digital landscape.

The Indispensable Value of Web Data

Before we conclude this series, it’s essential to underscore the unparalleled value web data offers in today’s business milieu, irrespective of whether the enterprise operates predominantly online or offline.

A Modern-Day Business Catalyst: Web data is no longer just a technical asset; it has cemented itself as a cornerstone in our contemporary ecosystem. It is a compass guiding businesses toward more informed and strategic decisions.
Staying Ahead of the Curve: In the swift currents of the digital age, stagnation equates to regression. Companies or individuals that don’t recognize and harness the potential of web data stand at the precipice of obsolescence, risking their relevance and competitiveness.

Summary

Throughout this three-part series, we embarked on a comprehensive journey across the data science continuum — from the nuances of data collection and preprocessing to the intricacies of exploratory data analysis and predictive modeling. We hope you emerge with a robust understanding of how these techniques seamlessly integrate into real-world contexts, catalyzing tangible impact.

The strategies and code elaborated upon across these articles lay down a solid foundation, paving the way for delving into even more complex analytics projects, including but not limited to recommendation systems and time-series forecasting.

Thank you for accompanying us on this journey. We hope the insights from this series prove valuable and help you in your future projects.