Exploring Zillow’s Housing Data in Python

Vidhi Chugh
Sep 25, 2023


Introduction

In our previous article, we discussed the significance of web scraping, the limitations that surface with traditional tools, and the need for a modern tool such as Bright Data’s Scraping Browser.

Great, so we have access to the web-scraped data from Zillow. This dataset can be requested on Bright Data’s dashboard under the “Dataset Marketplace” tab.

In this article, we will perform data preprocessing on this real estate data using Python and derive valuable insights.

Additionally, while we go through the nuances of data cleaning and exploratory data analysis (EDA), we’ll briefly introduce Bright Data’s ready-to-use datasets — a resource that every data enthusiast should know.

The Importance of Data Cleaning

Data, and web-scraped data in particular, comes with its share of irregularities: missing values, inconsistencies, outliers, and sometimes outright errors. These can skew the results of any analytical exercise, so it is crucial to ensure data quality before jumping into analysis.

  • Quality Assurance: Raw data can be messy. Cleaning the dataset is the preliminary step to ensure it is of high quality and free of errors that could mislead the analysis.
  • Efficiency: A cleaned dataset makes for easier analysis, reducing the time required to derive valuable insights.

Preprocessing and Cleaning the Scraped Data

  • Depending on the nature of the dataset, there are various strategies for handling missing values, such as imputation, leveraging ML algorithms, or even dropping columns when only a small fraction of their values are present.
  • Statistical methods such as the IQR rule or the Z-score can identify and handle outliers (see the short sketch after this list).
  • For datasets with variables on different scales, normalization and standardization are particularly important.
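
As a minimal illustration of the IQR rule and Z-score standardization, here is a sketch on a toy pandas Series (the values are made up purely for demonstration):

import pandas as pd

# Toy column of house prices (hypothetical values, for illustration only)
prices = pd.Series([250_000, 310_000, 295_000, 280_000, 2_500_000, 265_000])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers)  # flags the 2,500,000 listing

# Z-score standardization: rescale to mean 0 and standard deviation 1
standardized = (prices - prices.mean()) / prices.std()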

EDA

EDA is about understanding the key characteristics of the data — its structure, outliers, patterns, and anomalies. It offers a macro view of the data, including overarching patterns, and helps with informed decision-making grounded in evidence and insights.

Statistical and Visual Insights into Housing Data

  • Descriptive statistics such as the mean, median, mode, variance, and standard deviation summarize the dataset’s characteristics quickly.
  • Histograms, scatter plots, box plots, and heat maps are some of the go-to ways to visualize the data. For instance, a scatter plot might reveal how variables like location and number of bedrooms impact the price of houses on Zillow (a short example follows this list).
  • As we dive deeper, we will uncover patterns like seasonality in housing prices or the influence of external factors on price fluctuations.
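
As a quick illustration, assuming the DataFrame df loaded in the next section contains numeric columns such as price and livingArea (these column names are assumptions about the scraped Zillow schema), a summary table and a scatter plot take only a few lines:

import matplotlib.pyplot as plt

# Numeric summary: count, mean, std, min, quartiles, and max for every numeric column
print(df.describe())

# Scatter plot of living area against price (column names assumed; adjust to your export)
plt.figure(figsize=(8, 6))
plt.scatter(df['livingArea'], df['price'], alpha=0.4)
plt.xlabel('Living area (sq ft)')
plt.ylabel('Price (USD)')
plt.title('Price vs. living area')
plt.show()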

Time to Hands-On!

Preprocessing is a core component of any data science project, especially in machine learning. It is also typically the most time-consuming step of the model-building process.

The model’s performance can be improved by exploring various data aspects, such as:

  1. Removing columns that do not add value, such as URLs
  2. Dealing with missing values and visualizing them
  3. Identifying and replacing outliers
  4. Feature engineering to create meaningful variables
  5. Imputing missing values

We’re keeping the operations basic to make it easier for those new to data science or those needing a quick refresher.

Import Libraries

First, let’s import the libraries we’ll need.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read the Data

Here, we read the data using pandas. We load only the first 1,000 rows (nrows=1000) to keep the walkthrough light; adjust the file path, and drop the nrows argument to load the full file.

# Read the data
df = pd.read_csv("zillow_data.csv", nrows=1000)
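
Before touching anything, a quick look at the shape and data types confirms the load worked as expected (a minimal sanity check):

# Quick sanity checks on the loaded DataFrame
print(df.shape)      # (number of rows, number of columns)
print(df.dtypes)     # data type of each column
print(df.head())     # preview the first five rows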

Remove Columns Containing URLs

The first step is to remove any columns that contain URLs, since they are generally not useful for machine learning. In the code below, such columns are detected by checking whether any of their values contain the string “https”.

cols_to_drop = []
for col in df.columns:
    if df[col].dtype == 'object':
        if df[col].astype(str).str.contains('https', case=False, na=False).any():
            cols_to_drop.append(col)

df.drop(columns=cols_to_drop, inplace=True)

Plotting Missing Values

Plot a bar chart to visualize the number of non-missing values for each column. A red horizontal line is drawn at 60% of the total row count so that columns with many missing values can be spotted at a glance.

non_missing_values = df.count()
plt.figure(figsize=(15, 10))
plt.bar(non_missing_values.index, non_missing_values.values)
plt.axhline(y=len(df) * 0.6, color='r', linestyle='-')
plt.title('Number of Non-Missing Values in Each Column')
plt.ylabel('Number of Non-Missing Values')
plt.xlabel('Columns')
plt.xticks(rotation=90)
plt.show()
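
If you prefer numbers to a chart, the same information can be pulled directly (an optional sketch):

# Fraction of missing values per column, worst offenders first
missing_pct = df.isna().mean().sort_values(ascending=False)
print(missing_pct.head(10))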

Remove Columns with High Missing Values

Remove columns with more than 40% missing values. In pandas, dropna’s thresh argument is the minimum number of non-missing values a column must have to be kept, so setting it to len(df) * 0.6 keeps only columns that are at least 60% populated:

thresh = len(df) * 0.6
df.dropna(thresh=thresh, axis=1, inplace=True)
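
It can be useful to record how many columns survive the threshold (a quick optional check):

# How many columns remain after enforcing at least 60% non-missing values
print(f"{df.shape[1]} columns remain, each with at least {int(len(df) * 0.6)} non-missing values")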

Replace Outliers

Replace outliers in numeric columns with the column median, using the IQR rule to flag them. For text (object) columns, where outliers are not meaningful, the function simply fills missing values with the mode.

# Step 4: Replace outliers
def replace_outliers(series):
    if series.dtype == 'object':
        # Text columns: no notion of outliers, just fill missing values with the mode
        mode_val = series.mode()[0]
        return series.fillna(mode_val)
    else:
        # Numeric columns: flag values outside the 1.5 * IQR fences and replace them with the median
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        median_val = series.median()
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        return series.apply(lambda x: median_val if x < lower_bound or x > upper_bound else x)

for col in df.columns:
    df[col] = replace_outliers(df[col])
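
If you want a sense of how aggressive this step is, the flagged values per numeric column can be counted first (an optional sketch; run it before the replacement loop above):

# Count IQR-rule outliers per numeric column (run before the replacement loop)
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    fence = 1.5 * (q3 - q1)
    n_outliers = ((df[col] < q1 - fence) | (df[col] > q3 + fence)).sum()
    if n_outliers:
        print(f"{col}: {n_outliers} outliers flagged")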

Feature Engineering and Dropping Unhelpful Columns

Here, we generate new features (e.g., ‘home_age’) and drop columns that are not likely to be useful in the model’s learning process.

# Step 5: Feature engineering and dropping unhelpful columns
# Add a new feature: age of the home (the article was written in 2023)
df['home_age'] = df['yearBuilt'].apply(lambda x: 2023 - x if pd.notna(x) else np.nan)
# Drop columns not useful for machine learning
cols_to_drop = [ 'timestamp', 'zpid', 'streetAddress',
'isListingClaimedByCurrentSignedInUser', 'isCurrentSignedInAgentResponsible',
'isCurrentSignedInUserVerifiedOwner', 'isVerifiedClaimedByCurrentSignedInUser',
'listingDataSource', 'hasBadGeocode', 'currency', 'hideZestimate',
'dateSoldString', 'country', 'ssid', 'description', 'parcelId',
'taxHistory', 'priceHistory', 'nearbyHomes', 'schools', 'citySearchUrl',
'mortgageRates', 'isInstantOfferEnabled', 'zillowOfferMarket', 'county',
'isRentalListingOffMarket', 'nearbyCities', 'nearbyNeighborhoods',
'nearbyZipcodes', 'abbreviatedAddress', 'daysOnZillow', 'brokerageName',
'propertyTypeDimension', 'hdpTypeDimension', 'timeZone', 'tourEligibility',
'photos', 'zestimate_history', 'resoFacts:sewer', 'resoFacts:waterSource',
'utilities', 'city', 'state', 'address', 'livingAreaUnits', 'selfTour',
'rentalApplicationsAcceptedType', 'lotAreaUnits', 'livingAreaUnitsShort']
df.drop(columns=cols_to_drop, errors='ignore', inplace=True)
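
Feature engineering can go further than a single derived column. For example, assuming the scrape includes price and livingArea columns (an assumption about the schema, not a guarantee), a price-per-square-foot feature is often informative:

# Hypothetical extra feature: price per square foot
# 'price' and 'livingArea' are assumed column names; adjust to your export
if {'price', 'livingArea'}.issubset(df.columns):
    df['price_per_sqft'] = df['price'] / df['livingArea'].replace(0, np.nan)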

Impute Missing Values in Remaining Columns

Finally, impute the missing values in the remaining columns with the mode (for text columns) and the median (for numeric columns).

# Impute missing values in the remaining columns
for col in df.columns:
    if df[col].dtype == 'object':
        # Text columns: fill with the most frequent value
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        # Numeric columns: fill with the median
        df[col] = df[col].fillna(df[col].median())
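
A final check confirms that nothing was missed (a quick optional assertion):

# Verify that no missing values remain
assert df.isna().sum().sum() == 0, "Some columns still contain missing values"
print(df.shape)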

And there you have it — a preprocessed dataset ready for machine learning.

Bright Data Datasets

Besides providing tools to scrape data from the web, Bright Data also offers ready-to-use datasets similar to the one used in this tutorial.

At the time of writing, the marketplace listing showed 11.3 million records with 88 attributes for the Zillow dataset (source: https://brightdata.com/cp/datasets).

In pursuing the most reliable and actionable datasets, it’s crucial to clarify why we settled on utilizing Bright Data datasets, especially when numerous providers are in the market. Here’s a breakdown of our decision-making process:

  • Data Freshness: Bright Data stands out for keeping its data current. In a constantly evolving world, data freshness isn’t just a luxury; it’s a necessity.
  • Accessibility: When it comes to data retrieval, time is often of the essence. The Bright Data datasets are extensive and readily accessible, ensuring we can promptly act on the insights they offer.
  • Unparalleled Commercial Insights: The information embedded within Bright Data’s datasets isn’t just any data; it’s real-time commercial information. Such data is valuable and often challenging to source from other providers.
  • Precision in Location-Based Analyses: Our analyses often require current locational data for precise outcomes. The datasets from Bright Data are tailored to provide that specificity, ensuring more accurate results in our location-centric studies.

Furthermore, it’s crucial to emphasize that maintaining datasets as comprehensive and current as Bright Data’s would be a monumental task using traditional methods. Without advanced web scraping tools at our disposal, it would be nearly impossible to replicate the richness and timeliness of the data they provide. Thus, our choice is not just about quality but also about the feasibility of data collection in today’s fast-paced digital landscape.

This article highlights the importance of various steps in the data processing pipeline to extract the maximum value from raw web-scraped data.

Stay tuned for our next article in the series, where we’ll go beyond analysis and venture into predictive modeling!

