Analyzing Two Decades of U.S. Real Estate Sales: A Python Visualization Approach

Apr 27, 2024

Background

For many years, the U.S. real estate market has served as both a barometer of the country’s economic health and a source of personal wealth. But in order to really comprehend the mechanics of the market, one needs to delve into the data that powers it. The focus dataset, which runs from 2001 to 2021, offers a rich tapestry of data that captures changes in society, policy, and emerging trends in both urban and rural environments, in addition to economic cycles.

Analyzing real estate sales data over the last 20 years reveals a pattern of ups and downs, recovery, and expansion. Property values increased at an extraordinary rate in the early 2000s, leading up to the 2008 housing bubble crash. In the years that followed, there was a gradual but steady recovery. This dataset, which charts the trajectory of real estate sales and values through turbulent times, offers an unvarnished account of these years.

Using Python for data analysis and visualization, the author can extract these insights precisely. Python’s many modules and tools make it possible to process large amounts of data efficiently. Visualization, in turn, brings the data to life, making intricate correlations and trends understandable and accessible to a wider audience.

This investigation goes beyond academic bounds. It has real-world ramifications for investors, legislators, sellers, and prospective buyers. By looking at this data, we may learn a lot about the towns that are becoming major centers of activity, the changes in the most desired property types, and the underlying drivers of the market. This analysis also provides in-depth knowledge of the property tax environment, which is important for community planning and local governments.

Objectives

Several main goals form the basis of this examination of US real estate sales from 2001 to 2021. Using the analytical capabilities of Python and the clarifying power of visualization, the author aims to address the following:

Trend Analysis (2001–2021): Examine the general patterns that have emerged in the real estate industry during the past 20 years. The market’s evolution through several economic cycles, including the 2008 financial crisis and the recovery that followed, will be shown in this long-term study, offering insights on the trajectory of real estate values and sales.
Monthly Trend Dynamics: Seasonality is ingrained in the real estate sector. Examine the monthly patterns that influence the market to determine the seasonal effects on sales volume and property values that may help determine the best approaches for purchasing and selling.
Top 10 Towns by Sales Volume: Use cumulative sales over the previous 20 years to determine which towns stand out in the real estate market. Understanding the dynamism and economic activity of the local market might be aided by identifying these hotspots.
Top 10 Towns by Tax Contributions: The financial importance of real estate within towns is reflected in tax assessment data. To shed light on the fiscal contributors and help readers understand the relationship between real estate values and municipal revenues, the author will rank the top ten towns by total assessed value, the basis on which property taxes are levied.
Distribution of Property Types: A diversity of property types are woven into the American fabric of life. Analyze the distribution of sold property types to get insight into Americans’ evolving lifestyle choices and habits.
Correlation Analysis: The relationships among sale amounts, assessed values, and the passing of time may hint at cause and effect. The analysis measures how strongly these factors are correlated, to see whether they can help anticipate future market behavior.

By pursuing these goals, the author hopes to provide readers with a grounded viewpoint that can help them forecast the future of the U.S. real estate market in addition to helping them comprehend its past and present. The analysis aims to provide insights that will help everyone, from policymakers to prospective homeowners, make better educated decisions in the ever-changing real estate market.

Dataset

The dataset that serves as the foundation for this analysis is an extensive compilation of US property sales records that were painstakingly compiled between 2001 and 2021. This dataset provides a comprehensive overview of market trends and value indicators by incorporating a variety of real estate transaction characteristics and attributes.

Serial Number: A unique identifier for every real estate transaction, which supports accurate data processing and analysis.
List Year: The year the property was put up for sale; this information is useful for comparing market changes from year to year.
Date Recorded: The particular day on which the transaction was registered with the authorities, facilitating accurate time-series analysis and comprehension of monthly trends.
Town: The property’s location’s name, which gives the study a geographical component.
Address: The property’s precise address, which provides each data point with a concrete location inside the large United States map.
Assessed Value: The property’s value as determined by local tax authorities; this information is essential for comprehending tax patterns and how properties are thought to be worth over time.
Sale Amount: The property’s actual sale price, which takes into account both the buyer’s willingness to pay and the property’s true market value.
Sales Ratio: A computed field that expresses the ratio between the assessed value and the sale amount. This ratio can be a useful gauge of the state of the market.
Property Type: This shows the variety of real estate in the dataset by classifying the property as Residential, Commercial, Industrial, or other sorts.
Residential Type: A more thorough categorization of real estate that makes distinctions between, among other things, single-family homes, apartments, and multi-family dwellings.

Many statistical studies are possible because the dataset contains both continuous and categorical variables. The author will be able to compute averages, medians, and other statistical metrics thanks to the continuous variables, including “assessed value” and “sale amount.” On the other hand, mode calculations and frequency analyses will be supported by categorical variables such as “Town” and “Property Type.”
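The distinction between continuous and categorical variables can be illustrated with a minimal sketch. The rows below are hypothetical toy values, not records from the dataset; only the column names follow the field list above.

```python
import pandas as pd

# Hypothetical toy rows (assumption); the real data comes from the CSV file.
toy = pd.DataFrame({
    "Town": ["Greenwich", "Stamford", "Greenwich", "Danbury"],
    "Property Type": ["Single Family", "Condo", "Single Family", "Two Family"],
    "Assessed Value": [900_000, 250_000, 1_100_000, 180_000],
    "Sale Amount": [1_500_000, 400_000, 1_700_000, 300_000],
})

# Continuous variables: mean, median, and standard deviation.
continuous_summary = toy[["Assessed Value", "Sale Amount"]].agg(["mean", "median", "std"])

# Categorical variables: frequency table and mode.
town_counts = toy["Town"].value_counts()
most_common_type = toy["Property Type"].mode()[0]

print(continuous_summary)
print(town_counts)
print(most_common_type)
```

On skewed price data the median is often a more representative "typical" value than the mean, which is one reason to report both.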

The dataset, comprising more than 700,000 records with numerous variables, offers a strong basis for comprehensive statistical modeling. The data covers a wide range of market categories, from tiny towns to large cities. This diversity supports a broad evaluation of the real estate market, encompassing everything from the quiet lanes of suburban and rural villages to the busy streets of metropolitan areas.

With the use of this dataset, the author hopes to gain insight into the fundamental causes that influence property values and sales, as well as navigate the ups and downs of the real estate market throughout time. By concentrating on aggregated patterns and insights, this will be accomplished while protecting the privacy and confidentiality of individual property owners.

Through careful data curation, cleaning, and analysis, the dataset will reveal the story of the US real estate industry and the relationship between place, time, and value.

EDA

Loading the Data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('/Users/anpabelt/Downloads/Real_Estate_Sales_2001-2021_GL 2.csv')
data
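Before transforming the data, it can help to check its structure. The following is a minimal sketch that reads a hypothetical three-row excerpt from memory instead of the real file; the rows and values are assumptions made for illustration, and only a subset of the columns is included.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the real CSV file (assumption).
sample_csv = io.StringIO(
    "Serial Number,List Year,Date Recorded,Town,Assessed Value,Sale Amount,Property Type\n"
    "1,2020,09/13/2021,Ansonia,150500,325000,Commercial\n"
    "2,2020,10/02/2020,Ashford,253000,430000,Residential\n"
    "3,2001,,Avon,130400,179900,\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)         # (rows, columns)
print(sample.dtypes)        # inferred type of each column
print(sample.isna().sum())  # missing values per column
```

The same three calls can be run on `data` to see which columns contain missing values and which were read as text rather than numbers.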

Drop Columns

data.drop(columns=['Non Use Code', 'Assessor Remarks', 'OPM remarks', 'Location'], inplace=True)
data

Replace the Null Values

data['Property Type'] = data['Property Type'].fillna('Unknown')
data['Residential Type'] = data['Residential Type'].fillna('Unknown')
data['Address'] = data['Address'].fillna('Unknown')
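# 'Date Recorded' is left unfilled: a placeholder such as 'Unknown' would only be
# coerced back to a missing value (NaT) when the column is parsed as a date later on.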

data.isna().sum()

data

Change Data Types

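# Use an explicit 64-bit integer type so that very large amounts cannot overflow
# on platforms where the default integer is 32 bits wide.
data['Assessed Value'] = data['Assessed Value'].astype('int64')
data['Sale Amount'] = data['Sale Amount'].astype('int64')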
data
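A plain integer cast raises an error if a column contains missing or malformed entries. The sketch below shows one possible, more tolerant conversion; the values are hypothetical and this is an alternative, not the approach used in the article.

```python
import pandas as pd

# Hypothetical values, including one missing and one malformed entry (assumption).
raw = pd.Series(["150500.0", "325000", None, "n/a"], name="Sale Amount")

# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")

# The nullable "Int64" dtype keeps missing values as <NA>,
# whereas astype(int) would raise on them.
as_int = numeric.astype("Int64")
print(as_int)
```

The trade-off is that bad entries are silently turned into missing values, so it is worth counting them afterwards with `as_int.isna().sum()`.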

Parse Date to Month

data['Date Recorded'] = pd.to_datetime(data['Date Recorded'], format="%m/%d/%Y", errors='coerce')
data['Month'] = data['Date Recorded'].dt.month

data

Change Month Data Types

data['Month'] = data['Month'].astype('Int64')

data
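The effect of `errors='coerce'` and of the nullable `Int64` type can be seen on a small example. The date strings below are hypothetical; the calls are the same ones used above.

```python
import pandas as pd

# Hypothetical date strings: two valid, one placeholder, one missing (assumption).
dates = pd.Series(["09/13/2021", "01/05/2004", "Unknown", None])

# Entries that do not match the format become NaT instead of raising an error.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# .dt.month yields floats when NaT is present; "Int64" restores whole numbers
# while keeping the missing entries as <NA>.
months = parsed.dt.month.astype("Int64")

print(parsed)
print(months)
```

Rows whose month is `<NA>` are left out automatically when the data is later grouped by month.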

Group by Town

data_groupby_town = data.groupby(["Town"]).agg({
    "Assessed Value": "sum", 
    "Sale Amount": "sum",
    "Sales Ratio": "mean"
}).reset_index()

data_groupby_town
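As a small illustration of what this aggregation produces, the sketch below groups a hypothetical toy table with named aggregations. It also adds a sale count and a ratio of totals, which are additions for illustration and not part of the article's analysis.

```python
import pandas as pd

# Hypothetical toy data (assumption): two towns with a handful of sales each.
toy = pd.DataFrame({
    "Town": ["A", "A", "B", "B", "B"],
    "Assessed Value": [100, 200, 50, 50, 100],
    "Sale Amount": [150, 250, 100, 80, 120],
})

# Named aggregation gives each output column an explicit name.
by_town = toy.groupby("Town").agg(
    total_assessed=("Assessed Value", "sum"),
    total_sales=("Sale Amount", "sum"),
    n_sales=("Sale Amount", "count"),
).reset_index()

# Ratio of totals per town, as opposed to the mean of per-sale ratios.
by_town["assessed_to_sale"] = by_town["total_assessed"] / by_town["total_sales"]
print(by_town)
```

A mean of per-sale ratios can be pulled around by a few unusual transactions, so a ratio of totals may be a steadier town-level summary.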

Visualization and Insight

1. Trend Analysis (2001–2021)

# Group the data by 'List Year' and calculate the sum of 'Sale Amount' for each year.
sales_year = data.groupby('List Year').agg({"Sale Amount": "sum"}).reset_index()

# Plotting the line chart for visualizing the trend of total sales amounts year-on-year.
plt.figure(figsize=(15, 10))  # Set the figure size to 15x10 inches for better visibility.
plt.plot(sales_year['List Year'], sales_year['Sale Amount'], marker='o')  # Plot with 'List Year' on x-axis and 'Sale Amount' on y-axis with circle markers.
plt.title('Yearly Sum of Sale Amounts')  # Title of the chart indicating it displays yearly sale amounts.
plt.xlabel('Year')  # X-axis label indicating the years.
plt.ylabel('Sum of Sale Amount')  # Y-axis label indicating the summed sale amount for each year.
plt.grid(True)  # Enable gridlines for better readability of the chart.
plt.xticks(range(2001, 2022))  # Set x-axis ticks to cover the range of years from 2001 to 2021.
plt.show()  # Display the chart.

The Federal Reserve’s decision to lower interest rates after the 2001 recession may have contributed to the 2002 sales boom. Lower rates make borrowing more affordable, which usually lifts real estate activity and home sales. The expansion that followed is consistent with the housing bubble, in which speculative investment and easy credit drove a significant rise in property values and sales. The inflated market of this period reflected both genuine demand and speculation.

The significant drop in sales that follows is consistent with the bursting of the housing bubble, a major factor in the Great Recession. Falling house prices, foreclosures, and tighter lending requirements hit the real estate industry hard and sharply reduced the volume of transactions. The sales growth since 2009 reflects a slow revival of the market, aided by improving economic conditions, restored consumer confidence, and persistently low loan rates. The recovery was not uniform; some regions recovered more quickly than others.

Low unemployment, persistently low interest rates, and steady economic growth likely account for the growth observed up until 2020. Millennials entering the housing market added to the demand for real estate during this time. The peak in 2021 may largely reflect the exceptional circumstances of the COVID-19 pandemic, during which the real estate market flourished despite the economic slump. Contributing factors included the demand for more living space as people stayed home more often, continued low mortgage rates, and notable migration patterns, such as moves from urban to suburban or rural areas.
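The rises and falls described above can be quantified as year-over-year percentage changes. This is a minimal sketch on hypothetical yearly totals; the numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical yearly totals (assumption), not values from the dataset.
sales_year_toy = pd.DataFrame({
    "List Year": [2006, 2007, 2008, 2009],
    "Sale Amount": [100.0, 120.0, 90.0, 99.0],
})

# pct_change() compares each row with the previous one; the first row has no predecessor.
sales_year_toy["YoY %"] = sales_year_toy["Sale Amount"].pct_change() * 100
print(sales_year_toy)
```

Applied to `sales_year`, the same call would show how steep each yearly swing is relative to the year before.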

2. Monthly Trend Dynamics

# Group the data by 'Month' and sum up the 'Sale Amount' for each month
sales_month = data.groupby('Month').agg({"Sale Amount": "sum"}).reset_index()

# Plotting the line chart to visualize the trend of sales amounts by month
plt.figure(figsize=(10, 6))  # Set the figure size for better visibility
plt.plot(sales_month['Month'], sales_month['Sale Amount'], marker='o')  # Plot with months on x-axis and sales on y-axis with circle markers
plt.title('Monthly Sum of Sale Amounts')  # Add a title to the chart
plt.xlabel('Month')  # Label the x-axis with 'Month'
plt.ylabel('Sum of Sale Amount')  # Label the y-axis with 'Sum of Sale Amount'
plt.grid(True)  # Add gridlines for better readability
plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])  # Set custom x-axis ticks to show month abbreviations
plt.show()  # Display the chart

January and February usually start the year with lower sales volumes, because colder weather and post-holiday financial strain are not ideal for real estate transactions. With fewer buyers looking to relocate and sellers often delaying their listings until spring, this is typically the slowest time of year in the real estate market. Sales rise sharply in March and keep climbing through May. This increase reflects the typical spring buying season, when more properties are listed and buyers are motivated by nicer weather, the chance to move during the summer break from school, and the possibility of using tax refunds for down payments.

June is perhaps the busiest month for home purchases, as many families want to move and settle in before the new school year begins. Longer daylight hours and pleasant weather often make relocating and house searching easier.

A number of factors, such as market saturation after the spring surge and the start of the vacation season, could explain the decline in sales in late summer. During this time, many families may decide to temporarily put off buying a home. Sales then rebound noticeably, starting in September and culminating in October. This may reflect a secondary rush as buyers and sellers try to close deals before year’s end for tax purposes, or simply to get settled before winter arrives.

December’s drop is typical of the holiday season, when the market is less active for both sellers and buyers. The holidays divert a lot of prospective home buyers, and unfavorable weather in many regions of the nation also plays a role in the decline in home sales.
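Seasonality can also be expressed as each month's share of the total, alongside the number of sales. The sketch below uses hypothetical toy values; the `share_of_total` and `n_sales` columns are additions for illustration.

```python
import pandas as pd

# Hypothetical toy sales (assumption): month number and sale amount.
toy = pd.DataFrame({
    "Month": [1, 1, 6, 6, 6, 12],
    "Sale Amount": [100, 100, 300, 200, 200, 100],
})

# Total sale amount and number of sales per month.
monthly = toy.groupby("Month")["Sale Amount"].agg(total="sum", n_sales="count")

# Each month's share of the overall sale amount.
monthly["share_of_total"] = monthly["total"] / monthly["total"].sum()
print(monthly)
```

Looking at counts next to sums helps separate months with many transactions from months with a few very expensive ones.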

3. Top 10 Towns by Sales Volume

# Sort the data by 'Sale Amount' in descending order to get the towns with the highest sales on top.
data_groupby_town = data_groupby_town.sort_values(by="Sale Amount", ascending=False)

# Initialize the matplotlib figure with a specified figure size.
plt.figure(figsize=(12, 8))

# Set the title for the bar plot.
plt.title("Top 10 Towns by Total Sale Amount")

# Create a color palette with a gradient effect
palette = sns.color_palette("Blues_r", n_colors=10)

# Plot the barplot using Seaborn, only displaying the top 10 towns with the highest sales.
# 'head(10)' ensures that only the top 10 rows from the DataFrame are displayed.
sns.barplot(data=data_groupby_town.head(10), x="Town", y="Sale Amount", palette=palette)


# Rotate the x-axis labels to 90 degrees to avoid overlapping of town names.
plt.xticks(rotation=90)

# Adjust the layout to make sure everything fits well within the plot area without any cropping.
plt.tight_layout()

# Display the plot.
plt.show()

These towns have the highest total sale amounts, which may be explained by their proximity to major economic hubs like New York City. Greenwich in particular is renowned for its affluent residents and luxury real estate, which makes it a prime location for high-value transactions. Stamford is a growing corporate center that probably draws real estate investment because of its new developments and job prospects. Norwalk and Westport, as parts of the Greater New York area, benefit from their appeal to commuters seeking suburban living close to urban hubs. This mix of city access and suburban comfort may push up home prices and sales.

Like Norwalk and Westport, the other towns in the ranking appeal to families considering real estate investments because of their high standard of living, excellent schools, and community services. Their place in the graph indicates a robust local real estate market. New Canaan is characterized by wealthy residents and luxury homes, while Danbury’s diverse community and more reasonably priced options add variety to the market. The strong markets in both towns may reflect distinct housing segments, ranging from upscale to more affordable homes.
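Selecting the top entries of a ranking does not require sorting the whole table first. As a minimal sketch with hypothetical town totals, `nlargest` returns the same rows as sorting in descending order and taking the head:

```python
import pandas as pd

# Hypothetical town totals (assumption).
town_totals = pd.DataFrame({
    "Town": ["A", "B", "C", "D"],
    "Sale Amount": [40, 10, 30, 20],
})

# Keep the two towns with the largest totals, highest first.
top2 = town_totals.nlargest(2, "Sale Amount")
print(top2)
```

This also leaves the original DataFrame in its existing order, which can be convenient when the same table feeds several charts.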

4. Top 10 Towns by Tax Contributions

# Sort the DataFrame by 'Assessed Value' in descending order to find the towns with the highest values.
data_groupby_town = data_groupby_town.sort_values(by="Assessed Value", ascending=False)

# Initialize a figure with a size of 12 inches wide by 8 inches tall to accommodate the bar plot.
plt.figure(figsize=(12, 8))

# Set the title of the plot. The chart sums assessed values, which serve here as a proxy for each town's tax base.
plt.title("Top 10 Towns by Total Assessed Value")

# Create a color palette with a gradient effect
palette = sns.color_palette("Blues_r", n_colors=10)

# Create a bar plot using Seaborn to display the top 10 towns.
# The 'head(10)' function limits the data to the top 10 entries after sorting.
sns.barplot(data=data_groupby_town.head(10), x="Town", y="Assessed Value", palette=palette)

# Rotate the x-axis labels to 90 degrees for better readability, especially useful if town names are long.
plt.xticks(rotation=90)

# Apply 'tight_layout' to adjust the plot parameters so that everything fits into the figure area.
plt.tight_layout()

# Display the plot on the screen.
plt.show()

Greenwich dominates the chart with the highest total assessed value. This is consistent with its standing as one of the wealthiest towns in the country, renowned for its considerable real estate investments and high property values. Because the chart sums assessed values rather than tax bills, it is best read as a proxy for each town’s tax base. Stamford’s total is second only to Greenwich’s, possibly because of its combination of affluent residential neighborhoods and commercial real estate. Its role as a corporate hub and its expanding residential market support high property prices and, in turn, higher assessments.

The reasonably high assessed totals of the next group of towns point to healthy real estate markets. Both Bridgeport, one of the biggest cities in Connecticut, and Norwalk, with its mixture of business and residential properties, have diversified real estate portfolios that support their tax bases. Towns with comparable assessed totals appear to have steady markets: strong local facilities, a family-friendly atmosphere, and a high standard of living help them retain their property values and, consequently, their tax assessments.

Danbury, New Canaan, and Darien sit near the bottom of the ranking, though their assessed totals are still large. Danbury has more affordable housing options and a larger volume of properties contributing to its tax base, whereas New Canaan and Darien are known for their premium houses and affluent populations.
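One way to compare this ranking with the sales ranking is to compute both ranks side by side. The sketch below uses hypothetical town totals, and the `rank_gap` column is an illustrative addition rather than part of the article's analysis.

```python
import pandas as pd

# Hypothetical town totals (assumption).
town_totals = pd.DataFrame({
    "Town": ["A", "B", "C"],
    "Sale Amount": [500, 300, 400],
    "Assessed Value": [200, 250, 100],
})

# Rank 1 is the largest total; method="first" guarantees whole-number ranks.
town_totals["sales_rank"] = (
    town_totals["Sale Amount"].rank(method="first", ascending=False).astype(int)
)
town_totals["assessed_rank"] = (
    town_totals["Assessed Value"].rank(method="first", ascending=False).astype(int)
)

# Positive gap: the town ranks higher on sales than on assessed value.
town_totals["rank_gap"] = town_totals["assessed_rank"] - town_totals["sales_rank"]
print(town_totals)
```

Towns with a large gap in either direction are the ones where sale amounts and assessed values tell different stories.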

5. Distribution of Property Types

# Count the occurrences of each property type in the 'Property Type' column.
prop_type_variety = data['Property Type'].value_counts()

# Convert the Series obtained from value_counts to a DataFrame for easy plotting with Seaborn.
prop_type_counts = prop_type_variety.reset_index()
prop_type_counts.columns = ['Property Type', 'Count']  # Rename columns for clarity.

# Initialize a matplotlib figure with a set size for the upcoming bar plot.
plt.figure(figsize=(12, 8))

# Create a bar plot with Seaborn, where the x-axis is 'Property Type' and the y-axis is 'Count'.
sns.barplot(x='Property Type', y='Count', data=prop_type_counts)

# Add a title to the plot and label the axes.
plt.title('Distribution of Property Types')
plt.xlabel('Property Type')
plt.ylabel('Count')

# Rotate the property type labels on the x-axis by 90 degrees to prevent overlap and ensure they are readable.
plt.xticks(rotation=90)

# Adjust the plot layout to ensure everything is fitted well and no labels are cut off.
plt.tight_layout()

# Display the bar plot.
plt.show()

Single-family dwellings dominate the counts by a wide margin, making them the most common property type in the dataset. This concentration may reflect a societal preference or stronger demand for this kind of real estate in the area under study.

Residential, condo, and two-family properties have lower counts but remain important property types. The “Residential” label may cover a broader range of property types than single-family, condo, and two-family residences alone.

Three-family, vacant land, and commercial properties are less common still, and four-family, apartment, industrial, and public utility properties are the least frequent. The comparatively low counts for these property types could stem from local economic conditions, zoning laws, or market demand.

In order to better understand the local real estate market, real estate developers, city planners, and prospective investors may find this chart’s insights regarding housing patterns and composition useful.
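Raw counts can be complemented with shares. The sketch below uses a hypothetical list of property types and shows the distribution with and without the 'Unknown' placeholder introduced when null values were replaced.

```python
import pandas as pd

# Hypothetical property types (assumption): 6 single-family, 3 condo, 1 unknown.
types = pd.Series(
    ["Single Family"] * 6 + ["Condo"] * 3 + ["Unknown"] * 1,
    name="Property Type",
)

# normalize=True returns proportions instead of counts.
shares = types.value_counts(normalize=True)

# Shares among records with a known property type only.
shares_known = types[types != "Unknown"].value_counts(normalize=True)

print(shares)
print(shares_known)
```

If many records lack a property type, the two views can differ noticeably, so it is worth stating which one a chart is based on.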

6. Correlation Analysis

# Isolate the columns of interest for correlation analysis.
data_corr = data[['List Year', 'Assessed Value', 'Sale Amount']]

# Initialize the matplotlib figure with a square size for better display of the heatmap.
f, ax = plt.subplots(figsize=(9, 9))

# Set the title of the heatmap.
ax.set_title('Year, Sales, and Tax Correlation')

# Create a heatmap using Seaborn to display the correlation matrix.
# 'annot=True' annotates the heatmap with the correlation coefficients.
# 'robust=True' computes the colormap range from robust quantiles instead of the extreme values.
# 'linewidths=.2' adds space between the heatmap cells.
# 'fmt='.2f'' formats the annotation to two decimal places.
sns.heatmap(data_corr.corr(), annot=True, robust=True, linewidths=.2, fmt= '.2f', ax=ax)

# Show the plot.
plt.show()

As expected, the heatmap’s diagonal is 1.00, since every variable correlates perfectly with itself. Assessed value and list year have a very slight positive correlation (r = 0.03), which indicates little to no linear relationship between the year a property is listed and its assessed value.

Similarly, the association between the sale amount and the list year is even weaker, at 0.01. This suggests that there is very little correlation between the year a property is listed and the price at which it sells.

By contrast, the correlation between the sale amount and the assessed value is 0.12. This is larger than the previous two but still weak, suggesting only a slight linear relationship between properties’ assessed values and the amounts they sell for. It is not strong enough to forecast sale amounts reliably from assessed values.

Overall, the heatmap indicates that list year, assessed value, and sale amount have little to no direct linear relationship. This could suggest that other variables that aren’t shown in the heatmap have a bigger impact on a property’s sale price.
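The coefficients in the heatmap are Pearson correlations, the default of `DataFrame.corr()`, and Pearson's r only measures linear association. As a hedged illustration on hypothetical values, the sketch below shows how a relationship that is perfectly monotonic but far from linear gets a perfect rank (Spearman) correlation and a lower Pearson one. Whether anything similar applies to the real data is an assumption that the article does not test.

```python
import pandas as pd

# Hypothetical values (assumption): sale amount always rises with assessed value,
# but the jumps at the top are far from linear.
toy = pd.DataFrame({
    "Assessed Value": [100, 150, 200, 250, 90_000],
    "Sale Amount": [160, 240, 310, 120_000, 130_000],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson_r = toy.corr(method="pearson").loc["Assessed Value", "Sale Amount"]
spearman_r = toy.corr(method="spearman").loc["Assessed Value", "Sale Amount"]

print(f"Pearson:  {pearson_r:.2f}")
print(f"Spearman: {spearman_r:.2f}")
```

Running `data_corr.corr(method="spearman")` next to the default would be one way to check how much of the weak result comes from skewed values rather than from a genuinely absent relationship.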

Conclusion

The visualizations above build a picture of the dynamics of the real estate market across time and dimensions. Real estate sales clearly exhibit seasonality: the monthly trends show peaks in the spring and early summer and troughs in the late fall and winter. This yearly pattern demonstrates how outside variables like the climate, holidays, and business cycles affect real estate activity.

Towns like Greenwich and Stamford lead in both sales and tax assessments. A closer look at the towns with the highest totals indicates a concentration of real estate activity and value in particular locations, which implies a relationship between these communities’ property values and their market activity. The distribution of property types adds to this picture: it shows the predominance of single-family homes and may suggest a market with a family-oriented demographic or a preference for this type of property.

Last but not least, the correlation heatmap provides a more nuanced perspective by showing that the year properties are listed, their assessed values, and their final sale amounts have very little linear relationship. This implies that although there are trends in the real estate market, more variables than just the assessed value and listing year need to be taken into consideration in any predictive analysis of property prices and sales.

To sum up, the real estate market is defined by intricate relationships between preference for certain types of properties, seasonality, regional concentration, and a host of other variables that are not highly connected with assessed values and listing years. This suggests that while making investment decisions, setting policies, or forecasting the market, real estate market stakeholders need to take a wide range of factors into account.

References

Dataset: https://www.kaggle.com/datasets/utkarshx27/real-estate-sales-2001-2021-gl