House Data Prediction using Python

Marian Vinas
8 min read · Jun 25, 2020

New home sales is a housing market statistic that measures the sales of newly built homes over a given period. It provides a broad view of activity in the housing market. For example, an increase in new home sales suggests demand is picking up. Because changes are often seen in new home sales before the market at large, new home sales is considered a leading indicator. The statistic is also a sign of the health of the U.S. economy because an increase in new home sales suggests an increase in consumer confidence and spending. The most closely watched data on new home sales is the U.S. Census Bureau’s New Residential Sales, released around the 20th of each month.

Housing Affordability Index (HAI)

One common yardstick is the Housing Affordability Index, published by the National Association of Realtors, a research and lobby group in Washington, D.C. Using data from the Census Bureau and the Federal Housing Finance Agency (a government body tasked with regulating mortgages), the HAI measures whether a typical family earns enough income to qualify for a mortgage on a median-priced home. The Federal Reserve Bank of San Francisco calls the HAI “a way to track over time whether housing is becoming more or less affordable for the typical household. The HAI incorporates changes in key variables affecting affordability: housing prices, interest rates, and income.”

A value of 100 means that a family with the median income has just the right amount to purchase a median-priced home. A value above 100 means the median-income family has more than enough to purchase that home. Since 2013, the HAI has been falling, indicating that home prices are rising faster than incomes. But the HAI is still far above averages in recent decades. (See HUD chart below.)

Loading the data

You can download the dataset here.

The first step in any data science project is to import your data. Often, you’ll work with data in comma-separated values (CSV) files, and it’s easy to run into problems at the very start of your workflow.

You can see my actual prediction at the Kaggle link.

import pandas as pd

df = pd.read_csv('../input/train.csv')

How to get the number of rows and columns:

The shape attribute of pandas.DataFrame stores the number of rows and columns as a tuple (number of rows, number of columns).

df.shape

(1460, 81)
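If you want to try this without downloading anything, here is a minimal sketch that reads a tiny inline CSV and checks its shape. The three rows and their values are made up for illustration; only the column names echo the housing data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for train.csv (values are made up).
csv_text = """Id,BedroomAbvGr,FullBath,SalePrice
1,3,2,200000
2,2,1,150000
3,4,3,310000
"""

sample = pd.read_csv(io.StringIO(csv_text))

# shape is a plain tuple: (number of rows, number of columns)
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple like this is handy when you only need one of the two numbers later on.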

Feature Importances

Since there are a lot of features in this dataset, we are focusing on sale price, bedrooms, bathrooms, fireplaces, and garage car capacity.

Target Variable

The “target variable” is the variable whose values are to be modeled and predicted by other variables. It is analogous to the dependent variable (i.e., the variable on the left of the equal sign) in linear regression. There must be one and only one target variable in a decision tree analysis. The target variable that we can use for this dataset is [‘SalePrice’].

# statistics summary
df['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64

Now, let’s start cleaning the data

Missing data

Important questions when thinking about missing data:

  • How prevalent is the missing data?
  • Is missing data random or does it have a pattern?

The answers to these questions are important for practical reasons because missing data can imply a reduction of the sample size. This can prevent us from proceeding with the analysis. Moreover, from a substantive perspective, we need to ensure that the missing data process is not biased and hiding an inconvenient truth.
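To answer the first question (how prevalent is the missing data?), one option is to count the missing values per column and turn the counts into shares. This is only a sketch on a made-up toy frame; the column names mimic the housing data, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "GarageType": ["Attchd", "Detchd", None, "Attchd"],
    "SalePrice": [200000, 150000, 310000, 140000],
})

missing_count = toy.isnull().sum()   # number of missing cells per column
missing_share = toy.isnull().mean()  # fraction of rows missing per column

# Put the worst offenders at the top.
summary = (
    pd.DataFrame({"count": missing_count, "share": missing_share})
    .sort_values("share", ascending=False)
)
print(summary)
```

On the real data, the same two lines can be run on df before any fillna call, so the counts reflect the raw file.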

cardinality = df.select_dtypes(exclude='number').nunique()

high_cardinality_feat = cardinality[cardinality > 20].index.tolist()
df = df.drop(columns = high_cardinality_feat)
df = df.fillna('Missing')

# Time-based split on YrSold (the data covers sales from 2006 through 2010)
train = df[df['YrSold'] <= 2006]
val = df[df['YrSold'] == 2007]
test = df[df['YrSold'] >= 2008]

target = 'Great'
features = train.columns.drop([target, 'YrSold'])
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

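import category_encoders as ce  # assumed encoder; that step was cut off in the original snippet
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),  # turns the string columns into integers for the forest
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)
pipeline.fit(X_train, y_train)
print(f'Validation accuracy: {pipeline.score(X_val, y_val)}')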

Validation Accuracy: 1.0

Decision Tree Classifier

I used a time-based split for this dataset (YrSold runs from 2006 through 2010):

  • Train on sales from 2006 & earlier.
  • Validate on 2007.
  • Test on 2008 & later.

The split is done by filtering on the YrSold column, as in the code above.
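Here is a minimal sketch of a time-based split on a toy frame. The helper name split_by_year and the toy values are hypothetical, not part of my notebook. The point is that the three pieces should never share rows, otherwise the validation score is measured on data the model has already seen.

```python
import pandas as pd


def split_by_year(frame, year_col, val_year):
    """Return (train, val, test): rows before, in, and after val_year."""
    train = frame[frame[year_col] < val_year]
    val = frame[frame[year_col] == val_year]
    test = frame[frame[year_col] > val_year]
    return train, val, test


# Hypothetical toy data with one row or more per year.
toy = pd.DataFrame({
    "YrSold": [2006, 2006, 2007, 2008, 2009, 2010],
    "SalePrice": [100, 110, 120, 130, 140, 150],
})

train, val, test = split_by_year(toy, "YrSold", val_year=2007)

# The pieces must cover every row exactly once.
assert len(train) + len(val) + len(test) == len(toy)
print(len(train), len(val), len(test))
```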


The main difference between regression and classification is that the output variable in regression is numerical (or continuous) while that for classification is categorical (or discrete).

Let’s review ‘SalePrice’

A scatter plot of GrLivArea against SalePrice shows the points clustered closely together in a linear relationship.

A scatter plot of TotalBsmtSF against SalePrice also shows a strong relationship that looks linear, or possibly exponential.

Fitting the Linear Regression model

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

Linear Regression R^2 0.6867240177727985
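For reference, this is roughly what fitting and scoring a linear regression looks like with scikit-learn. It is a sketch on synthetic data (a made-up living area and a noisy linear price), so the numbers it prints are not the ones from my notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: living area in square feet and a noisy linear price.
area = rng.uniform(500, 3500, size=300)
price = 50_000 + 100 * area + rng.normal(0, 20_000, size=300)

X = area.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
X_train, X_val, y_train, y_val = train_test_split(X, price, random_state=42)

model = LinearRegression().fit(X_train, y_train)
r2_val = model.score(X_val, y_val)  # R^2 on held-out rows

print(f"slope: {model.coef_[0]:.1f}, validation R^2: {r2_val:.3f}")
```

The slope is the explanatory variable’s estimated effect: here, the price change per extra square foot.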

Fit Gradient Boosting model

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

Gradient Boosting R^2 0.999999995082342
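Boosted trees can fit the rows they were trained on almost perfectly, so it helps to compare the score on training rows with the score on held-out rows. The sketch below does that on synthetic data; the features, target, and settings are assumptions for illustration, not my notebook’s configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, non-linear target built from the first two of three features.
X = rng.uniform(0, 10, size=(400, 3))
y = 10 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 2.0, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# An ensemble of shallow trees, each one correcting the previous ones' errors.
gb = GradientBoostingRegressor(n_estimators=200, random_state=0)
gb.fit(X_train, y_train)

r2_train = gb.score(X_train, y_train)
r2_val = gb.score(X_val, y_val)
print(f"train R^2: {r2_train:.4f}, validation R^2: {r2_val:.4f}")
```

A large gap between the two scores is a hint of overfitting or of a feature that leaks the target.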

Random Forest Classifier using Pdpbox

The feature I’m using for this is “GarageType”.

I also tried partial dependence plots with 2 features to show the interaction between them. I chose [‘HouseStyle’, ‘Fireplaces’].
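To show what a partial dependence curve actually computes, here is a sketch that uses scikit-learn’s partial_dependence function instead of Pdpbox, with a random forest regressor on synthetic data. The swap and the data are my assumptions for the example. The curve is the model’s average prediction as one feature sweeps over a grid while the other features stay at their observed values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(1)

# Synthetic features: column 0 drives the target, column 1 is pure noise.
X = rng.uniform(0, 4, size=(300, 2))
y = 20_000 * X[:, 0] + rng.normal(0, 5_000, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Average prediction over a 10-point grid for feature 0.
pd_result = partial_dependence(
    forest, X, features=[0], grid_resolution=10, kind="average"
)
curve = pd_result["average"][0]
print(curve.round(0))
```

A rising curve means the model predicts higher values as that feature grows, which is what the plots above visualize.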

After all the coding and cleaning, the cross-validation MAE is 34016.79, and the best hyperparameter value for max features is 0.60.
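One way to get a cross-validated MAE and a best max_features value is a grid search. This is a sketch under assumptions: synthetic data, a small grid, and GridSearchCV as the search method, which may differ from what my notebook used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)

# Synthetic data: two informative features out of five.
X = rng.uniform(0, 1, size=(200, 5))
y = 100_000 * X[:, 0] + 50_000 * X[:, 1] + rng.normal(0, 10_000, size=200)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=7),
    param_grid={"max_features": [0.2, 0.6, 1.0]},  # fraction of features per split
    scoring="neg_mean_absolute_error",  # scores are maximized, so MAE is negated
    cv=3,
)
search.fit(X, y)

cv_mae = -search.best_score_  # flip the sign back to a positive MAE
print(f"CV MAE: {cv_mae:.2f}, best params: {search.best_params_}")
```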

Let’s start the prediction

Running X_test.head() shows the first few rows of the test features.

For row 0, the actual sale price is $208,500 and the prediction is $177,494.

I tried a Shapley values force plot to explain the expected value of a house with a 2-car garage, 2 full baths, and no fireplace.

My prediction shows that the price of a house with no fireplace, a 2-car garage, and 2 full baths is $177,494.72.

$401,573 estimated sale price.

Starting from baseline of $180,964

(GarageCars, 3) $110,828

(Fireplaces, 2) $83,754

(FullBath, 2) $26,027

The sale price increases if you add 2 fireplaces and 1 additional garage space while keeping 2 full baths.

$237,754 estimated sale price.

Starting from baseline of $180,964

(GarageCars, 2) $-14,419

(Fireplaces, 2) $49,018

(FullBath, 2) $22,191

Last prediction:

$499,511 estimated sale price.

Starting from baseline of $180,964

(GarageCars, 3) $129,684

(Fireplaces, 2) $78,304

(FullBath, 3) $110,559

The price increases when a full bath and one more garage space are added.
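The breakdowns above are additive: each estimate is the baseline plus the listed contributions. A quick check using only the figures quoted above (the scenario labels are just my shorthand):

```python
baseline = 180_964

# Contributions in the order (GarageCars, Fireplaces, FullBath), copied from above.
scenarios = {
    "3-car garage, 2 fireplaces, 2 full baths": [110_828, 83_754, 26_027],
    "2-car garage, 2 fireplaces, 2 full baths": [-14_419, 49_018, 22_191],
    "3-car garage, 2 fireplaces, 3 full baths": [129_684, 78_304, 110_559],
}

# Baseline plus the sum of contributions gives each estimated sale price.
totals = {name: baseline + sum(parts) for name, parts in scenarios.items()}

for name, total in totals.items():
    print(f"{name}: ${total:,}")
```

The three totals come out to $401,573, $237,754, and $499,511, matching the estimates shown above.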


Analyzing house sale prices is easier with Python. Predicting what happens when features are added shows that the price increases. So, if you’re planning to sell or buy a house, please see the list I made below of ways to instantly add value to your home.

1. Add some smart technology

Smart home technology was identified as a top trend in Zillow’s 2019 Design Forecast. If you want to make your home more valuable, then it’s time to start thinking smart. Investing in some smart home technology can increase your home’s value quickly, without the expense of a huge renovation.

2. Remove your carpeting

“Modern home buyers are turned off by carpeting,” says Earl White, founder of House Heroes Realty and House Heroes, LLC. “Laying down a shiny new wood floor across easily pays for itself several times over.” And if you’re worried that authentic flooring is going to cost you an arm and a leg, there are always “good alternatives to traditional wood floors,” says White. “Home renovators can buy engineered or laminate wood for a fraction of the cost of an expensive floor.”

3. Replace your dated garage door

Luckily, replacing it isn’t a hugely pricey investment, and doing so can make your house much more attractive in the long run. “By installing a new garage door, you can see up to a 92 percent return on your investment,” says Palomino. “Garage doors boost curb appeal, provide safety and security, and even help on energy bills if insulated.”

4. Redo your kitchen counters

Those old laminate or tile counters are bringing down your home’s appeal and its potential selling point. Luckily, installing granite or quartz counters is easy and it can quickly improve the look and value of your place. “Replacing counter tops can be completed for under $10,000 and will have a huge impact on value,” says Franklin.

5. Install recessed lighting

Want to make every room in your home look brighter, more beautiful, and more expensive? The answer is simple: recessed lighting. “Living rooms should have four recessed lights, and hallways and kitchens should have a few too,” suggests Brian Dougherty, managing partner at Robert Paul Properties in Boston.

6. Go minimal

As a general rule, ornate fabrics, patterns, and design styles won’t yield a major ROI. When you’re looking to improve your home’s value, minimalism yields maximum money. Buyers today are very influenced by the home shows that they see on HGTV, and that seems to dictate what many people say they are looking for: neutral colors, clean surfaces, and minimal window treatments.

These are a few tips for upgrading your home if you’re planning to sell or buy a house. And always research the location, schools (if you have kids like me), price insights, home facts, crime rate, and median real estate values.