EDA Case Study: House Price

Task Description

House Prices is a classical Kaggle competition. The task is to predict the final price of each house. For more details, refer to https://www.kaggle.com/c/house-prices-advanced-regression-techniques/.

Goal of this notebook

As this is a famous competition, there are many excellent analyses of how to do EDA and build models for this task. See https://www.kaggle.com/khandelwallaksya/house-prices-eda for a reference. In this notebook, we will show how dataprep.eda can simplify the EDA process to a few lines of code.

In summary, we will:

- Understand the problem. We'll look at each variable and do a philosophical analysis of its meaning and importance for this problem.
- Univariate study. We'll focus on the dependent variable ('SalePrice') and try to learn a little more about it.
- Multivariate study. We'll try to understand how the dependent variable and the independent variables relate.
- Basic cleaning. We'll clean the dataset and handle missing data, outliers and categorical variables.

Import libraries

[1]:
from dataprep.eda import plot
from dataprep.eda import plot_correlation
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)

Load data

[2]:
houses = load_dataset("house_prices_train")
houses.head()
[2]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

[3]:
houses_test = load_dataset("house_prices_test")
houses_test.head()
[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

[4]:
houses.shape
[4]:
(1460, 81)

There are 1460 tuples in total; each tuple contains 80 features and 1 target value.

[5]:
houses_test.shape
[5]:
(1459, 80)

Variable identification

[6]:
plot(houses)
[6]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 81
Number of Rows 1460
Missing Cells 6965
Missing Cells (%) 5.9%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 3.9 MB
Average Row Size in Memory 2.7 KB
Variable Types
  • Numerical: 27
  • Categorical: 53
  • GeoGraphy: 1
Dataset Insights
Id is uniformly distributed Uniform
LowQualFinSF and 3SsnPorch have similar distributions Similar Distribution
LowQualFinSF and MiscVal have similar distributions Similar Distribution
3SsnPorch and MiscVal have similar distributions Similar Distribution
ScreenPorch and MiscVal have similar distributions Similar Distribution
LotFrontage has 259 (17.74%) missing values Missing
Alley has 1369 (93.77%) missing values Missing
BsmtQual has 37 (2.53%) missing values Missing
BsmtCond has 37 (2.53%) missing values Missing
BsmtExposure has 38 (2.6%) missing values Missing
BsmtFinType1 has 37 (2.53%) missing values Missing
BsmtFinType2 has 38 (2.6%) missing values Missing
FireplaceQu has 690 (47.26%) missing values Missing
GarageType has 81 (5.55%) missing values Missing
GarageYrBlt has 81 (5.55%) missing values Missing
GarageFinish has 81 (5.55%) missing values Missing
GarageQual has 81 (5.55%) missing values Missing
GarageCond has 81 (5.55%) missing values Missing
PoolQC has 1453 (99.52%) missing values Missing
Fence has 1179 (80.75%) missing values Missing
MiscFeature has 1406 (96.3%) missing values Missing
MSSubClass is skewed Skewed
LotFrontage is skewed Skewed
LotArea is skewed Skewed
OverallQual is skewed Skewed
YearBuilt is skewed Skewed
YearRemodAdd is skewed Skewed
MasVnrArea is skewed Skewed
BsmtFinSF1 is skewed Skewed
BsmtFinSF2 is skewed Skewed
TotalBsmtSF is skewed Skewed
2ndFlrSF is skewed Skewed
LowQualFinSF is skewed Skewed
TotRmsAbvGrd is skewed Skewed
WoodDeckSF is skewed Skewed
OpenPorchSF is skewed Skewed
EnclosedPorch is skewed Skewed
3SsnPorch is skewed Skewed
ScreenPorch is skewed Skewed
MiscVal is skewed Skewed
MoSold is skewed Skewed
Street has constant length 4 Constant Length
Alley has constant length 4 Constant Length
LotShape has constant length 3 Constant Length
LandContour has constant length 3 Constant Length
Utilities has constant length 6 Constant Length
LandSlope has constant length 3 Constant Length
OverallCond has constant length 1 Constant Length
ExterQual has constant length 2 Constant Length
ExterCond has constant length 2 Constant Length
BsmtQual has constant length 2 Constant Length
BsmtCond has constant length 2 Constant Length
BsmtExposure has constant length 2 Constant Length
BsmtFinType1 has constant length 3 Constant Length
BsmtFinType2 has constant length 3 Constant Length
HeatingQC has constant length 2 Constant Length
CentralAir has constant length 1 Constant Length
BsmtFullBath has constant length 1 Constant Length
BsmtHalfBath has constant length 1 Constant Length
FullBath has constant length 1 Constant Length
HalfBath has constant length 1 Constant Length
BedroomAbvGr has constant length 1 Constant Length
KitchenAbvGr has constant length 1 Constant Length
KitchenQual has constant length 2 Constant Length
Fireplaces has constant length 1 Constant Length
FireplaceQu has constant length 2 Constant Length
GarageFinish has constant length 3 Constant Length
GarageCars has constant length 1 Constant Length
GarageQual has constant length 2 Constant Length
GarageCond has constant length 2 Constant Length
PavedDrive has constant length 1 Constant Length
PoolQC has constant length 2 Constant Length
MiscFeature has constant length 4 Constant Length
YrSold has constant length 4 Constant Length
MasVnrArea has 861 (58.97%) zeros Zeros
BsmtFinSF1 has 467 (31.99%) zeros Zeros
BsmtFinSF2 has 1293 (88.56%) zeros Zeros
BsmtUnfSF has 118 (8.08%) zeros Zeros
2ndFlrSF has 829 (56.78%) zeros Zeros
LowQualFinSF has 1434 (98.22%) zeros Zeros
GarageArea has 81 (5.55%) zeros Zeros
WoodDeckSF has 761 (52.12%) zeros Zeros
OpenPorchSF has 656 (44.93%) zeros Zeros
EnclosedPorch has 1252 (85.75%) zeros Zeros
3SsnPorch has 1436 (98.36%) zeros Zeros
ScreenPorch has 1344 (92.05%) zeros Zeros
MiscVal has 1408 (96.44%) zeros Zeros

Overview of the data

We can get the following information from the report:

- Variable: the variable name.
- Type: the report identifies 27 numerical columns, 53 categorical columns and 1 geographical column.
- Missing values: how many missing values each column contains. For instance, Fence contains 80.75% × 1460 ≈ 1179 missing tuples. Some models, such as SVM, do not accept input data containing missing values, so we have to clean the data before using it.
- Target value: the distribution of the target value (SalePrice). It is numerical and roughly unimodal, though right-skewed; since this is a regression task, we are not confronted with a class-imbalance problem.
- Guess: judging from the column names, we reckon GrLivArea, YearBuilt and OverallQual are likely to be correlated with the target value (SalePrice).
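The per-column missing counts that the report surfaces can be reproduced with plain pandas. A minimal sketch on a toy frame (the column values here are made up; the real code would call the same methods on `houses`):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the houses frame.
df = pd.DataFrame({
    "Fence": [np.nan, "MnPrv", np.nan, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count and percentage of missing cells per column, worst first.
missing = df.isnull().sum().sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(1)
print(pd.concat([missing, missing_pct], axis=1, keys=["count", "pct"]))
```

On the real dataset this reproduces the report's figures, e.g. Fence at roughly 80.75% missing.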

Correlation in data

[7]:
plot_correlation(houses, "SalePrice")
[7]:
DataPrep.EDA Report
[8]:
plot_correlation(houses, "SalePrice", value_range=[0.5, 1])
[8]:
DataPrep.EDA Report

OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd have more than 0.5 Pearson correlation with SalePrice.

OverallQual, GrLivArea, GarageCars, YearBuilt, GarageArea, FullBath, TotalBsmtSF, GarageYrBlt, 1stFlrSF, YearRemodAdd, TotRmsAbvGrd and Fireplaces have more than 0.5 Spearman correlation with SalePrice.

OverallQual, GarageCars, GrLivArea and FullBath have more than 0.5 KendallTau correlation with SalePrice.

EnclosedPorch and KitchenAbvGr have a slight negative correlation with the target variable.

These can prove to be important features to predict SalePrice.
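The three correlation methods that `plot_correlation` reports correspond directly to pandas' `DataFrame.corr`. A sketch on a toy frame (the values are made up; the real call would use `houses` with "SalePrice" as the target):

```python
import pandas as pd

# Toy numeric frame standing in for `houses`.
df = pd.DataFrame({
    "SalePrice": [200, 180, 250, 140, 300],
    "GrLivArea": [1700, 1260, 1780, 960, 2200],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
})

# Correlation of every numeric column with the target, for all three
# methods that plot_correlation(houses, "SalePrice") computes.
for method in ("pearson", "spearman", "kendall"):
    corr = df.corr(method=method)["SalePrice"].drop("SalePrice")
    print(method, corr.round(3).to_dict())
```

Filtering the resulting series to values above 0.5 recovers the lists of highly correlated features above.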

Heatmap

[9]:
plot_correlation(houses)
[9]:
DataPrep.EDA Report
Pearson Spearman KendallTau
Highest Positive Correlation 0.882 0.891 0.839
Highest Negative Correlation -0.495 -0.574 -0.413
Lowest Correlation 0.0 0.0 0.0
Mean Correlation 0.096 0.098 0.078
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (BsmtUnfSF, KitchenAbvGr)

In summary

In my opinion, this heatmap is the best way to get a quick overview of features’ relationships.

At first sight, there are two red colored squares that get my attention. The first one refers to the ‘TotalBsmtSF’ and ‘1stFlrSF’ variables, and the second one refers to the ‘GarageX’ variables. Both cases show how significant the correlation between these variables is. In fact, the correlation is so strong that it can indicate multicollinearity. If we think about these variables, we can conclude that they give almost the same information, so multicollinearity really does occur. Heatmaps are great for detecting this kind of situation, and in problems dominated by feature selection, like ours, they are an essential tool.

Another thing that got my attention was the ‘SalePrice’ correlations. We can see our well-known ‘GrLivArea’, ‘TotalBsmtSF’, and ‘OverallQual’, but we can also see many other variables that should be taken into account. That’s what we will do next.

[10]:
plot_correlation(houses[["SalePrice","OverallQual","GrLivArea","GarageCars",
                  "GarageArea","GarageYrBlt","TotalBsmtSF","1stFlrSF","FullBath",
                  "TotRmsAbvGrd","YearBuilt","YearRemodAdd"]])
[10]:
DataPrep.EDA Report
Pearson Spearman KendallTau
Highest Positive Correlation 0.882 0.891 0.839
Highest Negative Correlation 0.0 0.0 0.0
Lowest Correlation 0.096 0.177 0.131
Mean Correlation 0.442 0.461 0.364
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: None
  • Least correlated: (TotRmsAbvGrd, YearBuilt)
  • Most positive correlated: (GarageYrBlt, YearBuilt)
  • Most negative correlated: None
  • Least correlated: (TotRmsAbvGrd, YearBuilt)
  • Most positive correlated: (GarageYrBlt, YearBuilt)
  • Most negative correlated: None
  • Least correlated: (TotRmsAbvGrd, YearBuilt)

As we saw above, a few features show high multicollinearity in the heatmap. Let's focus on the red squares on the diagonal and a few off-diagonal ones:

SalePrice and OverallQual

GarageArea and GarageCars

TotalBsmtSF and 1stFlrSF

GrLivArea and TotRmsAbvGrd

YearBuilt and GarageYrBlt

We may want to combine each such pair into a single feature before using them as predictors.
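One simple way to collapse a collinear pair into a single feature is to average the z-scores of its members. This is our own assumption for illustration, not a prescribed recipe, sketched here on toy GarageCars/GarageArea values:

```python
import pandas as pd

# Toy frame; the real code would operate on the same columns of `houses`.
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3],
    "GarageArea": [280, 480, 520, 720],
})

# Z-score each column, then average row-wise: a single "garage size"
# feature that keeps the shared signal of the collinear pair.
z = (df - df.mean()) / df.std()
df["GarageScore"] = z.mean(axis=1)
print(df["GarageScore"].round(3).tolist())
```

Because the two inputs are standardized first, neither column dominates the combined feature through its raw scale.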

[11]:
plot_correlation(houses, value_range=[0.5, 1])
[11]:
DataPrep.EDA Report
Pearson Spearman KendallTau
Highest Positive Correlation 0.882 0.891 0.839
Highest Negative Correlation -0.495 -0.574 -0.413
Lowest Correlation 0.0 0.0 0.0
Mean Correlation 0.096 0.098 0.078
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (BsmtUnfSF, KitchenAbvGr)
[12]:
plot_correlation(houses, k=30)
[12]:
DataPrep.EDA Report
Pearson Spearman KendallTau
Highest Positive Correlation 0.882 0.891 0.839
Highest Negative Correlation -0.495 -0.574 -0.413
Lowest Correlation 0.0 0.0 0.0
Mean Correlation 0.096 0.098 0.078
  • Most positive correlated: (GarageCars, GarageArea)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (Id, GarageYrBlt)
  • Most positive correlated: (YearBuilt, GarageYrBlt)
  • Most negative correlated: (BsmtFinSF1, BsmtUnfSF)
  • Least correlated: (BsmtUnfSF, KitchenAbvGr)

Attribute Pair                   Correlation

(GarageArea, GarageCars)         0.882475
(GarageYrBlt, YearBuilt)         0.825667
(GrLivArea, TotRmsAbvGrd)        0.825489
(1stFlrSF, TotalBsmtSF)          0.819530
(2ndFlrSF, GrLivArea)            0.687501
(BedroomAbvGr, TotRmsAbvGrd)     0.676620
(BsmtFinSF1, BsmtFullBath)       0.649212
(GarageYrBlt, YearRemodAdd)      0.642277
(FullBath, GrLivArea)            0.630012
(2ndFlrSF, TotRmsAbvGrd)         0.616423
(2ndFlrSF, HalfBath)             0.609707
(GarageCars, OverallQual)        0.600671
(GrLivArea, OverallQual)         0.593007
(YearBuilt, YearRemodAdd)        0.592855
(GarageCars, GarageYrBlt)        0.588920
(OverallQual, YearBuilt)         0.572323
(1stFlrSF, GrLivArea)            0.566024
(GarageArea, GarageYrBlt)        0.564567
(GarageArea, OverallQual)        0.562022
(FullBath, TotRmsAbvGrd)         0.554784
(OverallQual, YearRemodAdd)      0.550684
(FullBath, OverallQual)          0.550600
(GarageYrBlt, OverallQual)       0.547766
(GarageCars, YearBuilt)          0.537850
(OverallQual, TotalBsmtSF)       0.537808
(BsmtFinSF1, TotalBsmtSF)        0.522396
(BedroomAbvGr, GrLivArea)        0.521270
(2ndFlrSF, BedroomAbvGr)         0.502901

This shows multicollinearity. In regression, “multicollinearity” refers to features that are correlated with other features. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your target variable, but also to each other.

Problem:

Multicollinearity increases the standard errors of the coefficients. That means some variables may appear statistically insignificant when they should be significant.

To avoid this we can do three things:

1. Completely remove those variables.
2. Make a new feature by adding them, or by some other operation.
3. Use PCA, which will reduce the feature set to a small number of non-collinear features.

Reference: http://blog.minitab.com/blog/understanding-statistics/handling-multicollinearity-in-regression-analysis
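The PCA option can be sketched by hand with NumPy: standardize the features, eigendecompose their correlation matrix, and read off the explained-variance ratios. The toy data below (loosely shaped like GarageCars vs GarageArea) is our own illustration:

```python
import numpy as np

# Two collinear toy features (think GarageCars vs GarageArea).
X = np.array([[1, 280], [2, 480], [2, 520], [3, 720]], dtype=float)

# PCA by hand: standardize, then eigendecompose the correlation matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
corr = Xs.T @ Xs / len(Xs)
eigvals, eigvecs = np.linalg.eigh(corr)
ratios = eigvals[::-1] / eigvals.sum()   # largest component first
print(ratios.round(3))
```

When the inputs are nearly collinear, almost all variance lands on the first component, so a single PCA feature can replace the pair.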

Univariate Analysis

How a single variable is distributed over its numeric range, what its statistical summary looks like, and whether it is positively or negatively skewed.
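The skewness figure that the report computes for a column can also be read straight off a pandas Series. A sketch on a made-up right-skewed sample (mimicking the shape of SalePrice):

```python
import pandas as pd

# A toy sample with most mass low and a long upper tail.
s = pd.Series([100, 110, 120, 130, 140, 150, 160, 400, 750])

# Positive skewness indicates a long right tail, the same statistic
# the report labels γ1 for SalePrice.
print(round(s.skew(), 3))
```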

[13]:
plot(houses, "SalePrice")
[13]:
DataPrep.EDA Report

Overview

Approximate Distinct Count 663
Approximate Unique (%) 45.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 22.8 KB
Mean 180921.1959
Minimum 34900
Maximum 755000
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%

Quantile Statistics

Minimum 34900
5-th Percentile 88000
Q1 129975
Median 163000
Q3 214000
95-th Percentile 326100
Maximum 755000
Range 720100
IQR 84025

Descriptive Statistics

Mean 180921.1959
Standard Deviation 79442.5029
Variance 6.3111e+09
Sum 2.6414e+08
Skewness 1.8809
Kurtosis 6.5098
Coefficient of Variation 0.4391
  • SalePrice is skewed right (γ1 = 1.8809)
  • SalePrice is not normally distributed (p-value 2.792946582081966e-06)
  • SalePrice has 61 outliers
Value               Count  Frequency (%)
140000                 20  1.4%
135000                 17  1.2%
145000                 14  1.0%
155000                 14  1.0%
110000                 13  0.9%
190000                 13  0.9%
115000                 12  0.8%
160000                 12  0.8%
130000                 11  0.8%
139000                 11  0.8%
Other values (653)   1323  90.6%

Pivotal Features

[14]:
plot_correlation(houses, "OverallQual", "SalePrice")
[14]:
DataPrep.EDA Report
[15]:
plot(houses, "OverallQual", "SalePrice")
[15]:
DataPrep.EDA Report
[16]:
plot(houses, "GarageCars", "SalePrice")
[16]:
DataPrep.EDA Report
[17]:
plot(houses, "Fireplaces", "SalePrice")
[17]:
DataPrep.EDA Report
[18]:
plot(houses, "GrLivArea", "SalePrice")
[18]:
DataPrep.EDA Report
[19]:
plot(houses, "TotalBsmtSF", "SalePrice")
[19]:
DataPrep.EDA Report
[20]:
plot(houses, "YearBuilt", "SalePrice")
[20]:
DataPrep.EDA Report

In summary

Based on the above analysis, we can conclude that:

- ‘GrLivArea’ and ‘TotalBsmtSF’ seem to be linearly related with ‘SalePrice’. Both relationships are positive: as one variable increases, so does the other. In the case of ‘TotalBsmtSF’, the slope of the linear relationship is particularly high.
- ‘OverallQual’ and ‘YearBuilt’ also seem related to ‘SalePrice’. The relationship is stronger for ‘OverallQual’, where the box plot shows how sale prices increase with overall quality.

We only analysed four variables, but there are many others we should analyse. The trick here seems to be choosing the right features (feature selection) rather than defining complex relationships between them (feature engineering).

That said, let’s separate the wheat from the chaff.

Missing Value Imputation

Missing values in the training data set can negatively affect a model's predictions or classifications.

Also, some machine learning algorithms cannot accept missing data, e.g. SVM and neural networks.

But filling missing values with the mean/median/mode, or using another predictive model to predict them, is itself a prediction that may not be 100% accurate. Alternatively, you can use models such as decision trees and random forests, which handle missing values well.
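Mean/median/mode imputation in pandas looks like this. A generic sketch on a toy frame, not the strategy applied to the basement and garage columns below:

```python
import pandas as pd
import numpy as np

# Toy frame with a numeric and a categorical column containing gaps.
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
})

# Numeric gap -> median; categorical gap -> mode (most frequent value).
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
df["GarageType"] = df["GarageType"].fillna(df["GarageType"].mode()[0])
print(df)
```

This kind of blanket imputation ignores *why* a value is missing; as the basement and garage columns show next, a missing value can itself carry meaning ("no basement"), in which case a sentinel like 'None' or 0 is more faithful.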

Some of this part is based on this kernel: https://www.kaggle.com/bisaria/house-prices-advanced-regression-techniques/handling-missing-data

[21]:
plot_missing(houses)
[21]:
DataPrep.EDA Report

Missing Statistics

Missing Cells 6965
Missing Cells (%) 5.9%
Missing Columns 19
Missing Rows 1460
Avg Missing Cells per Column 85.99
Avg Missing Cells per Row 4.77
[22]:
# plot_missing(houses, "BsmtQual")
basement_cols = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtFinSF1', 'BsmtFinSF2']
houses[basement_cols][houses['BsmtQual'].isnull()]
[22]:
BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 BsmtFinSF1 BsmtFinSF2
17 NaN NaN NaN NaN NaN 0 0
39 NaN NaN NaN NaN NaN 0 0
90 NaN NaN NaN NaN NaN 0 0
102 NaN NaN NaN NaN NaN 0 0
156 NaN NaN NaN NaN NaN 0 0
182 NaN NaN NaN NaN NaN 0 0
259 NaN NaN NaN NaN NaN 0 0
342 NaN NaN NaN NaN NaN 0 0
362 NaN NaN NaN NaN NaN 0 0
371 NaN NaN NaN NaN NaN 0 0
392 NaN NaN NaN NaN NaN 0 0
520 NaN NaN NaN NaN NaN 0 0
532 NaN NaN NaN NaN NaN 0 0
533 NaN NaN NaN NaN NaN 0 0
553 NaN NaN NaN NaN NaN 0 0
646 NaN NaN NaN NaN NaN 0 0
705 NaN NaN NaN NaN NaN 0 0
736 NaN NaN NaN NaN NaN 0 0
749 NaN NaN NaN NaN NaN 0 0
778 NaN NaN NaN NaN NaN 0 0
868 NaN NaN NaN NaN NaN 0 0
894 NaN NaN NaN NaN NaN 0 0
897 NaN NaN NaN NaN NaN 0 0
984 NaN NaN NaN NaN NaN 0 0
1000 NaN NaN NaN NaN NaN 0 0
1011 NaN NaN NaN NaN NaN 0 0
1035 NaN NaN NaN NaN NaN 0 0
1045 NaN NaN NaN NaN NaN 0 0
1048 NaN NaN NaN NaN NaN 0 0
1049 NaN NaN NaN NaN NaN 0 0
1090 NaN NaN NaN NaN NaN 0 0
1179 NaN NaN NaN NaN NaN 0 0
1216 NaN NaN NaN NaN NaN 0 0
1218 NaN NaN NaN NaN NaN 0 0
1232 NaN NaN NaN NaN NaN 0 0
1321 NaN NaN NaN NaN NaN 0 0
1412 NaN NaN NaN NaN NaN 0 0

All categorical variables contain NaN, whereas the continuous ones have 0. That means those houses have no basement, so we can replace the NaN values with ‘None’.

[23]:
for col in basement_cols:
    if 'FinSF' not in col:  # only the categorical basement columns
        houses[col] = houses[col].fillna('None')
[24]:
# plot_missing(houses, "FireplaceQu")
houses["FireplaceQu"] = houses["FireplaceQu"].fillna('None')
pd.crosstab(houses.Fireplaces, houses.FireplaceQu)
[24]:
FireplaceQu Ex Fa Gd None Po TA
Fireplaces
0 0 0 0 690 0 0
1 19 28 324 0 20 259
2 4 4 54 0 0 53
3 1 1 2 0 0 1
[25]:
garage_cols = ['GarageType', 'GarageQual', 'GarageCond', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea']
houses[garage_cols][houses['GarageType'].isnull()]
[25]:
GarageType GarageQual GarageCond GarageYrBlt GarageFinish GarageCars GarageArea
39 NaN NaN NaN NaN NaN 0 0
48 NaN NaN NaN NaN NaN 0 0
78 NaN NaN NaN NaN NaN 0 0
88 NaN NaN NaN NaN NaN 0 0
89 NaN NaN NaN NaN NaN 0 0
... ... ... ... ... ... ... ...
1349 NaN NaN NaN NaN NaN 0 0
1407 NaN NaN NaN NaN NaN 0 0
1449 NaN NaN NaN NaN NaN 0 0
1450 NaN NaN NaN NaN NaN 0 0
1453 NaN NaN NaN NaN NaN 0 0

81 rows × 7 columns

All garage-related features are missing values in the same rows, which means we can replace the categorical variables with ‘None’ and the continuous ones with 0.

[26]:
for col in garage_cols:
    if houses[col].dtype == object:  # np.object is removed in NumPy >= 1.24
        houses[col] = houses[col].fillna('None')
    else:
        houses[col] = houses[col].fillna(0)