Data Overview

housing_data

Data Cleaning

data_cleaning

Creating new relevant variables and dropping the variables we will not use

new variables
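As a rough illustration of this step, the sketch below derives the three ratios used later in the analysis (rooms per unit, persons per household, and house value to income) and then drops the raw counts. The column names, the choice of which columns to drop, and the sample record are all assumptions made for the example; they are not taken from the original code. Plain Python is used so the sketch stays self-contained.

```python
# Hypothetical record; the column names and values are assumptions made
# purely for illustration and are not taken from the actual dataset.
block = {
    "total_rooms": 600,
    "population": 360,
    "households": 120,
    "median_income": 40000.0,       # assumed to be in USD
    "median_house_value": 200000.0,
}

# Assumed list of raw columns that are no longer needed after deriving ratios.
UNUSED_COLUMNS = ("total_rooms", "population")


def add_derived_variables(row):
    """Return a copy of `row` with the three derived ratios added."""
    out = dict(row)
    out["rooms_per_unit"] = row["total_rooms"] / row["households"]
    out["persons_per_household"] = row["population"] / row["households"]
    out["house_value_to_income"] = row["median_house_value"] / row["median_income"]
    return out


def drop_unused(row, unused=UNUSED_COLUMNS):
    """Return a copy of `row` without the columns listed in `unused`."""
    return {name: value for name, value in row.items() if name not in unused}


prepared = drop_unused(add_derived_variables(block))
print(prepared)
```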

Preparing the data for clustering – Step 1

clustering step 1

Preparing the data for clustering – Step 2

clustering step 2

SSE curve (We see an elbow at k=4)

sse curve
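For readers unfamiliar with the SSE (sum of squared errors) curve, here is a minimal sketch of how such a curve can be computed: run k-means for several values of k and record the total squared distance from each point to its nearest centroid. This is a plain implementation of Lloyd's algorithm with a deterministic start, written for illustration only; the original analysis may well have used a library routine. The toy points are invented so that four groups are clearly visible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with evenly spaced starting centroids (an assumption)."""
    n = len(points)
    centroids = [points[i * n // k] for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids


def sse(points, centroids):
    """Total squared distance from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def sse_curve(points, k_values):
    """Map each k to the SSE of a k-means fit with that many clusters."""
    return {k: sse(points, kmeans(points, k)) for k in k_values}


# Invented toy data: four tight groups near the corners of a square.
toy_points = [
    (0, 0), (0, 1), (1, 0),
    (10, 0), (10, 1), (11, 0),
    (0, 10), (0, 11), (1, 10),
    (10, 10), (10, 11), (11, 10),
]

curve = sse_curve(toy_points, range(1, 7))
for k, value in curve.items():
    print(k, round(value, 2))
```

On data like this the SSE falls steeply up to k=4 and only slightly afterwards, which is the "elbow" pattern one looks for when choosing k.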

Data Visualizations

visualization

Visualization of Median House Value by Ocean Proximity

Housevalue by ocean code

From the graph we see that:

Visualization of Median House Value by Median Income

Housevalue by income code

From the graph we can see that:

Visualization of Median House Value by Persons Per Household

Housevalue by person code

Here, we limit persons per household to 15 to exclude outliers. Some records may come from a shopping centre or a hotel, which would give unrepresentative values.
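A cap like this can be expressed as a simple filter. The sketch below assumes the limit is applied by discarding rows above the threshold; the original may instead have restricted the plot axis, which is not stated. The same hypothetical helper also covers the cap of 5 rooms per unit used in a later plot. Column names and sample rows are invented for illustration.

```python
def within_limit(rows, column, upper):
    """Keep only the rows whose `column` value does not exceed `upper`."""
    return [row for row in rows if row[column] <= upper]


# Invented sample rows; the column names are assumptions.
sample = [
    {"persons_per_household": 2.8, "rooms_per_unit": 4.9},
    {"persons_per_household": 3.4, "rooms_per_unit": 6.2},
    {"persons_per_household": 240.0, "rooms_per_unit": 3.1},  # e.g. a hotel
]

plot_rows_persons = within_limit(sample, "persons_per_household", 15)
plot_rows_rooms = within_limit(sample, "rooms_per_unit", 5)
print(len(plot_rows_persons), len(plot_rows_rooms))
```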

From the graph we can see that:

Visualization of Median House Value by Rooms Per Unit

Housevalue by room code

From the graph we see that:

Visualization of Median Age Group by Ocean Proximity

Age by ocean code

From the graph we can see that:

Visualization of Rooms Per Unit by House Value to Income

Room housevalue by income code

House value to income is the ratio of median house value to median income. The maximum number of rooms per unit has been limited to 5 to exclude outliers.

From this graph we can see that:

Linear Modelling (for each cluster)

linear model

Step 1: Checking correlations cluster wise

Correlation - Cluster1

Correlation

Correlation - Cluster2

Correlation 2

Correlation - Cluster3

Correlation 3

Correlation - Cluster4

Correlation 4
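The cluster-wise correlation check can be sketched as follows: group the rows by cluster label and compute the Pearson correlation between a predictor and the median house value within each group. The column names, the `cluster` label column, and the sample rows are assumptions for illustration; the original analysis presumably produced full correlation matrices with a library routine.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_cluster(rows, x_col, y_col, cluster_col="cluster"):
    """Return {cluster label: Pearson r between x_col and y_col}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_col]].append(row)
    return {
        label: pearson([r[x_col] for r in group], [r[y_col] for r in group])
        for label, group in groups.items()
    }


# Invented rows: income and value move together in cluster 1 and in
# opposite directions in cluster 2.
rows = [
    {"cluster": 1, "median_income": 1.0, "median_house_value": 2.0},
    {"cluster": 1, "median_income": 2.0, "median_house_value": 4.0},
    {"cluster": 1, "median_income": 3.0, "median_house_value": 6.0},
    {"cluster": 2, "median_income": 1.0, "median_house_value": 6.0},
    {"cluster": 2, "median_income": 2.0, "median_house_value": 4.0},
    {"cluster": 2, "median_income": 3.0, "median_house_value": 2.0},
]

result = correlation_by_cluster(rows, "median_income", "median_house_value")
print(result)
```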

Step 2: Checking Significance

Significance

Our aim is to predict the median house value for a particular area in California.

Let us check the significance of Median Income, Persons Per Household and Rooms Per Unit on the Median House Value so that we can keep the relevant variables in our linear model.

Significance of Median Income, Persons per Household and Rooms Per Unit on Median House Value for models with Cluster 1 and Cluster 2 respectively

significance 1 significance 2

From the results above we see that:

Significance of Median Income, Persons per Household and Rooms Per Unit on Median House Value for models with Cluster 3 and Cluster 4 respectively

significance 3 significance 4

From the results above we see that:

Step 3: Run Linear Regression for all the models

Linear model

We therefore create a function for linear modelling that can fit the model either with or without persons per household.

Finally, we call the function to model all the data frames and check the summary for each model.
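A function of this kind might look roughly like the sketch below. It fits ordinary least squares through the normal equations, with a flag that switches persons per household on or off. The function name, the flag, the column names, and the sample rows are all hypothetical; the original most likely relied on a statistics package rather than hand-written linear algebra, which is used here only to keep the example self-contained.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no perfectly collinear predictors).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear_model(rows, include_persons=True):
    """Least-squares fit of median house value; returns {term: coefficient}."""
    predictors = ["median_income", "rooms_per_unit"]
    if include_persons:
        predictors.append("persons_per_household")
    design = [[1.0] + [row[p] for p in predictors] for row in rows]
    target = [row["median_house_value"] for row in rows]
    k = len(predictors) + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, target)) for i in range(k)]
    coefficients = solve_linear_system(xtx, xty)
    return dict(zip(["intercept"] + predictors, coefficients))


# Invented rows for one cluster; the target follows an exact linear rule
# so the fitted coefficients are easy to check by eye.
example_rows = [
    {"median_income": i, "rooms_per_unit": r, "persons_per_household": p,
     "median_house_value": 10 + 2 * i + 3 * r - 4 * p}
    for i, r, p in [(1, 2, 3), (2, 1, 4), (3, 5, 1), (4, 3, 2), (5, 4, 6), (6, 6, 5)]
]

with_persons = fit_linear_model(example_rows, include_persons=True)
without_persons = fit_linear_model(example_rows, include_persons=False)
print(with_persons)
print(without_persons)
```

To model every cluster, the same function would be called once per cluster data frame, choosing the flag according to the significance results for that cluster.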

Linear Regression -> Cluster 1

summary1

The regression equation for this model is:

Median house value = (4.770 * Median Income (USD)) + (-12630 * rooms per unit) + 13050

Adjusted R-squared = 0.4627

Linear Regression -> Cluster 2

summary 2

The regression equation for this model is:

Median house value = (2.610 * Median Income (USD)) + (51110 * rooms per unit) + (-13230 * Persons Per Household) + 83551

Adjusted R-squared = 0.5556

Linear Regression -> Cluster 3

summary 3

The regression equation for this model is:

Median house value = (3.341 * Median Income (USD)) + (1.291 * rooms per unit) + 9864

Adjusted R-squared = 0.4938

Linear Regression -> Cluster 4

summary 4

The regression equation for this model is:

Median house value = (3.438 * Median Income (USD)) + (4284 * rooms per unit) + (-13410 * Persons Per Household) + 94970

Adjusted R-squared = 0.5312
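To show how the four fitted equations would be used for prediction, the sketch below stores the coefficients reported for each cluster and evaluates the matching equation for a given area. The coefficients are copied from the equations in this section; the function name, the dictionary layout, and the input values (an income of 50,000 USD, 5 rooms per unit, 3 persons per household) are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the four regression equations reported in this
# section. Clusters 1 and 3 have no persons-per-household term.
MODELS = {
    1: {"intercept": 13050, "median_income": 4.770, "rooms_per_unit": -12630},
    2: {"intercept": 83551, "median_income": 2.610, "rooms_per_unit": 51110,
        "persons_per_household": -13230},
    3: {"intercept": 9864, "median_income": 3.341, "rooms_per_unit": 1.291},
    4: {"intercept": 94970, "median_income": 3.438, "rooms_per_unit": 4284,
        "persons_per_household": -13410},
}


def predict_median_house_value(cluster, **features):
    """Evaluate the regression equation of `cluster` for the given features.

    Features that the cluster's model does not use are ignored.
    """
    coefficients = MODELS[cluster]
    value = coefficients["intercept"]
    for name, weight in coefficients.items():
        if name != "intercept":
            value += weight * features[name]
    return value


# Hypothetical area, used only to demonstrate the calculation.
area = {"median_income": 50000, "rooms_per_unit": 5, "persons_per_household": 3}

for cluster in sorted(MODELS):
    print(cluster, round(predict_median_house_value(cluster, **area), 2))
```

For cluster 1, for example, this works out to 4.770 * 50000 - 12630 * 5 + 13050 = 188400.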

Insights/Hypotheses