Executive Summary

The Boston Housing Market dataset is ubiquitous but imperfect: with problems like small size, inconsistent definitions, incorrect coordinates and many many. However, it is still a very rich dataset containing informative geographical information, powerful socioeconomic indicators, and continuous levels of Nitrogen Oxides (NOx). This project explores the effect of developing low income neighbourhoods on NOx. This involves three logical steps: 1) Verifying that the dataset is rich enough to form clusters of economic class, 2) train a regressor for predicting NOx values, and finally 3) creating synthetic data simulating ‘improved’ low income neighbourhoods by bootstrapping values from higher income classes, while keeping geographical constraints fixed. To address the first issue, K-means is used to cluster the towns into 3 income classes (low, medium and high). The second point is achieved through a Support Vector Regression model achieving an accuracy of 88%. The last issue is addressed by means of replacing non-geographically constrained attributes of the low income cluster with that of a bootstrapped sample from the high inocome cluster. The evidence suggests that improving low income neighbourhoods does indeed decrease overall NOx levels, giving non-humanitarian reasons for supporting social uplifting policy. This project also corrects erroneous longitude and latitude values of the Boston dataset using Google’s geocoder API. The code and documentation for this project can be found here.
The aims and objectives of the project were defined as follows:

Explore novel avenues in famous Boston dataset

Typical projects with the Boston dataset focus on regression analyses on house prices, while others focus on a more geo-spatial analysis of the data (e.g. clustering). While both of these are quite interesting, our goal was to challenge ourselves by trying to answer a research question that could directly to policy recommendations.

Verify that dataset is informative to solve question

An important assumption our question makes is that the data is informative enough to create clusters of areas separated by income. Success Criteria: find clusters that are comparable to external literature.

Find the best model for predicting NOx levels

The project involves finding a regresion model to predict NOx levels, as well as choosing the best regressors for this. Success Criteria: find models that give high accuracy and normally distributed residuals.

Simulate 'development' in Low-income towns

This is the 'novel' part of this project. What does this mean that a town has been 'developed'? Are any parameters constrained, for example by geography? Are there socioeconomic parameters that could be changed by policy recommendations? Success Criteria: find a reasonable way to simulate 'development' in a low income town.





Results

A gallery of relevant plots

Design Flowchart

Showing model selection logic

Raw Data

Before transormation/normalization

Correlations #1

Pearon Correlation Matrix

Income Clusters

Red to yellow represents low to high class

Correlations #2

PhiK (includes categorical variables) Correlation Matrix

Scatter plots

Against 4/5 regressors (ordinal RAD excluded)

RAD 'scatter' plot

A plot showing predicted vs actual RAD vs. NOx values

Elbow plot

Visualisation of different K's

All data normalized

Boxplots and histograms


Conclusions

Future Work



To view the entire project, the reader is referred to the report and the GitHub repository

team

The MechEng Defectors

Yousef Nami

Kyriacos Theocharides