Executive Summary

The Boston Housing Market dataset is ubiquitous but imperfect: it suffers from problems such as small size, inconsistent definitions, incorrect coordinates and many more. However, it is still a very rich dataset containing informative geographical information, powerful socioeconomic indicators, and continuous levels of Nitrogen Oxides (NOx). This project explores the effect of developing low income neighbourhoods on NOx. This involves three logical steps: 1) verifying that the dataset is rich enough to form clusters of economic class, 2) training a regressor for predicting NOx values, and finally 3) creating synthetic data simulating ‘improved’ low income neighbourhoods by bootstrapping values from higher income classes, while keeping geographical constraints fixed. For the first step, K-means is used to cluster the towns into 3 income classes (low, medium and high). The second step is achieved through a Support Vector Regression model with an accuracy of 88%. The last step replaces the non-geographically constrained attributes of the low income cluster with those of a bootstrapped sample from the high income cluster. The evidence suggests that improving low income neighbourhoods does indeed decrease overall NOx levels, giving non-humanitarian reasons for supporting social uplifting policy. This project also corrects erroneous longitude and latitude values of the Boston dataset using Google’s geocoder API. The code and documentation for this project can be found here.

The aims and objectives of the project were defined as follows:

Explore novel avenues in famous Boston dataset

Typical projects with the Boston dataset focus on regression analyses of house prices, while others focus on a more geo-spatial analysis of the data (e.g. clustering). While both of these are quite interesting, our goal was to challenge ourselves by trying to answer a research question that could lead directly to policy recommendations.

Verify that dataset is informative to solve question

An important assumption our question makes is that the data is informative enough to create clusters of areas separated by income. Success Criteria: find clusters that are comparable to external literature.

Find the best model for predicting NOx levels

The project involves finding a regression model to predict NOx levels, as well as choosing the best regressors for it. Success Criteria: find models that give high accuracy and normally distributed residuals.

Simulate 'development' in Low-income towns

This is the 'novel' part of this project. What does it mean for a town to have been 'developed'? Are any parameters constrained, for example by geography? Are there socioeconomic parameters that could be changed by policy recommendations? Success Criteria: find a reasonable way to simulate 'development' in a low income town.

Methodology

The first step of the project was to ensure that the dataset was actually viable for the project. For this, K-means clustering was used as a first benchmark for comparison with the literature. This provided reasonably good results, and thus more involved clustering methods were not used. The clusters were compared to a choropleth by Boston University found here.

Data Correction

The team noticed that the latitude and longitude values were incorrect, as the initial plots looked like something found here. These were corrected promptly using Google's Geocoder API.
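As a rough illustration of the correction step, the sketch below builds a request URL for Google's Geocoding web service and extracts a latitude/longitude pair from a decoded JSON response. The helper names, the address string and the sample payload are hypothetical and written by hand for illustration; the project's actual lookup code may differ, and no network call is made here.

```python
from urllib.parse import urlencode

# Public endpoint of Google's Geocoding web service.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL for a single address lookup (hypothetical helper)."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_location(payload):
    """Return (lat, lng) from a decoded JSON response, or None if there is no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written payload in the shape the service returns; the numbers are
# illustrative only, not real API output.
sample = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 42.43, "lng": -70.92}}}],
}

url = build_geocode_url("Nahant, MA", "YOUR_API_KEY")
coords = parse_location(sample)
```

In practice each town name would be sent to the URL built above, and the returned pair would overwrite the faulty longitude and latitude values for that row.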

Features Used

The features used to produce the final clustering method were CMEDV, INDUS, AGE and LSTAT. These features were naturally cleaned and transformed (and normalized) to minimize the effect of scale (and to ensure that the distributions have similar variance).
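The clustering step can be sketched as follows, assuming scikit-learn and standard scaling as the normalization. The data here is synthetic and only mimics three loosely separated groups; the cluster-naming rule (ordering clusters by mean CMEDV) is an assumption made for this example, not necessarily the rule used in the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the four clustering features (NOT the real data).
# Column order: CMEDV, INDUS, AGE, LSTAT
low = rng.normal([15, 18, 90, 20], [2, 2, 5, 3], size=(50, 4))
mid = rng.normal([23, 10, 65, 11], [2, 2, 5, 2], size=(50, 4))
high = rng.normal([38, 3, 35, 4], [3, 1, 5, 1], size=(50, 4))
X = np.vstack([low, mid, high])

# Put every feature on a comparable scale before clustering.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_

# Name clusters by their mean CMEDV (assumed rule): lowest value -> "low".
order = np.argsort([X[labels == k, 0].mean() for k in range(3)])
names = {int(order[0]): "low", int(order[1]): "medium", int(order[2]): "high"}
income_class = np.array([names[int(k)] for k in labels])
```

K-means assigns arbitrary integer labels, so some explicit rule like the one above is needed before a cluster can be called "low income".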

Outcome

The clustering produced 3 classes: low income, medium income and high income dwellings. This would prove useful for the final part of the project, which involved 'developing' a low income town.

The second part of the project involved creating a regression model that can predict NOx levels. For this, many models were implemented, including Linear, Lasso, Ridge and Support Vector regression, as well as neural networks. The best model, considering simplicity and accuracy, was the Support Vector Regression model, giving an accuracy of 88%, 10% higher than the next two best models.

Sanity Check

The large accuracy jump is not a surprise: one of the regressors is RAD, an index that is ordinal. It is not surprising that the SVR can pick up on this, whereas the linear models fail.
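A minimal sketch of this kind of comparison is shown below, assuming scikit-learn and held-out R² as the score (the source does not say how "accuracy" was computed). The data is synthetic: `rad_index` is a hypothetical ordinal index whose effect on the target is deliberately non-linear, which is the situation where a kernel method can be expected to outperform a linear fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data (NOT the Boston dataset).
rad_index = rng.integers(1, 10, size=n)   # hypothetical ordinal index, 1..9
dis = rng.uniform(1.0, 10.0, size=n)      # hypothetical distance feature
nox = (0.45 + 0.01 * (rad_index - 5) ** 2 - 0.005 * dis
       + rng.normal(0, 0.01, size=n))

X = np.column_stack([rad_index, dis])
X_train, X_test, y_train, y_test = train_test_split(
    X, nox, test_size=0.25, random_state=0
)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    # epsilon is kept small because the target only spans a few tenths.
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.01)),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
```

The same loop extends to Lasso, Ridge or a neural network by adding entries to `models`.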

Verification

The model's residuals were plotted to check that the model gives reasonable output (necessary if bootstrapping is to be used).
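The project checked the residuals visually. As a numerical complement, the sketch below computes residuals for a stand-in linear model on synthetic data and applies the Shapiro-Wilk normality test from SciPy. Using this test is an assumption of the example, not something the source reports doing.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression problem with Gaussian noise (NOT the Boston data).
X = rng.normal(size=(200, 3))
y = X @ np.array([0.05, -0.03, 0.02]) + 0.55 + rng.normal(0, 0.02, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Shapiro-Wilk: a small p-value is evidence against normally distributed residuals.
statistic, p_value = stats.shapiro(residuals)

# Skewness near zero is another quick sign of a symmetric residual distribution.
skewness = stats.skew(residuals)
```

A histogram or Q-Q plot of `residuals` gives the visual counterpart of the same check.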

Outcome

A regression model capable of predicting NOx levels to 88% accuracy. The features used are split between geographical (DIS, RAD and INDUS) and socioeconomic (CRIM and AGE).

The final part of the project was actually 'developing' an area. To do this, bootstrapping was used. Effectively, the distribution of the high-income cluster was learnt. The non-geographically constrained (i.e. socioeconomic) attributes of the low-income cluster were replaced with those of the bootstrapped sample, thus 'simulating' improvement. The improved data points were then fed into the regression model, which predicted new NOx levels. The figure to the right shows the improvement by means of colours.
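The replacement step can be sketched as below, assuming pandas and row-wise resampling with replacement from the high-income cluster. The two clusters are synthetic, and the `fake_cluster` helper is hypothetical. The split into geographical (DIS, RAD, INDUS) and socioeconomic (CRIM, AGE) columns follows the features listed for the regression model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
geo_cols = ["DIS", "RAD", "INDUS"]   # geographically constrained, kept fixed
socio_cols = ["CRIM", "AGE"]         # socioeconomic, replaced by bootstrapping


def fake_cluster(n, crim_mean, age_mean):
    """Generate synthetic stand-in rows (NOT the Boston data)."""
    return pd.DataFrame({
        "DIS": rng.uniform(1, 10, n),
        "RAD": rng.integers(1, 9, n),
        "INDUS": rng.uniform(1, 25, n),
        "CRIM": rng.exponential(crim_mean, n),
        "AGE": rng.normal(age_mean, 10, n).clip(0, 100),
    })


low = fake_cluster(60, crim_mean=8.0, age_mean=90)
high = fake_cluster(40, crim_mean=0.2, age_mean=40)

# Bootstrap: draw one high-income row (with replacement) per low-income town.
boot = high.sample(n=len(low), replace=True, random_state=0)

# Overwrite only the socioeconomic columns; geography is left untouched.
developed = low.copy()
developed[socio_cols] = boot[socio_cols].to_numpy()
```

The `developed` rows would then be passed to the fitted regressor, and the predicted NOx compared with the predictions for the original `low` rows.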

Another experiment was undertaken to explore the impact of changing one variable while keeping all else constant. This analysis was limited to policy-affected (i.e. socioeconomic) variables. The method of partial dependence was used to visualise the effects.
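Partial dependence for one feature can be computed by hand: force that feature to each value on a grid for every row, predict, and average the predictions. The sketch below does this for a stand-in linear model on synthetic AGE and CRIM columns. The function name and the data are hypothetical, and scikit-learn's `sklearn.inspection` module offers ready-made tools for the same purpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def partial_dependence_1d(model, X, col, grid):
    """Average prediction when column `col` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, col] = value          # same value for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Boston dataset), noiseless for clarity.
age = rng.uniform(20, 100, 200)
crim = rng.exponential(3.0, 200)
X = np.column_stack([age, crim])
nox = 0.40 + 0.002 * age + 0.005 * crim

model = LinearRegression().fit(X, nox)

grid = np.linspace(20, 100, 5)
pdp_age = partial_dependence_1d(model, X, col=0, grid=grid)
```

Plotting `pdp_age` against `grid` gives the partial dependence curve for AGE; for a linear model it is a straight line, while a model such as SVR can produce a curved one.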

Policy Implications

Our analysis was limited to non-geographical constraints. In real life, this would correspond to implementing policies that affect crime rate and house age. For the latter, this would mean funding to improve old infrastructure in the inner city.

Outcome

The evidence suggests that improving low-income towns leads to an overall decrease in NOx values within them.

Results

A gallery of relevant plots

Design Flowchart

Showing model selection logic

Raw Data

Before transformation/normalization

Correlations #1

Pearson Correlation Matrix

Income Clusters

Red to yellow represents low to high class

Correlations #2

PhiK (includes categorical variables) Correlation Matrix

Scatter plots

Against 4/5 regressors (ordinal RAD excluded)

RAD 'scatter' plot

A plot showing predicted and actual NOx values against RAD

Elbow plot

Visualisation of different K's

All data normalized

Boxplots and histograms

Conclusions

The evidence suggests that improving low income neighbourhoods correlates with lower NOx values.
The statistical effect is that the distribution of NOx in the low income neighbourhoods doesn’t change its peak.
Based on the research, it appears that CRIM and AGE are strong indicators of NOx pollution that can be changed by human intervention.

Future Work

causality: could you explore if AGE / CRIM can be deemed to cause
data augmentation: this work did much in terms of augmenting the longitude and latitude values and creating synthetic data for ‘developed’ low income neighbourhoods. Future work could look into neural network driven data augmentation techniques (such as GANs).

How will developing low income neighbourhoods in Boston affect NOx levels?

To view the entire project, the reader is referred to the report and the GitHub repository

Team

The MechEng Defectors

Yousef Nami

Kyriacos Theocharides