The Boston Housing Market dataset is ubiquitous but imperfect: it suffers from
a small sample size, inconsistent definitions, incorrect coordinates
and many other problems. However, it is still a very rich dataset containing
informative geographical information, powerful socioeconomic
indicators, and continuous levels of nitrogen oxides (NOx). This project explores the effect of developing low-income
neighbourhoods on NOx. This involves three logical steps: 1) verifying
that the dataset is rich enough to form clusters of economic class, 2)
training a regressor to predict NOx values, and finally 3) creating
synthetic data simulating ‘improved’ low-income neighbourhoods by
bootstrapping values from higher-income classes, while keeping
geographical constraints fixed. For the first step, K-means is used to cluster the towns into 3 income classes (low, medium and high). The second step is achieved with a Support Vector Regression model that reaches an accuracy of 88%.
The last step is addressed by replacing the non-geographically constrained attributes of the low-income cluster with those of a bootstrapped sample from the high-income cluster. The evidence suggests that improving low-income neighbourhoods does
indeed decrease overall NOx levels, giving non-humanitarian reasons for
supporting social upliftment policies. This project also corrects erroneous
longitude and latitude values of the Boston dataset using Google’s
geocoder API. The code and documentation for this project can be found here.
The aims and objectives of the project were defined as follows:
Explore novel avenues in the famous Boston dataset
Typical projects with the Boston dataset focus on regression analyses of house prices, while others focus on a more geospatial analysis of the data (e.g. clustering). While both of these are quite interesting,
our goal was to challenge ourselves by trying to answer a research question that could lead directly to policy recommendations.
Verify that the dataset is informative enough to answer the question
An important assumption our question makes is that the data is informative enough to create clusters of areas separated by income. Success Criteria: find clusters that are comparable to external literature.
Find the best model for predicting NOx levels
The project involves finding a regression model to predict NOx levels, as well as choosing the best regressors for it. Success Criteria: find models that give high accuracy and normally distributed residuals.
Simulate 'development' in Low-income towns
This is the 'novel' part of this project. What does it mean for a town to have been 'developed'? Are any parameters constrained, for example by geography? Are there socioeconomic parameters that could be changed by policy recommendations?
Success Criteria: find a reasonable way to simulate 'development' in a low income town.
Methodology
The first step of the project was to ensure that the dataset was actually viable for the project. For this, K-means clustering was used as a first benchmark for comparison with the literature. This provided reasonably good results, and thus more involved clustering methods were not used.
This was compared to a choropleth by Boston University found here.
Data Correction
The team noticed that the latitude and longitude values were incorrect, as the initial plots looked like something found here. These were corrected promptly using Google's Geocoder API.
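A minimal sketch of what such a correction could look like, under several assumptions that the text above does not confirm: the data sits in a pandas DataFrame with `TOWN`, `LAT` and `LON` columns, one lookup is made per town name, and the query string is built by appending ", MA, USA". The function names `extract_lat_lng` and `correct_coordinates` are hypothetical. The only detail taken from Google's Geocoding API is the shape of a result, where the coordinates sit under `geometry.location.lat` and `geometry.location.lng`. A stand-in geocoder with made-up coordinates is used so that the sketch runs offline; with the `googlemaps` Python client, `googlemaps.Client(key=...).geocode` could be passed in its place.

```python
import pandas as pd


def extract_lat_lng(results):
    """Return (lat, lng) from a Geocoding API result list, or None if empty."""
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


def correct_coordinates(df, geocode, query_col="TOWN"):
    """Overwrite LAT/LON using one geocoder lookup per distinct query string."""
    corrected = df.copy()
    for query in corrected[query_col].unique():
        # Assumption: the town name plus a state suffix is a good enough query.
        coords = extract_lat_lng(geocode(f"{query}, MA, USA"))
        if coords is None:
            continue  # keep the original values when nothing is found
        lat, lng = coords
        mask = corrected[query_col] == query
        corrected.loc[mask, "LAT"] = lat
        corrected.loc[mask, "LON"] = lng
    return corrected


def fake_geocode(address):
    """Offline stand-in for a real geocoder; the coordinates are made up."""
    if address.startswith("Nahant"):
        return [{"geometry": {"location": {"lat": 1.0, "lng": -2.0}}}]
    return []


towns = pd.DataFrame({
    "TOWN": ["Nahant", "Nahant", "Unknown"],
    "LAT": [0.0, 0.0, 5.0],
    "LON": [0.0, 0.0, 6.0],
})
fixed = correct_coordinates(towns, fake_geocode)
print(fixed)
```

Rows whose lookup returns nothing keep their original values, which makes failed lookups easy to spot afterwards.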
Features Used
The features used to produce the final clustering were CMEDV, INDUS, AGE and LSTAT. These features were naturally cleaned, transformed and normalized to minimize the effect of scale and to ensure that the distributions have similar variance.
Outcome
The clustering produced 3 classes: low-income, medium-income and high-income dwellings. This would prove useful for the final part of the project, which involved 'developing' a low-income town.
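The clustering step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the data is synthetic (three made-up groups standing in for the CMEDV, INDUS, AGE and LSTAT columns), `StandardScaler` stands in for whatever normalization was applied, and naming clusters by their mean CMEDV is one possible labelling rule. The loop over `k` gives the inertia values that an elbow plot would show.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the clustering features, NOT the real Boston data.
# Columns: CMEDV, INDUS, AGE, LSTAT.
centres = np.array([
    [12.0, 19.0, 95.0, 22.0],  # low-income-like towns
    [22.0, 10.0, 65.0, 11.0],  # medium-income-like towns
    [38.0, 3.0, 30.0, 4.0],    # high-income-like towns
])
X = np.vstack([c + rng.normal(scale=1.5, size=(50, 4)) for c in centres])

# Normalize so that no feature dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# Inertia for several values of k, as would be shown in an elbow plot.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in range(1, 7)
]

# Final clustering with 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Name the clusters by their mean CMEDV (assumed labelling rule).
mean_cmedv = np.array([X[km.labels_ == k, 0].mean() for k in range(3)])
order = np.argsort(mean_cmedv)
names = {int(order[0]): "low", int(order[1]): "medium", int(order[2]): "high"}
income_class = np.array([names[int(k)] for k in km.labels_])

print(dict(zip(*np.unique(income_class, return_counts=True))))
```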
The second part of the project involved creating a regression model that can predict NOx levels. For this, many models were implemented, including linear, Lasso, Ridge and Support Vector regression, as well as neural networks. The best model, considering both simplicity and accuracy, was the Support Vector Regression model, which gave an accuracy of 88%, 10% higher than the next two best models.
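A comparison of this kind could be set up roughly as below. Everything here is an assumption made for illustration: the data is synthetic with a deliberately non-linear target, the hyperparameters are arbitrary, and the cross-validated R² score is used as the metric because the text does not say how "accuracy" was measured. Neural networks are left out to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 300

# Synthetic features standing in for DIS, RAD, INDUS, CRIM and AGE.
X = rng.uniform(0, 1, size=(n, 5))
# Synthetic non-linear target standing in for NOx.
y = (0.5 + 0.2 * np.sin(4 * X[:, 0]) + 0.1 * X[:, 2] ** 2
     + rng.normal(scale=0.02, size=n))

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.001),
    "ridge": Ridge(alpha=1.0),
    "svr": SVR(kernel="rbf", C=10.0, epsilon=0.01),
}

# Scale inside the pipeline so each fold is standardized on its own
# training split only.
scores = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>6}: mean R^2 = {score:.3f}")
```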
Sanity Check
The large jump in accuracy is not a surprise: one of the regressors is RAD, an ordinal index. The SVR can pick up on this, whereas the linear models fail to.
Verification
The model's residuals were plotted to check that the model gives a reasonable output (necessary if bootstrapping is to be used).
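A numerical companion to such a residual plot is sketched below. The project itself only mentions plotting. The Shapiro-Wilk test from SciPy is swapped in here as one possible normality check, and the data, the train/test split and the SVR settings are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for the regression features and the NOx target.
X = rng.uniform(0, 1, size=(300, 5))
y = 0.5 + 0.2 * X[:, 0] + 0.1 * X[:, 4] + rng.normal(scale=0.02, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = make_pipeline(
    StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01)
).fit(X_train, y_train)

# Residuals on held-out data: actual minus predicted.
residuals = y_test - model.predict(X_test)

# Shapiro-Wilk: a small p-value is evidence against normality.
w_stat, p_value = stats.shapiro(residuals)
print(f"mean residual = {residuals.mean():.4f}")
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.3f}")
```

A histogram or Q-Q plot of `residuals` would give the visual counterpart of this check.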
Outcome
A regression model capable of predicting NOx levels to 88% accuracy. The features used are split between geographical (DIS, RAD and INDUS) and socioeconomic (CRIM and AGE).
The final part of the project was actually 'developing' an area. To do this, bootstrapping was used. Effectively, the distribution of the high-income cluster was learnt.
The non-geographically constrained (i.e. socioeconomic) attributes of the low-income cluster were replaced with those of the bootstrapped sample, thus 'simulating' improvement.
The improved data points were then fed into the regression model, which predicted new NOx levels. The figure to the right shows the improvement by means of colours.
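The replacement step can be sketched as below. The split into geographical (DIS, RAD, INDUS) and socioeconomic (CRIM, AGE) columns follows the feature list given for the regression model. The function name `develop`, the toy values and the choice to resample whole rows (so that CRIM and AGE stay paired) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

GEO_COLS = ["DIS", "RAD", "INDUS"]  # geographically constrained, held fixed
SOCIO_COLS = ["CRIM", "AGE"]        # socioeconomic, replaced


def develop(low, high, socio_cols, rng):
    """Replace socio_cols of `low` with rows resampled from `high`."""
    # Bootstrap: draw row positions from `high` with replacement.
    draw = rng.integers(0, len(high), size=len(low))
    developed = low.copy()
    developed[socio_cols] = high[socio_cols].to_numpy()[draw]
    return developed


rng = np.random.default_rng(0)

# Toy clusters with made-up values, NOT the real Boston data.
low = pd.DataFrame({
    "DIS": [1.5, 2.0, 1.8, 2.2],
    "RAD": [24, 24, 5, 4],
    "INDUS": [18.1, 18.1, 19.6, 21.9],
    "CRIM": [8.0, 12.0, 9.5, 15.0],
    "AGE": [95.0, 98.0, 90.0, 100.0],
})
high = pd.DataFrame({
    "DIS": [6.0, 7.5, 5.5],
    "RAD": [3, 4, 5],
    "INDUS": [2.5, 3.0, 4.0],
    "CRIM": [0.02, 0.05, 0.03],
    "AGE": [30.0, 45.0, 25.0],
})

developed = develop(low, high, SOCIO_COLS, rng)
print(developed)
```

The resulting frame, restricted to `GEO_COLS + SOCIO_COLS`, is what would be passed to the fitted regressor's `predict` method to obtain the new NOx estimates.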
Another experiment was undertaken to explore the impact of changing individual variables while keeping all else constant. This analysis was limited to policy-affected (i.e. socioeconomic) variables. The method of partial dependence was used to visualise the effects.
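Partial dependence for a single feature can be computed by hand, as in the sketch below: every row is forced to each grid value in turn, and the model's predictions are averaged. The helper name `partial_dependence_1d`, the synthetic data and the SVR settings are assumptions for illustration. scikit-learn also offers this through its `sklearn.inspection` module, which may be what the project used.

```python
import numpy as np
from sklearn.svm import SVR


def partial_dependence_1d(model, X, feature_idx, grid):
    """Average prediction when one feature is fixed at each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # force every row to this value
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


rng = np.random.default_rng(0)

# Synthetic columns standing in for DIS, RAD, INDUS, CRIM, AGE.
X = rng.uniform(0, 1, size=(200, 5))
# Synthetic target standing in for NOx, rising with CRIM and AGE.
y = (0.4 + 0.15 * X[:, 4] + 0.05 * X[:, 3]
     + rng.normal(scale=0.01, size=200))

model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, y)

AGE = 4  # column index of the feature being varied
grid = np.linspace(X[:, AGE].min(), X[:, AGE].max(), 20)
pd_age = partial_dependence_1d(model, X, AGE, grid)

for value, avg in zip(grid[::5], pd_age[::5]):
    print(f"AGE = {value:.2f} -> mean predicted target = {avg:.3f}")
```

Plotting `pd_age` against `grid` gives the usual partial dependence curve for that feature.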
Policy Implications
Our analysis was limited to non-geographical constraints. In real life, this would correspond to implementing policies that affect crime rate and house age. For the latter, this would mean funding to improve old infrastructure in the inner city.
Outcome
The evidence suggests that improving low-income towns leads to an overall decrease in NOx values within them.
A plot of predicted vs. actual NOx values against RAD
Elbow plot
Visualisation of different K's
All data normalized
Boxplots and histograms
Conclusions
The evidence suggests that improving low-income neighbourhoods correlates with lower NOx values.
The statistical effect is that the distribution of NOx in the low-income neighbourhoods doesn’t change its peak.
Based on the research, it appears that CRIM and AGE are strong humanly-changeable indicators of NOx pollution.
Future Work
Causality: explore whether AGE / CRIM can be deemed to cause changes in NOx levels.
Data augmentation: this work did much in terms of augmenting the longitudes and latitudes and creating synthetic data for ‘developed’ low-income neighbourhoods. Future work could look at neural-network-driven data augmentation techniques (such as GANs).