Predicting house prices with regression
Let's start with a simple problem, namely, predicting house prices in Boston. The problem is as follows: we are given several demographic and geographical attributes, such as the crime rate or the pupil-teacher ratio in the neighborhood. The goal is to predict the median value of a house in a particular area. As in the case of classification, we have some training data and want to build a model that can be generalized to other data.
This is one of the built-in datasets that scikit-learn comes with, so it is very easy to load the data into memory:
from sklearn.datasets import load_boston
boston = load_boston()
The boston object contains several attributes; in particular, boston.data contains the input data and boston.target contains the price of houses in thousands of dollars.
We will start with a simple one-dimensional regression, trying to regress the price on a single attribute: the average number of rooms per dwelling in the neighborhood. We can use the standard least squares regression method you probably first saw in high school.
Our first attempt looks like this:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
We import LinearRegression from the sklearn.linear_model module and construct a LinearRegression object. This object will behave analogously to the classifier objects from scikit-learn that we used earlier.
The feature we will be using is stored at position 5. For now, we just use a single feature; later in the chapter, we will use all of them. You can consult boston.DESCR and boston.feature_names for detailed information on the data. The boston.target attribute contains the average house price (our target variable):
# Feature 5 is the number of rooms.
x = boston.data[:,5]
y = boston.target
x = np.transpose(np.atleast_2d(x))
lr.fit(x, y)
The only non-obvious line in this code block is the call to np.atleast_2d, which converts x from a one-dimensional to a two-dimensional array. This conversion is necessary as the fit method expects a two-dimensional array as its first argument. Finally, for the dimensions to work out correctly, we need to transpose this array.
Note that we are calling methods named fit and predict on the LinearRegression object, just as we did with classifier objects, even though we are now performing regression. This regularity in the API is one of the nicer features of scikit-learn.
We can easily plot the fit:
from matplotlib import pyplot as plt
fig,ax = plt.subplots()
ax.scatter(x, y)
xmin = x.min()
xmax = x.max()
ax.plot([xmin, xmax],
[lr.predict(xmin), lr.predict(xmax)],
'-', lw=2, color="#f9a602")
ax.set_xlabel("Average number of rooms (RM)")
ax.set_ylabel("House Price")
Refer to the following graph:
The preceding graph shows all the points (as dots) and our fit (the solid line). We can see that visually it looks good, except for a few outliers.
Ideally, though, we would like to measure how good a fit this is quantitatively. This will be critical in order to be able to compare alternative methods. To do so, we can measure how close our prediction is to the true values. For this task, we can use the mean_squared_error function from the sklearn.metrics module:
from sklearn.metrics import mean_squared_error
This function takes two arguments, the true values and the predictions, as follows:
mse = mean_squared_error(y, lr.predict(x))
print("Mean squared error (of training data): {:.3}".format(mse))
Mean squared error (of training data): 43.6
This value can sometimes be hard to interpret, and it's better to take the square root, to obtain the root mean square error (RMSE):
rmse = np.sqrt(mse)
print("RMSE (of training data): {:.3}".format(rmse))
RMSE (of training data): 6.6
One advantage of using RMSE is that we can quickly obtain a very rough estimate of the error by multiplying it by two. In our case, we can expect the estimated price to be different from the real price by, at most, 13,000 dollars.
Root mean squared error corresponds approximately to an estimate of the standard deviation. Since most data is at most two standard deviations from the mean, we can double our RMSE to obtain a rough confident interval. This is only completely valid if the errors are normally distributed, but it is often roughly correct even if they are not.
A number such as 6.6 is still hard to immediately understand without domain knowledge. Is this a good prediction? One possible way to answer this question is to compare it with the most simple baseline, the constant model. If we knew nothing of the input, the best we could do would be to predict that the output will always be the average value of y. We can then compare the mean-squared error of this model with the mean-squared error of the null model. This idea is formalized in the coefficient of determination, which is defined as follows:
In this formula, yi represents the value of the element with index i, while is the estimate for the same element obtained by the regression model. Finally, is the mean value of y, which represents the null model that always returns the same value. This is roughly the same as first computing the ratio of the mean-squared error with the variance of the output and, finally, considering one minus this ratio. This way, a perfect model obtains a score of one, while the null model obtains a score of zero. Note that it is possible to obtain a negative score, which means that the model is so poor that one is better off using the mean as a prediction.
The coefficient of determination can be obtained using the r2_score of the sklearn.metrics module:
from sklearn.metrics import r2_score
r2 = r2_score(y, lr.predict(x))
print("R2 (on training data): {:.2}".format(r2))
R2 (on training data): 0.48
This measure is also called the r2 score. If you are using linear regression and evaluating the error on the training data, then it does correspond to the square of the correlation coefficient, R. However, this measure is more general, and, as we discussed, may even return a negative value.
An alternative way to compute the coefficient of determination is to use the score method of the LinearRegression object:
r2 = lr.score(x,y)