Linear regression with scikit-learn and higher dimensionality

scikit-learn offers the class LinearRegression, which works with n-dimensional spaces. For this purpose, we're going to use the Boston dataset:

from sklearn.datasets import load_boston

>>> boston = load_boston()

>>> boston.data.shape
(506L, 13L)
>>> boston.target.shape
(506L,)

It has 506 samples with 13 input features and one output. In the following figure, there's a collection of the plots of the first 12 features:

When working with datasets, it's useful to have a tabular view to manipulate data. pandas is a perfect framework for this task, and even though it's beyond the scope of this book, I suggest you create a data frame with the command pandas.DataFrame(boston.data, columns=boston.feature_names) and use Jupyter to visualize it. For further information, refer to Heydt M., Learning pandas - Python Data Discovery and Analysis Made Easy, Packt.
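As a minimal sketch of the tabular view suggested above: load_boston has been removed from recent scikit-learn releases, so this example assumes the bundled diabetes dataset as a stand-in, and the variable names (dataset, df, summary) are purely illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): any Bunch exposing data and feature_names works the same way
dataset = load_diabetes()

# One column per input feature, plus the regression target as an extra column
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target

# Per-column statistics make different scales and possible outliers easy to spot
summary = df.describe()
print(summary.loc[['mean', 'std', 'min', 'max']])
```

In Jupyter, evaluating df.head() in a cell renders the first rows as a formatted table.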

There are different scales and outliers (which can be removed using the methods studied in the previous chapters), so it's better to ask the model to normalize the data before processing it. Moreover, for testing purposes, we split the original dataset into training (90%) and test (10%) sets:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

>>> X_train, X_test, Y_train, Y_test = train_test_split(boston.data, boston.target, test_size=0.1)

>>> lr = LinearRegression(normalize=True)
>>> lr.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
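The normalize parameter is no longer available in LinearRegression in recent scikit-learn releases. A hedged sketch of a similar workflow puts a scaler in front of the regression with a pipeline. Note that StandardScaler is a swapped-in technique (zero mean and unit variance per feature), not an exact equivalent of normalize=True, and that the synthetic data produced by make_regression is only an assumed stand-in for the Boston dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 506 samples and 13 features, like the Boston dataset
X, Y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1000)

# 90% training / 10% test; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1000)

# The scaler is fitted on the training data only, then the regression is fitted on the scaled data
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, Y_train)

print(X_train.shape, X_test.shape)
print(model.score(X_test, Y_test))
```

Keeping the scaler inside the pipeline means the test samples are transformed with the statistics learned from the training set, so no information leaks from the test set into the fit.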

When the original dataset isn't large enough, splitting it into training and test sets may reduce the number of samples that can be used for fitting the model. k-fold cross-validation can help solve this problem with a different strategy. The whole dataset is split into k folds; k-1 folds are used for training and the remaining one for validating the model. k iterations are performed, each time with a different validation fold. In the following figure, there's an example with 3 folds/iterations:

In this way, the final score can be determined as the average of all values, and every sample is selected for training k-1 times.
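The splitting strategy can be sketched with the KFold class from sklearn.model_selection. The nine toy samples and the train_counts array are assumptions made only to keep the folds readable and to count how often each sample is used for training.

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine toy samples (assumption) so that the three folds are easy to read
X = np.arange(9).reshape(-1, 1)

kf = KFold(n_splits=3)
train_counts = np.zeros(len(X), dtype=int)

for i, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each iteration trains on k-1 folds and validates on the remaining one
    print('Iteration %d: train=%s validation=%s' % (i, train_idx, val_idx))
    train_counts[train_idx] += 1

# With k=3, every sample ends up in a training set exactly k-1 = 2 times
print(train_counts)
```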

To check the accuracy of a regression, scikit-learn provides the internal method score(X, y) which evaluates the model on test data:

>>> lr.score(X_test, Y_test)
0.77371996006718879
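For regressors, score(X, y) returns the coefficient of determination (R², discussed later in this section) rather than a classification accuracy. The following sketch checks this against r2_score from sklearn.metrics; the synthetic dataset and the names score_value and r2_value are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split as in the text
X, Y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

lr = LinearRegression()
lr.fit(X_train, Y_train)

# Both values are the R² of the predictions on the test set
score_value = lr.score(X_test, Y_test)
r2_value = r2_score(Y_test, lr.predict(X_test))
print(score_value, r2_value)
```

On the Boston split used in the text, the value to interpret is the one returned by lr.score(X_test, Y_test).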

So the overall score is about 0.77, which is an acceptable result considering the non-linearity of the original dataset, but it can also be influenced by the subdivision made by train_test_split (as in our case). For k-fold cross-validation, we can instead use the function cross_val_score(), which works with all scikit-learn estimators, not only classifiers. The scoring parameter is very important because it determines which metric will be adopted for tests. As LinearRegression works with ordinary least squares, we preferred the negative mean squared error, which is an absolute measure expressed in squared units of the target, so it must be evaluated against the scale of the actual values (it's not relative).

from sklearn.model_selection import cross_val_score

>>> scores = cross_val_score(lr, boston.data, boston.target, cv=7, scoring='neg_mean_squared_error')
>>> scores
array([ -11.32601065,  -10.96365388,  -32.12770594,  -33.62294354,
        -10.55957139, -146.42926647,  -12.98538412])

>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
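Because the scores are negated mean squared errors, it can help to flip the sign and take a square root, which gives an error in the same units as the target. This is a sketch on synthetic data (an assumption, as are the names mse_per_fold and rmse), not a reproduction of the Boston results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption)
X, Y = make_regression(n_samples=210, n_features=13, noise=5.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, Y, cv=7, scoring='neg_mean_squared_error')

# Scores are negated MSE values (higher is better), so flip the sign before taking the root
mse_per_fold = -scores
rmse = np.sqrt(mse_per_fold.mean())

print(mse_per_fold)
# Comparing the RMSE with the spread of the target puts the error into perspective
print(rmse, Y.std())
```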

Another very important metric used in regressions is called the coefficient of determination, or R². It measures the proportion of the variance of the target values that is explained by the model. We define the residuals as the following quantity:

$$r_i = y_i - \tilde{y}_i$$

In other words, a residual is the difference between the sample $y_i$ and the prediction $\tilde{y}_i$. So R² is defined as follows:

$$R^2 = 1 - \frac{\sum_i r_i^2}{\sum_i (y_i - \bar{y})^2}$$

Here $\bar{y}$ is the mean of the samples.

For our purposes, R² values close to 1 mean an almost perfect regression, while values close to 0 (or negative) imply a bad model. Using this metric is quite easy with cross-validation:

>>> cross_val_score(lr, X, Y, cv=10, scoring='r2').mean()
0.75
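The definition of R² in terms of residuals can also be checked by hand. The following is a minimal pure-Python sketch; the function name r2 and the toy values are assumptions chosen so that the three typical cases (perfect, baseline, and worse than baseline) are easy to verify.

```python
def r2(y_true, y_pred):
    """Coefficient of determination computed from the residuals."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared differences between samples and predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviations of the samples from their mean
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2(y_true, [3.0, 5.0, 7.0, 9.0])    # predictions equal to the samples
baseline = r2(y_true, [6.0, 6.0, 6.0, 6.0])   # always predicting the mean
worse = r2(y_true, [9.0, 7.0, 5.0, 3.0])      # worse than predicting the mean

print(perfect, baseline, worse)  # prints: 1.0 0.0 -3.0
```

A model that always predicts the mean of the samples scores 0, which is why negative values indicate a fit that is worse than this trivial baseline.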