Learning Data Mining with Python (Second Edition)

Moving towards a standard workflow

Estimators in scikit-learn have two main functions: fit() and predict(). We train the algorithm using the fit() method on our training set, and we evaluate it using the predict() method on our testing set.

  1. First, we need to create these training and testing sets. As before, import and run the train_test_split function:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)
  2. Then, we import the nearest neighbor class and create an instance of it. We leave the parameters at their defaults for now and will test other values later in this chapter. By default, the algorithm will choose the five nearest neighbors to predict the class of a testing sample:

from sklearn.neighbors import KNeighborsClassifier 
estimator = KNeighborsClassifier()
  3. After creating our estimator, we must then fit it on our training dataset. For the nearest neighbor class, this training step simply records our dataset, allowing us to find the nearest neighbors of a new data point by comparing that point to the training dataset (see the neighbor-inspection sketch after these steps):
estimator.fit(X_train, y_train)
  4. We then use the trained model to predict the classes of our testing set and evaluate the accuracy of those predictions:
import numpy as np

y_predicted = estimator.predict(X_test) 
accuracy = np.mean(y_test == y_predicted) * 100
print("The accuracy is {0:.1f}%".format(accuracy))

This model scores 86.4 percent accuracy, which is impressive for a default algorithm and just a few lines of code! Most scikit-learn default parameters are chosen deliberately to work well with a range of datasets. However, you should always aim to choose parameters based on knowledge of the application and the experiment at hand. We will cover strategies for performing this kind of parameter search in later chapters; the sketch below gives a first taste.
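
As a brief preview of that kind of search, the sketch below uses scikit-learn's GridSearchCV to try a range of values for n_neighbors under cross-validation. The candidate range of 1 to 20 is an arbitrary illustration, not a tuned recommendation:

from sklearn.model_selection import GridSearchCV

# Try neighborhood sizes from 1 to 20 (an illustrative range) and keep
# the value that gives the best cross-validated accuracy.
parameter_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), parameter_grid, cv=5)
grid.fit(X_train, y_train)
print("Best n_neighbors:", grid.best_params_['n_neighbors'])
print("Best cross-validation score: {0:.1f}%".format(grid.best_score_ * 100))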