
Preprocessing

When taking measurements of real-world objects, we can often get features in different ranges. For instance, if we measure the qualities of an animal, we might have several features, as follows:

  • Number of legs: This ranges from zero to eight for most animals, though some have many more!
  • Weight: This ranges from just a few micrograms all the way up to a blue whale weighing 190,000 kilograms!
  • Number of hearts: This can be anywhere from zero up to five, in the case of the earthworm.

For a mathematically based algorithm to compare each of these features, the differences in scale, range, and units can be difficult to interpret. If we used the above features in many algorithms, the weight would probably be the most influential feature, due only to its larger values and not to any actual effectiveness of the feature.

One possible strategy is to normalize the features so that they all have the same range, or to turn the values into categories such as small, medium, and large. Suddenly, the large differences in scale between the types of features have less of an impact on the algorithm, which can lead to large increases in accuracy.
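As a quick sketch of the categorization idea, we can bin a raw feature into small, medium, and large groups with NumPy's digitize function. The weights and bin edges below are hypothetical choices for illustration only:

import numpy as np

# Hypothetical animal weights in kilograms
weights = np.array([0.000002, 4.5, 62.0, 190000.0])
# Illustrative bin edges: below 1 kg is small, 1-100 kg is medium, above is large
bins = [1.0, 100.0]
categories = np.digitize(weights, bins)
labels = np.array(['small', 'medium', 'large'])
print(labels[categories])  # ['small' 'medium' 'medium' 'large']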

Pre-processing can also be used to select only the more effective features, create new features, and so on. Pre-processing in scikit-learn is done through Transformer objects, which take a dataset in one form and return an altered dataset after some transformation of the data. These don't have to be numerical, as Transformers are also used to extract features; however, in this section, we will stick with pre-processing.
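Every Transformer follows the same interface: fit learns any parameters it needs from the data, and transform applies the change (fit_transform does both in one step). Here is a minimal sketch of that pattern, using scikit-learn's MinMaxScaler on a small, made-up array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A made-up dataset where the second feature has a much larger scale
X_example = np.array([[1.0, 200.0],
                      [2.0, 400.0],
                      [3.0, 800.0]])
scaler = MinMaxScaler()
# fit learns each feature's minimum and maximum; transform rescales to 0-1
X_scaled = scaler.fit_transform(X_example)
print(X_scaled)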

We can show an example of the problem by breaking the Ionosphere dataset. While this is only an example, many real-world datasets have problems of this form.

  1. First, we create a copy of the array so that we do not alter the original dataset:
X_broken = np.array(X)
  2. Next, we break the dataset by dividing every second feature by 10:
X_broken[:,::2] /= 10

In theory, this should not have a great effect on the result; after all, the relative values within each feature are unchanged. The major issue is that the scale has changed, so the odd-indexed features are now relatively larger than the even-indexed ones. We can see the effect of this by computing the accuracy:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("The original average accuracy is {0:.1f}%".format(np.mean(original_scores) * 100))
broken_scores = cross_val_score(estimator, X_broken, y, scoring='accuracy')
print("The 'broken' average accuracy is {0:.1f}%".format(np.mean(broken_scores) * 100))

This testing methodology gives a score of 82.3 percent on the original dataset, which drops to 71.5 percent on the broken dataset. We can fix this by scaling all the features to the range 0 to 1.
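One way to perform this scaling is with scikit-learn's MinMaxScaler Transformer. As a minimal sketch (reusing the estimator, X_broken, and y from above), we can transform the broken dataset and re-run the evaluation:

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature of the broken dataset to the range 0 to 1
X_transformed = MinMaxScaler().fit_transform(X_broken)
transformed_scores = cross_val_score(estimator, X_transformed, y, scoring='accuracy')
print("The average accuracy after scaling is {0:.1f}%".format(np.mean(transformed_scores) * 100))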