Learning Data Mining with Python(Second Edition)

Standard pre-processing

The pre-processing we will perform for this experiment is called feature-based normalization, which uses scikit-learn's MinMaxScaler class. Continuing with the Jupyter Notebook from the rest of this chapter, first, we import this class:

from sklearn.preprocessing import MinMaxScaler

This class takes each feature and scales it to the range 0 to 1. This pre-processor replaces the minimum value with 0, the maximum with 1, and the other values somewhere in between based on a linear mapping.
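This linear mapping can be sketched directly with NumPy; the array below is illustrative, not the chapter's dataset:

```python
import numpy as np

# One feature (column) with minimum 2 and maximum 10
x = np.array([2.0, 4.0, 10.0])

# Linear mapping: minimum -> 0, maximum -> 1, others scaled in between
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # [0.   0.25 1.  ]
```

This is the same computation MinMaxScaler performs per feature, with the minimum and maximum learned during fitting.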

To apply our pre-processor, we run the transform function on it. Transformers often need to be trained first, in the same way that classifiers do. We can combine these steps by running the fit_transform function instead:

X_transformed = MinMaxScaler().fit_transform(X)

Here, X_transformed will have the same shape as X. However, each column will have a maximum of 1 and a minimum of 0.
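As a quick check, we can apply the scaler to a small illustrative array (not the chapter's X) and inspect the per-column minima and maxima:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: three samples, two features with different ranges
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [5.0, 200.0]])

X_transformed = MinMaxScaler().fit_transform(X)

# Same shape as X; each column now spans the range [0, 1]
print(X_transformed.shape)        # (3, 2)
print(X_transformed.min(axis=0))  # [0. 0.]
print(X_transformed.max(axis=0))  # [1. 1.]
```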

There are various other forms of normalization, each effective for different applications and feature types:

  • Ensure the sum of the values for each sample equals 1, using sklearn.preprocessing.Normalizer with norm='l1' (the default, norm='l2', scales each sample to unit length instead)
  • Force each feature to have a zero mean and a variance of 1, using sklearn.preprocessing.StandardScaler, which is a commonly used starting point for normalization
  • Turn numerical features into binary features, where any value above a threshold is 1 and any below is 0, using sklearn.preprocessing.Binarizer
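A brief sketch of the three pre-processors above, applied to a small illustrative array (values chosen only to make each effect visible):

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, Binarizer

X = np.array([[1.0, 3.0],
              [2.0, 6.0]])

# Normalizer with norm='l1': each sample (row) sums to 1
X_l1 = Normalizer(norm='l1').fit_transform(X)
print(X_l1.sum(axis=1))  # [1. 1.]

# StandardScaler: each feature (column) gets zero mean, unit variance
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # [0. 0.]

# Binarizer: values above the threshold become 1, the rest 0
X_bin = Binarizer(threshold=2.5).fit_transform(X)
print(X_bin)  # [[0. 1.]
              #  [0. 1.]]
```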

We will use combinations of these pre-processors in later chapters, along with other types of transformer objects.

Pre-processing is a critical step in the data mining pipeline and one that can mean the difference between a poor result and a great one.