Standard pre-processing
The pre-processing we will perform for this experiment is called feature-based normalization, which we perform using scikit-learn's MinMaxScaler class. Continuing with the Jupyter Notebook from the rest of this chapter, first, we import this class:
from sklearn.preprocessing import MinMaxScaler
This class takes each feature and scales it to the range 0 to 1. This pre-processor replaces the minimum value with 0, the maximum with 1, and the other values somewhere in between based on a linear mapping.
To apply our pre-processor, we call its transform function. Transformers often need to be trained first, in the same way that classifiers do. We can combine both steps by calling the fit_transform function instead:
X_transformed = MinMaxScaler().fit_transform(X)
Here, X_transformed will have the same shape as X. However, each column will have a maximum of 1 and a minimum of 0.
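We can verify this behavior on a small made-up matrix (the array values here are invented purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A tiny example dataset: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0]])

# fit_transform learns each column's minimum and maximum, then
# linearly maps the column onto the range 0 to 1
X_transformed = MinMaxScaler().fit_transform(X)

print(X_transformed.min(axis=0))  # [0. 0.]
print(X_transformed.max(axis=0))  # [1. 1.]
```

Note that values between the extremes are mapped proportionally: in the first column, 2 sits one third of the way from the minimum (1) to the maximum (4), so it becomes approximately 0.333.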
There are various other forms of normalizing in this way, each effective for different applications and feature types:
- Ensure the sum of the absolute values for each sample equals 1, using sklearn.preprocessing.Normalizer (with norm='l1')
- Force each feature to have a zero mean and a variance of 1, using sklearn.preprocessing.StandardScaler, which is a commonly used starting point for normalization
- Turn numerical features into binary features, where any value above a threshold is 1 and any below is 0, using sklearn.preprocessing.Binarizer
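The three pre-processors above all follow the same fit_transform pattern. A minimal sketch, using an invented feature matrix for illustration:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, Binarizer

X = np.array([[1.0, 3.0],
              [2.0, 2.0],
              [0.0, 4.0]])

# Normalizer works per *sample* (row): with the l1 norm, each row's
# absolute values sum to 1 after transformation
row_normalized = Normalizer(norm='l1').fit_transform(X)

# StandardScaler works per *feature* (column): each column ends up
# with zero mean and unit variance
standardized = StandardScaler().fit_transform(X)

# Binarizer thresholds every value: above the threshold becomes 1,
# at or below becomes 0
binarized = Binarizer(threshold=1.5).fit_transform(X)
print(binarized)  # [[0. 1.], [1. 1.], [0. 1.]]
```

Note the difference in orientation: Normalizer scales rows, while StandardScaler (like MinMaxScaler) scales columns, so the right choice depends on whether you want to compare samples or features.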
We will use combinations of these pre-processors in later chapters, along with other types of transformer objects.
Pre-processing is a critical step in the data mining pipeline, and one that can mean the difference between a bad result and a great one.