Managing categorical data
In many classification problems, the target dataset is made up of categorical labels that cannot be processed directly by most algorithms. An encoding is needed, and scikit-learn offers at least two valid options. Let's consider a very small dataset made of 10 samples, each with two numerical features and a categorical label:
import numpy as np
>>> X = np.random.uniform(0.0, 1.0, size=(10, 2))
>>> Y = np.random.choice(('Male','Female'), size=(10))
>>> X[0]
array([ 0.8236887 , 0.11975305])
>>> Y[0]
'Male'
The first option is to use the LabelEncoder class, which adopts a dictionary-oriented approach: each category label is associated with a progressive integer, which is its index in an instance array called classes_:
from sklearn.preprocessing import LabelEncoder
>>> le = LabelEncoder()
>>> yt = le.fit_transform(Y)
>>> print(yt)
[0 0 0 1 0 1 1 0 0 1]
>>> le.classes_
array(['Female', 'Male'], dtype='|S6')
The inverse transformation can be obtained in this simple way:
>>> output = [1, 0, 1, 1, 0, 0]
>>> decoded_output = [le.classes_[i] for i in output]
>>> print(decoded_output)
['Male', 'Female', 'Male', 'Male', 'Female', 'Female']
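The same decoding can also be obtained with the inverse_transform() method; a minimal sketch, reusing the le instance fitted above:
>>> le.inverse_transform([1, 0, 1, 1, 0, 0])
array(['Male', 'Female', 'Male', 'Male', 'Female', 'Female'], dtype='|S6')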
This approach is simple and works well in many cases, but it has a drawback: all labels are turned into sequential numbers. A classifier that works with real values will then treat numerically close labels as similar, even though the categories carry no such ordering. For this reason, it's often preferable to use so-called one-hot encoding, which binarizes the data. For labels, it can be achieved using the LabelBinarizer class:
from sklearn.preprocessing import LabelBinarizer
>>> lb = LabelBinarizer()
>>> Yb = lb.fit_transform(Y)
>>> Yb
array([[1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1]])
>>> lb.inverse_transform(Yb)
array(['Male', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male',
       'Male', 'Male'], dtype='|S6')
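Note that, with only two distinct labels, LabelBinarizer produces a single 0/1 column (as shown above) rather than one column per class; the label-to-column mapping is stored in the classes_ attribute. A minimal check, reusing the lb instance fitted above:
>>> lb.classes_
array(['Female', 'Male'], dtype='|S6')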
In this case, each categorical label is first turned into a positive integer and then transformed into a vector in which only one feature is 1 while all the others are 0. This means, for example, that the output of a softmax distribution, with a peak corresponding to the predicted class, can easily be turned into a discrete vector in which the only non-null element corresponds to the right class. The following snippet is schematic: it assumes a multiclass label set and an already trained classifier, model, whose output is a probability vector. For example:
import numpy as np
>>> Y = lb.fit_transform(Y)
array([[0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0]])
>>> Yp = model.predict(X[0])
array([[0.002, 0.991, 0.001, 0.005, 0.001]])
>>> Ypr = np.round(Yp)
array([[ 0., 1., 0., 0., 0.]])
>>> lb.inverse_transform(Ypr)
array(['Female'], dtype='|S6')
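When the rounded vector could contain ties or all zeros, an equivalent and more robust decoding is to take the argmax of the probability vector and index classes_ directly. A minimal sketch, assuming Yp holds the softmax output shown above:
import numpy as np
>>> idx = np.argmax(Yp, axis=1)
>>> lb.classes_[idx]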
Another approach can be adopted when the categorical features are structured like a list of dictionaries (not necessarily dense; each dictionary can contain values for only a few features). For example:
data = [
{ 'feature_1': 10.0, 'feature_2': 15.0 },
{ 'feature_1': -5.0, 'feature_3': 22.0 },
{ 'feature_3': -2.0, 'feature_4': 10.0 }
]
In this case, scikit-learn offers the DictVectorizer and FeatureHasher classes; they both produce sparse matrices of real numbers that can be fed into any machine learning model. The latter has limited memory consumption and relies on MurmurHash3 (see https://en.wikipedia.org/wiki/MurmurHash for further information). The code for these two methods is shown as follows:
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
>>> dv = DictVectorizer()
>>> Y_dict = dv.fit_transform(data)
>>> Y_dict.todense()
matrix([[ 10.,  15.,   0.,   0.],
        [ -5.,   0.,  22.,   0.],
        [  0.,   0.,  -2.,  10.]])
>>> dv.vocabulary_
{'feature_1': 0, 'feature_2': 1, 'feature_3': 2, 'feature_4': 3}
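DictVectorizer can also map a matrix back to the original dictionary representation through inverse_transform(); a minimal sketch, reusing the dv instance fitted above (zero entries are dropped):
>>> dv.inverse_transform(Y_dict)
[{'feature_1': 10.0, 'feature_2': 15.0},
 {'feature_1': -5.0, 'feature_3': 22.0},
 {'feature_3': -2.0, 'feature_4': 10.0}]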
>>> fh = FeatureHasher()
>>> Y_hashed = fh.fit_transform(data)
>>> Y_hashed.todense()
matrix([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
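By default, FeatureHasher projects onto 2**20 columns, which explains the very wide matrix above; the output dimensionality can be reduced through the n_features parameter, at the cost of a higher chance of hash collisions. A minimal sketch (fh_small is only an illustrative name):
>>> fh_small = FeatureHasher(n_features=8)
>>> Y_hashed_small = fh_small.fit_transform(data)
>>> Y_hashed_small.todense()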
In both cases, I suggest you read the original scikit-learn documentation to learn about all the possible options and parameters.
When working with categorical features (normally converted into positive integers through LabelEncoder), it's also possible to apply one-hot encoding to selected columns of the dataset using the OneHotEncoder class. In the following example, the first feature is a binary index which indicates 'Male' or 'Female':
from sklearn.preprocessing import OneHotEncoder
>>> data = [
[0, 10],
[1, 11],
[1, 8],
[0, 12],
[0, 15]
]
>>> oh = OneHotEncoder(categorical_features=[0])
>>> Y_oh = oh.fit_transform(data)
>>> Y_oh.todense()
matrix([[  1.,   0.,  10.],
        [  0.,   1.,  11.],
        [  0.,   1.,   8.],
        [  1.,   0.,  12.],
        [  1.,   0.,  15.]])
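Be aware that the categorical_features parameter was deprecated in scikit-learn 0.20 and removed in later releases; in recent versions, the recommended way to one-hot encode only selected columns is to combine OneHotEncoder with ColumnTransformer. A minimal sketch of the equivalent transformation (the names ct and 'first' are illustrative; Y_ct contains the same columns as Y_oh above):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
>>> ct = ColumnTransformer([('first', OneHotEncoder(), [0])],
...                        remainder='passthrough')
>>> Y_ct = ct.fit_transform(data)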
Considering that these approaches can greatly increase the number of features (especially the binary one-hot versions), all these classes adopt sparse matrices based on the SciPy implementation. See https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html for further information.