How to do it…
Let's see how to evaluate cars based on their characteristics:
- We will use the car.py file that we already provided to you as reference. Let's go ahead and import a couple of packages:
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
- Let's load the dataset:
input_file = 'car.data.txt'
# Read the data, one comma-separated record per line
X = []
with open(input_file, 'r') as f:
    for line in f:
        # Drop the trailing newline, then split the record into fields
        data = line.rstrip('\n').split(',')
        X.append(data)
X = np.array(X)
Each line contains a comma-separated list of words. Therefore, we parse the input file, split each line, and then append the resulting list to the main data. We drop the newline character at the end of each line. The scikit-learn estimators expect numerical input, so we need to transform these string attributes into numbers that the classifier can work with.
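The parsing step can be tried without the data file. The following is a minimal sketch that assumes two made-up rows in the same comma-separated layout; the `sample_text` string and the variable names are hypothetical and are not part of car.py:

```python
import io
import numpy as np

# Hypothetical sample rows in the same layout as the dataset
# (six attributes followed by the class label); not read from car.data.txt
sample_text = "vhigh,vhigh,2,2,small,low,unacc\nhigh,low,4,more,big,high,acc\n"

rows = []
with io.StringIO(sample_text) as f:
    for line in f:
        line = line.strip()      # remove the trailing newline
        if not line:             # skip blank lines, if any
            continue
        rows.append(line.split(','))

data = np.array(rows)
print(data.shape)
print(data[1, -1])
```

Every field is still a string at this point, which is why the encoding step that follows is needed.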
- In the previous chapter, we discussed label encoding. That is what we will use here to convert strings to numbers:
# Convert string data to numerical data
label_encoder = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    label_encoder.append(preprocessing.LabelEncoder())
    X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])
X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)
As each attribute can take only a limited number of values, we can use a label encoder to transform them into numbers. We need a separate label encoder for each attribute, because each one learns its own mapping from strings to integers. For example, the lug_boot attribute can take three distinct values, and we need a label encoder that knows how to encode this particular attribute. The last value on each line is the class, so we assign it to the y variable.
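To see what a per-attribute encoder learns, here is a small sketch on two hand-written columns. The sample values are assumptions chosen for illustration. Note that LabelEncoder numbers the classes in sorted order, so the integer codes do not reflect any natural ordering such as small < med < big:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical values for two attributes; illustration only
lug_boot = np.array(['small', 'med', 'big', 'med'])
safety = np.array(['low', 'high', 'med', 'low'])

encoders = []
encoded_columns = []
for column in (lug_boot, safety):
    encoder = preprocessing.LabelEncoder()   # one encoder per attribute
    encoded_columns.append(encoder.fit_transform(column))
    encoders.append(encoder)

lug_boot_classes = list(encoders[0].classes_)
lug_boot_codes = encoded_columns[0].tolist()
safety_codes = encoded_columns[1].tolist()

print(lug_boot_classes)   # classes in sorted order
print(lug_boot_codes)     # integer code for each original value
print(safety_codes)
```

The same string ('med') maps to different integers in the two columns, which is why one shared encoder would not work.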
- Let's train the classifier:
# Build a Random Forest classifier
params = {'n_estimators': 200, 'max_depth': 8, 'random_state': 7}
classifier = RandomForestClassifier(**params)
classifier.fit(X, y)
You can play around with the n_estimators and max_depth parameters to see how they affect classification accuracy. We will actually do this soon in a standardized way.
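As a rough way to experiment, the sketch below tries a few combinations of the two parameters and scores each on a held-out split. It uses synthetic data from make_classification as a stand-in for the car dataset, and the parameter values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the car dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7)

results = {}
for n_estimators in (10, 50):
    for max_depth in (2, 8):
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth, random_state=7)
        model.fit(X_train, y_train)
        # Accuracy on the held-out portion of the data
        results[(n_estimators, max_depth)] = model.score(X_test, y_test)

for (n_estimators, max_depth), acc in results.items():
    print(f"n_estimators={n_estimators}, max_depth={max_depth}: {acc:.3f}")
```

The numbers printed here say nothing about the car dataset itself; the point is only the pattern of varying one setting at a time and comparing scores.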
- Let's perform cross-validation:
# Cross validation
from sklearn import model_selection
accuracy = model_selection.cross_val_score(classifier,
        X, y, scoring='accuracy', cv=3)
print("Accuracy of the classifier: " + str(round(100*accuracy.mean(), 2)) + "%")
Once we train the classifier, we need to see how it performs. We use three-fold cross-validation to calculate the accuracy here. The following result is returned:
Accuracy of the classifier: 78.19%
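It can help to look at the individual fold scores before averaging them. The following sketch again relies on synthetic data as a stand-in (an assumption), so its numbers are unrelated to the result above. cross_val_score returns one score per fold and fits clones of the estimator, leaving the object you pass in untouched:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the car dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=4, random_state=7)

model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=7)

# One accuracy value per fold; cv=3 gives an array of length 3
fold_scores = cross_val_score(model, X_demo, y_demo,
                              scoring='accuracy', cv=3)

print("Per-fold accuracy:", fold_scores.round(3))
print("Mean accuracy: " + str(round(100 * fold_scores.mean(), 2)) + "%")
```

A large spread between folds would suggest that the mean alone is not a reliable summary.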
- One of the main goals of building a classifier is to use it on individual, previously unseen data instances. Let's take a single datapoint and see how we can use this classifier to categorize it:
# Testing encoding on single data instance
input_data = ['high', 'low', '2', 'more', 'med', 'high']
input_data_encoded = [-1] * len(input_data)
for i, item in enumerate(input_data):
    # transform() returns a one-element array, so take its first entry
    input_data_encoded[i] = int(label_encoder[i].transform([item])[0])
input_data_encoded = np.array(input_data_encoded)
The first step is to convert the datapoint into numerical data. We need to use the same label encoders that we used during training so that the encoding is consistent. If the input datapoint contains a value that an encoder did not see during training, the encoder's transform method raises a ValueError because it has no mapping for that value. For example, if you change the first value in the list from high to abcd, the label encoder fails because it doesn't know how to interpret this string. This acts as an error check on whether the input datapoint is valid.
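This error-check behavior can be sketched in isolation. The encoder below is fitted on a hand-written list of values (an assumption for illustration), and encode_or_none is a hypothetical helper, not something defined in car.py:

```python
from sklearn import preprocessing

# Fit an encoder on a hypothetical set of values for one attribute
encoder = preprocessing.LabelEncoder()
encoder.fit(['low', 'med', 'high', 'vhigh'])

def encode_or_none(value):
    """Return the integer code for value, or None if it was never seen."""
    try:
        return int(encoder.transform([value])[0])
    except ValueError:
        # Raised by LabelEncoder for labels absent from the training data
        return None

known = encode_or_none('high')
unknown = encode_or_none('abcd')
print(known, unknown)
```

Whether to return None, raise, or fall back to a default is a design choice that depends on how invalid input should be handled in your application.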
- We are now ready to predict the output class for this datapoint:
# Predict and print output for a particular datapoint
output_class = classifier.predict([input_data_encoded])
print("Output class:", label_encoder[-1].inverse_transform(output_class)[0])
We use the predict() method to estimate the output class. If we output the encoded output label, it won't mean anything to us. Therefore, we use the inverse_transform method to convert this label back to its original form and print out the output class. The following result is returned:
Output class: acc
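The decoding step can also be tried on its own. This sketch assumes a class encoder fitted on four hand-written labels and a made-up array of predicted codes, so no classifier is involved:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical class labels; the encoder stores them in sorted order
class_encoder = preprocessing.LabelEncoder()
class_encoder.fit(['unacc', 'acc', 'good', 'vgood'])

# Made-up integer predictions standing in for classifier.predict() output
predicted_codes = np.array([0, 2])

# Map the integer codes back to the original string labels
decoded = class_encoder.inverse_transform(predicted_codes)
print(decoded.tolist())
```

Because inverse_transform takes and returns an array, indexing with [0] is needed when only a single prediction is of interest.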