Putting it all together
For our first new feature, we will determine whether the home team is generally better than the visitors. To do this, we will load the standings (also called a ladder in some sports) for the NBA from the previous season. A team will be considered better if it ranked higher in 2015 than the other team.
To obtain the standings data, perform the following steps:
- Navigate to http://www.basketball-reference.com/leagues/NBA_2015_standings.html in your web browser.
- Select Expanded Standings to get a single list for the entire league.
- Click on the Export link.
- Copy the text and save it in a text/CSV file called standings.csv in your data folder.
Back in your Jupyter Notebook, enter the following lines into a new cell. You'll need to ensure that the file was saved into the location pointed to by the data_folder variable. The code is as follows:
import os
import pandas as pd

standings_filename = os.path.join(data_folder, "standings.csv")
standings = pd.read_csv(standings_filename, skiprows=1)
You can view the first few rows of the ladder by entering the following into a new cell and running it:
standings.head()
The output is as follows:
Next, we create a new feature using a similar pattern to the previous feature. We iterate over the rows, looking up the standings for the home team and visitor team. The code is as follows:
dataset["HomeTeamRanksHigher"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    # A lower Rk value means a higher ranking
    dataset.at[index, "HomeTeamRanksHigher"] = int(home_rank < visitor_rank)
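The same lookup can also be expressed without an explicit loop by mapping each team name to its rank. The following is a sketch with made-up team names and toy data frames standing in for the real standings and dataset:

```python
import pandas as pd

# Toy stand-ins for the standings and games data (names are made up).
standings_demo = pd.DataFrame({"Team": ["Bulls", "Heat", "Lakers"],
                               "Rk": [1, 2, 3]})
games_demo = pd.DataFrame({"Home Team": ["Heat", "Bulls"],
                           "Visitor Team": ["Bulls", "Lakers"]})

# Build a team -> rank Series once, then map both columns in one pass.
rank_map = standings_demo.set_index("Team")["Rk"]
home_rank = games_demo["Home Team"].map(rank_map)
visitor_rank = games_demo["Visitor Team"].map(rank_map)

# A lower Rk value means a higher ranking.
games_demo["HomeTeamRanksHigher"] = (home_rank < visitor_rank).astype(int)
print(games_demo["HomeTeamRanksHigher"].tolist())  # [0, 1]
```

On larger datasets, this vectorized form avoids the per-row DataFrame lookups inside the loop.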
Next, we use the cross_val_score function to test the result. First, we extract the dataset:
X_homehigher = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values
Then, we create a new DecisionTreeClassifier and run the evaluation:
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
This now scores 60.9 percent, better than our previous result, and now better than just choosing the home team every time. Can we do better?
Next, let's test which of the two teams won their last match against each other. While rankings can give some hints on who won (the higher ranked team is more likely to win), sometimes teams play better against other teams. There are many reasons for this--for example, some teams may have strategies or players that work against specific teams really well. Following our previous pattern, we create a dictionary to store the winner of the past game and create a new feature in our data frame. The code is as follows:
from collections import defaultdict

last_match_winner = defaultdict(int)
dataset["HomeTeamWonLast"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team]))  # Sort for a consistent ordering
    # Set in the row who won the last encounter between these two teams
    home_team_won_last = 1 if last_match_winner[teams] == home_team else 0
    dataset.at[index, "HomeTeamWonLast"] = home_team_won_last
    # Who won this one?
    winner = home_team if row["HomeWin"] else visitor_team
    last_match_winner[teams] = winner
This feature works much like our previous rank-based feature. However, instead of looking up the ranks, this feature creates a tuple called teams and stores the previous result in a dictionary. When those two teams next play each other, it recreates the tuple and looks up the previous result. Our code doesn't differentiate between home games and visitor games, which might be a useful improvement to look at implementing.
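One way to sketch that improvement (with made-up games and a hypothetical feature name) is to key the dictionary on the ordered (home, visitor) pair rather than the sorted tuple, so each direction of the fixture is tracked separately:

```python
from collections import defaultdict
import pandas as pd

# Hypothetical variant: "Bulls at home vs Heat" is tracked separately
# from "Heat at home vs Bulls" by keying on the ordered pair.
games = pd.DataFrame({
    "Home Team":    ["Bulls", "Heat", "Bulls"],
    "Visitor Team": ["Heat", "Bulls", "Heat"],
    "HomeWin":      [True, False, False],
})
last_fixture_winner = defaultdict(int)
games["HomeWonLastFixture"] = 0
for index, row in games.iterrows():
    fixture = (row["Home Team"], row["Visitor Team"])  # ordered, not sorted
    won_last = 1 if last_fixture_winner[fixture] == row["Home Team"] else 0
    games.at[index, "HomeWonLastFixture"] = won_last
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_fixture_winner[fixture] = winner
print(games["HomeWonLastFixture"].tolist())  # [0, 0, 1]
```

Only the third game finds a previous result for its exact fixture (Bulls at home to the Heat), which the Bulls won.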
Next, we need to evaluate. The process is pretty similar to before, except we add the new feature into the extracted values:
X_lastwinner = dataset[["HomeTeamWonLast", "HomeTeamRanksHigher", "HomeLastWin", "VisitorLastWin"]].values
clf = DecisionTreeClassifier(random_state=14, criterion="entropy")
scores = cross_val_score(clf, X_lastwinner, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
This scores 62.2 percent. Our results are getting better and better.
Finally, we will check what happens if we throw a lot of data at the Decision Tree, and see if it can learn an effective model anyway. We will enter the teams into the tree and check whether a Decision Tree can learn to incorporate that information.
While decision trees are capable of learning from categorical features, the implementation in scikit-learn requires those features to be encoded as numbers instead of string values. We can use the LabelEncoder transformer to convert the string-based team names into assigned integer values. The code is as follows:
from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
encoding.fit(dataset["Home Team"].values)
home_teams = encoding.transform(dataset["Home Team"].values)
visitor_teams = encoding.transform(dataset["Visitor Team"].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
We should use the same transformer to encode both the home and visitor teams, so that the same team receives the same integer value whether it appears as the home team or the visitor team. While a mismatch would not be critical in this particular application, failing to do this can degrade the performance of future models.
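A small sketch (with made-up team names) shows why: two separately fitted encoders can assign the same team different integers, while one shared encoder cannot. Fitting the shared encoder on the combined list of names also guards against a team that happens to appear only in one column:

```python
from sklearn.preprocessing import LabelEncoder

home = ["Bulls", "Heat", "Bulls"]
visitor = ["Heat", "Lakers", "Heat"]

# One encoder fitted on every name seen in either column.
shared = LabelEncoder().fit(home + visitor)  # classes: Bulls=0, Heat=1, Lakers=2

# Two separately fitted encoders give "Heat" a different code in each column.
enc_home = LabelEncoder().fit(home)        # classes: Bulls=0, Heat=1
enc_visitor = LabelEncoder().fit(visitor)  # classes: Heat=0, Lakers=1
print(shared.transform(["Heat"])[0])       # 1, in both roles
print(enc_home.transform(["Heat"])[0], enc_visitor.transform(["Heat"])[0])  # 1 0
```

LabelEncoder assigns integers to class labels in sorted order, which is why the two separately fitted encoders disagree here.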
These integers can be fed into the Decision Tree, but they will still be interpreted as continuous features by DecisionTreeClassifier. For example, teams may be allocated integers from 0 to 16. The algorithm will see teams 1 and 2 as being similar, while teams 4 and 10 will be very different--but this makes no sense at all. All of the teams are different from each other--two teams are either the same or they are not!
To fix this inconsistency, we use the OneHotEncoder transformer to encode these integers into a number of binary features. Each binary feature represents a single value of the original feature. For example, if the NBA team Chicago Bulls is allocated the integer 7 by the LabelEncoder, then the seventh feature returned by the OneHotEncoder will be 1 if the team is the Chicago Bulls and 0 for every other team. This is done for every possible value, resulting in a much larger dataset. The code is as follows:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
X_teams = onehot.fit_transform(X_teams).toarray()
Next, we run the Decision Tree as before on the new dataset:
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
This scores an accuracy of 62.8 percent. The score is better still, even though the information given is just the teams playing. It is possible that the larger number of features was not handled properly by the decision tree. For this reason, we will try changing the algorithm and see if that helps. Data mining can be an iterative process of trying new algorithms and features.
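As a sketch of that iteration, one option is scikit-learn's RandomForestClassifier, an ensemble of randomized trees that often copes better with many sparse binary features than a single tree. Synthetic stand-ins for X_teams and y_true are used here so the snippet runs on its own; in the notebook you would pass the real arrays from the steps above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for X_teams / y_true, just so this cell is runnable.
rng = np.random.RandomState(14)
X_demo = rng.randint(0, 2, size=(60, 10))
y_demo = rng.randint(0, 2, size=60)

# Same evaluation pattern as before, with a different classifier swapped in.
clf = RandomForestClassifier(random_state=14, n_estimators=100)
scores = cross_val_score(clf, X_demo, y_demo, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
```

Because the evaluation code is identical, trying a new algorithm is a one-line change.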