Learning Data Mining with Python（Second Edition）

Understanding the Apriori algorithm and its implementation

The goal of this chapter is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also discuss extensions in which a person who recommends a set of movies is likely to recommend another particular movie.
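To make the shape of such a rule concrete, here is a minimal sketch in plain Python. The movie IDs, the user data, and the names `premise`, `conclusion`, and `favorable_by_user` are hypothetical and chosen only for illustration; they are not the chapter's own code.

```python
# Hypothetical rule: "if a person recommends movies 10 and 20,
# they will also recommend movie 30".
premise = frozenset({10, 20})
conclusion = 30

# Hypothetical data: the movies each user reviewed favorably.
favorable_by_user = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 30}),
    4: frozenset({10, 20, 30, 40}),
}

# The rule applies to a user only if they recommend every movie in the premise.
applies = [u for u, movies in favorable_by_user.items() if premise.issubset(movies)]

# Among those users, the rule holds if they also recommend the conclusion.
holds = [u for u in applies if conclusion in favorable_by_user[u]]

print(f"Rule applies to users {applies} and holds for users {holds}")
```

In this toy data the rule applies to three users and holds for two of them, which is the kind of evidence used to judge how reliable a rule is.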

To do this, we first need to determine whether a person recommends a movie. We can do this by creating a new feature, Favorable, which is True if the person gave the movie a favorable review, defined here as a rating above 3:

all_ratings["Favorable"] = all_ratings["Rating"] > 3

We can see the new feature by viewing the dataset:

all_ratings[10:15]
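The thresholding step can be tried on a small, made-up table. This is only a sketch: `toy_ratings` and its values are hypothetical, with column names chosen to mirror the ones used above.

```python
import pandas as pd

# Hypothetical toy ratings; the column names mirror those used in the chapter.
toy_ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "MovieID": [10, 20, 10, 30, 20],
    "Rating": [5, 3, 4, 2, 4],
})

# The comparison is applied element-wise and yields a Boolean column.
# A rating strictly above 3 counts as favorable; a rating of exactly 3 does not.
toy_ratings["Favorable"] = toy_ratings["Rating"] > 3

print(toy_ratings)
```

Note that the comparison is strict, so a rating of 3 is treated as not favorable.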

We will sample our dataset to form the training data. Sampling also reduces the size of the dataset to be searched, which makes the Apriori algorithm run faster. We take all reviews from the first 200 users, selected here as those whose UserID falls in range(200):

ratings = all_ratings[all_ratings['UserID'].isin(range(200))]
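As a rough illustration of how this filter behaves, the sketch below applies the same `isin` pattern to a hypothetical table. The name `toy` and the IDs are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical ratings with user IDs on both sides of the cut-off.
toy = pd.DataFrame({
    "UserID": [1, 2, 250, 199, 200],
    "MovieID": [10, 10, 20, 30, 30],
    "Rating": [5, 4, 3, 2, 5],
})

# isin() builds a Boolean mask; range(200) covers the IDs 0 through 199,
# so a user with ID 200 is excluded.
sample = toy[toy["UserID"].isin(range(200))]

print(sample)
```

Because `range(200)` stops at 199, the number of users actually selected depends on where the dataset's user IDs start.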

Next, we can create a dataset of only the favorable reviews in our sample:

favorable_ratings_mask = ratings["Favorable"]
favorable_ratings = ratings[favorable_ratings_mask]

We will be searching each user's favorable reviews for our itemsets. So, the next thing we need is the set of movies to which each user has given a favorable rating. We can compute this by grouping the dataset by UserID and iterating over the movies in each group:

favorable_reviews_by_users = {k: frozenset(v.values) for k, v in favorable_ratings.groupby("UserID")["MovieID"]}

In the preceding code, we stored the values as a frozenset, allowing us to quickly check whether a user has given a movie a favorable review.

Membership tests on sets are much faster than on lists, because a set does not have to be scanned element by element. We will use sets in later code for this reason.
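The grouping step and the set lookups it enables can be sketched on a small example. The data and the names `toy_favorable` and `movies_by_user` are hypothetical; only the pandas calls are the same as those used above.

```python
import pandas as pd

# Hypothetical favorable reviews: one row per (user, movie) pair.
toy_favorable = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3],
    "MovieID": [10, 20, 10, 20, 30],
})

# Iterating over the grouped column yields (UserID, Series of MovieIDs) pairs.
movies_by_user = {
    user_id: frozenset(movie_ids.values)
    for user_id, movie_ids in toy_favorable.groupby("UserID")["MovieID"]
}

# Membership test: did user 1 review movie 10 favorably?
print(10 in movies_by_user[1])

# Subset test: did user 1 review every movie in a candidate itemset favorably?
print(frozenset({10, 20}).issubset(movies_by_user[1]))
```

The subset test is the operation that matters most when searching for itemsets, since a candidate itemset is supported by a user only if all of its movies are in that user's set.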

Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:

num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
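The sum works as a count because Boolean values are treated as 1 (True) and 0 (False) when added. A minimal sketch with hypothetical data (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical reviews with a Boolean Favorable column.
toy = pd.DataFrame({
    "MovieID": [10, 10, 20, 20, 20, 20, 30],
    "Favorable": [True, True, True, False, True, True, False],
})

# Summing Booleans within each group counts the True values,
# giving the number of favorable reviews per movie.
counts = toy[["MovieID", "Favorable"]].groupby("MovieID").sum()

print(counts)
```

A movie with no favorable reviews in the sample still appears in the result, with a count of 0.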

We can see the top five movies by running the following code:

num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head()

The result lists the five movies with the most favorable reviews. We only have their IDs for now and will get their titles later in the chapter.
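The sorting step can be checked on a small table of counts. This is a sketch with hypothetical numbers; `toy_counts` stands in for a frame indexed by MovieID with a single Favorable column.

```python
import pandas as pd

# Hypothetical favorable-review counts, indexed by MovieID.
toy_counts = pd.DataFrame(
    {"Favorable": [2, 7, 0, 5, 3, 9]},
    index=pd.Index([10, 20, 30, 40, 50, 60], name="MovieID"),
)

# Sort from most to least favorable reviews, then keep the first three rows.
# head() defaults to five rows when called without an argument.
top_three = toy_counts.sort_values(by="Favorable", ascending=False).head(3)

print(top_three)
```

The MovieID values are carried along as the index, which is why the result identifies movies by ID rather than by title.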