Anis Abboud
The Task
Throughout this project, I developed and used multiple machine learning algorithms to predict whether a user is going to give a movie a good rating (>=4) or not:

1. Decision Trees (DT) - implemented
2. Association Rules (AR) - implemented
3. Support Vector Machines (SVM) - used libsvm
4. K-Nearest Neighbors (KNN) - used scikit-learn
5. Naive Bayes - used scikit-learn
Attributes used
The following attributes and modifications were used from the provided data:

Movie:
- Decade - extracted from the movie title, e.g., 1990s for The Shawshank Redemption (1994). Due to different technologies and themes in different decades, this information could be useful.
- Genre - e.g., Action, Drama, etc.
User:
- Gender (M/F)
- Age group
- Occupation
- Zip code prefix - extracted from the given zip code. The full zip code has a lot of variation and little meaning, but the first digit tells approximately which region/states the user is from. Since the mindset might vary, e.g., between the east coast and the west coast, this information could be useful. On the next slide, you can see a US map split into the 10 zip code regions.
http://en.wikipedia.org/wiki/File:ZIP_Code_zones.svg
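The two extraction steps above (decade from the title, region from the zip code prefix) might look like the following minimal sketch; the function names are mine, not from the project code:

```python
import re

def extract_decade(title):
    # Pull the release year out of a title like
    # 'The Shawshank Redemption (1994)' and bucket it into a decade string.
    match = re.search(r'\((\d{4})\)', title)
    if match is None:
        return None
    year = int(match.group(1))
    return f"{year // 10 * 10}s"

def zip_prefix(zip_code):
    # Keep only the first digit of the zip code, which identifies
    # one of the 10 US zip code regions.
    return str(zip_code).strip()[0]

print(extract_decade("The Shawshank Redemption (1994)"))  # -> 1990s
print(zip_prefix("94305"))  # -> 9
```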
Technical note
Throughout this presentation, I focus on the data subset that was provided for the third part of the project: movies released in or after year 2000, classifying reviews with 4 or 5 stars as positive. This is because:

1. The requirements for the first 3 parts of the project varied, and this is the only subset that was consistent throughout them all. Thus, it is the only one on which I can run clear comparisons between all the algorithms.
2. It saves on runtime.

Since the data subset contains only 28K instances out of 700K total, the accuracies I received are expected to go up when using the entire data, leading to better results. In addition, since all the movies in this subset are from the 2000s, the decade attribute had no added value, while it could be valuable if used with the entire data.
1. Decision Trees
Technical notes:
- Nodes are split based on the attribute that minimizes the entropy.
- Leaf nodes with fewer than 10 examples were pruned to avoid over-fitting, and classification inside a node is based on the average.
- For numerical attributes, range splits were checked (e.g., age <= 25) in addition to standard splits (e.g., age == 25 or gender == M).
- Since many movies have multiple genres, splits were based on whether the movie has the genre or not (e.g., genres include Action).
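The entropy-minimizing split criterion can be sketched as follows. This is a minimal illustration, not the project's actual implementation; the dictionary-based example format and the boolean split tests are assumptions:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of a list of class labels (e.g., 'yes'/'no').
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def split_entropy(examples, labels, test):
    # Weighted average entropy of the two child nodes produced by a
    # boolean split test, e.g. lambda ex: ex['age'] <= 25 (range split)
    # or lambda ex: 'Action' in ex['genres'] (genre membership split).
    left = [label for ex, label in zip(examples, labels) if test(ex)]
    right = [label for ex, label in zip(examples, labels) if not test(ex)]
    total = len(labels)
    result = 0.0
    if left:
        result += len(left) / total * entropy(left)
    if right:
        result += len(right) / total * entropy(right)
    return result
```

The tree builder would evaluate `split_entropy` for every candidate test and split on the one with the lowest value.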
Throughout this presentation, we will compare the different approaches based on their MCC score (Matthews correlation coefficient) - the higher the better.
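The MCC can be computed directly from the confusion-matrix counts; a minimal sketch:

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from confusion-matrix counts.
    # Ranges from -1 (total disagreement) through 0 (random) to +1 (perfect).
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # Conventional value when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator
```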
Using averages
We have a limited set of attributes for the movies and users, but still, can we do better? Idea:
- If a movie is rated high (or low) by most users, it's more likely to get rated high (or low) by a new user.
- If a user rates most movies high (or low), they are likely to rate a new movie high (or low) as well.

So, let's add a new attribute for each movie and for each user - average rating - and calculate its value based on past reviews for that movie/user in the training data.
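Building the two average-rating attributes from the training data could look like this minimal sketch; the `(user_id, movie_id, rating)` tuple format is an assumption:

```python
from collections import defaultdict

def average_rating_attributes(training_reviews):
    # Compute the average past rating per movie and per user, using the
    # training data only. Each review is assumed to be a
    # (user_id, movie_id, rating) tuple.
    movie_totals = defaultdict(lambda: [0.0, 0])  # movie_id -> [sum, count]
    user_totals = defaultdict(lambda: [0.0, 0])   # user_id -> [sum, count]
    for user_id, movie_id, rating in training_reviews:
        movie_totals[movie_id][0] += rating
        movie_totals[movie_id][1] += 1
        user_totals[user_id][0] += rating
        user_totals[user_id][1] += 1
    movie_avg = {m: total / count for m, (total, count) in movie_totals.items()}
    user_avg = {u: total / count for u, (total, count) in user_totals.items()}
    return movie_avg, user_avg
```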
We notice a huge bump in the MCC score after applying this! From 17% to 28%.
2. Association Rules
In the second part of the project, we had to implement association rules. We learned two algorithms for mining them:

1. Apriori: simple but not efficient.
2. FP-Growth tree: efficient but very complex to code.

Observing that each of these two algorithms has a major downside, I came up with my own algorithm, which I call HybridAlgorithm, and which strikes a compromise between efficiency and complexity:

1. It's not complicated - it can be explained in the next two slides.
2. It's efficient - it ran on the 700K examples in ~10 minutes.
Here are the results, both without and with using the average rating attributes explained in the previous slide:
AR results:

                    accuracy  precision  recall  f-score  mcc
without averages    0.56      0.647      0.304   0.41     0.157
with averages       0.65      0.675      0.594   0.63     0.303
For classification, it collects all the positive association rules that the given review applies to and calculates the average of their confidence. Similarly for the negative rules. Then it classifies based on the higher average.
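The classification step just described can be sketched like this. The function name is mine; the frozenset-to-confidence rule representation follows the mining code shown next:

```python
def classify_review(review_items, positive_rules, negative_rules):
    # positive_rules / negative_rules map frozenset(itemset) -> confidence.
    # A rule applies to a review when all of its items appear in the review.
    def average_confidence(rules):
        matched = [conf for itemset, conf in rules.items()
                   if itemset <= review_items]
        return sum(matched) / len(matched) if matched else 0.0

    positive = average_confidence(positive_rules)
    negative = average_confidence(negative_rules)
    # Classify based on the higher average confidence.
    return 'yes' if positive >= negative else 'no'
```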
def RecursiveItemsetBuild(rules, current_itemset, candidates, next_candidate_index,
                          class_data_subset, class_data_count, all_data_subset):
    # MIN_SUPPORT and MIN_CONFIDENCE are module-level thresholds.
    if next_candidate_index == len(candidates):  # Stop condition.
        return
    # First, try without the current candidate.
    RecursiveItemsetBuild(rules, current_itemset, candidates,
                          next_candidate_index + 1,
                          class_data_subset, class_data_count, all_data_subset)
    # Second, try with the current candidate.
    item = candidates[next_candidate_index]
    new_itemset = current_itemset + [item]
    new_class_data_subset = [review for review in class_data_subset
                             if item in review.ar_items]
    new_all_data_subset = [review for review in all_data_subset
                           if item in review.ar_items]
    # float() guards against integer division.
    support = len(new_class_data_subset) / float(class_data_count)
    if support >= MIN_SUPPORT:
        confidence = len(new_class_data_subset) / float(len(new_all_data_subset))
        if confidence >= MIN_CONFIDENCE:
            rules[frozenset(new_itemset)] = confidence
        # Keep extending the itemset as long as support holds
        # (support can only shrink as items are added).
        RecursiveItemsetBuild(rules, new_itemset, candidates,
                              next_candidate_index + 1,
                              new_class_data_subset, class_data_count,
                              new_all_data_subset)
This function is called twice, once for mining the positive rules and once for mining the negative rules as follows:
RecursiveItemsetBuild(rules (a dictionary to be filled by the recursion), [], the 68 candidates, 0, the positive/negative data, the number of positive/negative examples, all the data)
[Results table comparing AR, SVM, scikit-learn KNN, and scikit-learn Naive Bayes]
From this, we conclude that SVM provides the best predictions, with an MCC score of 34%.
However, if you are directing a new movie, you have no access to the average rating (as the movie has not been released yet). What can you still learn from the data?
The association rules algorithm (without using the average rating attribute) mined 63 positive rules and 39 negative rules of the form:

genre=Drama,occupation=17,genre=Action->yes: 0.796791
genre=Thriller,genre=Sci-Fi,genre=Horror->no: 0.779783
All the rules can be found in the notes section of this slide (below).
Conclusions
If you are looking to predict whether a user is going to like a movie (give a good review), e.g., in order to decide which movies to recommend to that user, then I would recommend using the SVM classifier, which achieved the highest MCC score of 34%.

If you are looking to make a new movie, I would recommend going for the following genres: Drama, Action, Comedy, Animation, War, and Children's, and avoiding: Thriller, Horror, Mystery, Sci-Fi, War, and Adventure. This is based on the previous slide, in which I show a simple classifier that classifies only based on this and achieves an MCC score of 13%, which is very close to what the real complex algorithms achieved without using the average rating attribute, and is actually higher than the MCC score of 10% that was required for the second homework (association rules).

Although we could learn a lot from the given data, the data is very limited and lacking, making it hard to achieve higher accuracies and scores, even when using complex algorithms like SVM. To better predict a movie's performance, one needs to consider numerous additional attributes, including: cast, director, budget, sequel or not, marketing, popularity, etc.
Future work
1. Get more data and attributes. This will facilitate achieving better results.
2. Try more algorithms, e.g., in WEKA.
3. Analyze the value of each attribute. E.g., is gender actually useful? Remove each attribute from the data and check whether that drives the scores down; if yes, it's probably a useful attribute.
4. Parameter tuning: many of the algorithms used have parameters that can be tuned to try to get better results, e.g., the c (cost) parameter for SVM.
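The parameter-tuning idea can be sketched generically as a hold-out search. This is a minimal sketch; `train_and_score` is a hypothetical callable that trains a model with the given parameter value (e.g., SVM's cost c) and returns a validation score such as MCC:

```python
def tune_parameter(candidate_values, train_and_score):
    # Evaluate each candidate value of a parameter and keep the one with
    # the best validation score. train_and_score is a hypothetical
    # user-supplied callable: value -> validation score.
    best_value, best_score = None, float('-inf')
    for value in candidate_values:
        score = train_and_score(value)
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score
```

In practice one would evaluate candidates on a scale such as c in {0.01, 0.1, 1, 10, 100} and validate with cross-validation rather than a single hold-out split.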