
Research Article

Movie Recommender System using Genetic Algorithm
Jyoti Joshi1

Abstract
Recommender systems have become extremely common in recent years and are used in a variety of areas; popular applications include movies, music, news, books, research articles, search queries, social tags, and products in general. Traditional recommender systems mainly use content-based or collaborative filtering techniques. These systems use only the product ratings given by users to predict or recommend new items for a user, and do not consider other attributes when generating recommendations.

This article describes a new recommendation system that uses a genetic algorithm to learn the preferences of users and provides recommendations based on these preferences. This research uses the MovieLens (http://www.movielens.umn.edu) database, and the genetic algorithm combines 22 features drawn from different files in the dataset. These features are then used to train the system. The 22 features are movie rating, age, sex, occupation and 18 movie genres: action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war and western.

Keywords: Content-based filtering, Collaborative filtering, Genetic algorithms, Recommender system

Introduction
In everyday life it is often necessary to make a decision without personal experience of the available alternatives. When there are many alternatives, it is difficult for users to choose well, so people rely on recommendations drawn from other people's knowledge, or on advertisements and product reviews, whether offline or online.

Recommender systems are therefore especially useful in the current age of the internet, where people buy all sorts of products online, including daily essentials such as groceries. Many of the largest e-commerce and social media companies use recommender systems to help their customers find items they would like to purchase. These systems provide search results tailored to the user's own preferences.1

Recommender systems generally use content-based, collaborative or hybrid techniques for recommendations. In this article, a new recommendation system is proposed that combines an elitist genetic algorithm with some features of collaborative filtering and trains it on 22 movie features to generate recommendations.

1 Dr A.P.J. Abdul Kalam Technical University, Lucknow.
E-mail Id: jyotijoshi1222@gmail.com

Orcid Id: http://orcid.org/0000-0001-8269-4082


How to cite this article: Joshi J. Movie Recommender System using Genetic Algorithm. J Adv Res Appl Arti Intel Neural Netw 2017;
4(1&2): 28-35.

© ADR Journals 2017. All Rights Reserved.


J. Adv. Res. Appl. Arti. Intel. Neural Netw. 2017; 4(1&2) Joshi J

This article is organized as follows. Section II reviews related work and describes the structure of the proposed recommender system. Section III explains the genetic algorithm used. Section IV presents the experimental results and analysis, and Section V concludes the article.

Related Work

Recommender Systems

The main issue for a recommender system is how to recommend items from the available resources that are tailored to the user's preferences. The recommender system must also recognize and provide items that correspond to users' favorites. There are 2 main approaches to this problem: collaborative filtering and content-based filtering.2

In the collaborative filtering approach, the recommender system provides recommendations by collecting users' profiles and discovering relations between them. After identifying the correlation between profiles, the system groups users whose profiles are similar to one another. The system then recommends items derived from other profiles in the same group. The advantage of this approach is that it has a high probability of recommending items that match the user's preferences, because it provides an environment in which users can share their own profiles.3,4

In the alternative approach, content-based filtering, the recommender system examines the descriptions of the items that users have rated higher than others. The system then analyzes the similarity between the examined items and all of the remaining items, and recommends new items ordered by their similarity to the selected items.3,4 However, this approach has the limitation that it focuses only on items that have already been accessed.

We combine collaborative filtering with an elitist genetic algorithm and use not only the rating of each movie but also other features, such as age, gender and movie genres, to train the system and generate recommendations for the user.

Generating Profiles

Before recommendations can be made, the movie data is processed into separate profiles, one for each person, defining that person's movie preferences. Profile (j, i) is defined to mean the profile for user j on movie item i (see Fig. 1). The profile of j, profile (j), is therefore the collection of profile (j, i) for all the movies i that j has seen.

Feature   1 (Rating)   2 (Age)   3 (Gender)   4 (Occupation)   5-22 (18 genre frequencies)
Value     4            35        0            20               000000100010001100

Figure 1. Profile (j, i): profile for user j with rating on movie item i, where i has a rating of 4
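The profile layout of Figure 1 can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the names `Profile`, `parse_genre_bits` and `build_profiles`, the numeric codes for gender and occupation, and the in-memory input structures are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Profile:
    """profile(j, i): the 22 features of user j for one rated movie i."""
    rating: int          # feature 1
    age: int             # feature 2
    gender: int          # feature 3 (numeric code; encoding assumed)
    occupation: int      # feature 4 (numeric code; encoding assumed)
    genres: List[int]    # features 5-22: the 18 genre bits of the movie

    def as_vector(self) -> List[int]:
        """Flatten the profile into the 22-feature vector of Figure 1."""
        return [self.rating, self.age, self.gender, self.occupation, *self.genres]


def parse_genre_bits(bits: str) -> List[int]:
    """Turn a string such as '000000100010001100' into 18 integer flags."""
    if len(bits) != 18 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 18 genre bits")
    return [int(b) for b in bits]


def build_profiles(
    ratings: Iterable[Tuple[int, int, int]],        # (user id, movie id, rating)
    users: Dict[int, Tuple[int, int, int]],         # user id -> (age, gender, occupation)
    movie_genres: Dict[int, List[int]],             # movie id -> 18 genre bits
) -> Dict[int, Dict[int, Profile]]:
    """Group the profile(j, i) entries by user j, giving profile(j)."""
    profiles: Dict[int, Dict[int, Profile]] = {}
    for user_id, movie_id, rating in ratings:
        age, gender, occupation = users[user_id]
        entry = Profile(rating, age, gender, occupation, movie_genres[movie_id])
        profiles.setdefault(user_id, {})[movie_id] = entry
    return profiles


# The example row of Figure 1.
example = Profile(rating=4, age=35, gender=0, occupation=20,
                  genres=parse_genre_bits("000000100010001100"))
```

Keeping profile (j) as a dictionary keyed by movie id makes it cheap to find the movies two users have in common, which the later similarity calculation relies on.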

Once profiles are built, the process of recommendation can begin. Given an active user A, a set of neighborhood profiles similar to profile (A) must be found.

The ml100k data from the MovieLens database is used. From this data, the u.item, u.data and u.user files are used to create the user profiles. The u.item file contains the movie Id and movie name together with 18 bits corresponding to movie genres; the movie Id and genres are used from this file. Each entry in the u.data file has a user Id, a movie Id and the corresponding rating, so for each user multiple (movie Id, rating) entries are created.

The data collected from these 2 files is combined with the u.user file to create profile (j, i). The u.user file contains the user Id, age, gender and occupation fields for each user. For each (user Id, movie Id) pair, an entry with the corresponding rating, age, gender, occupation and genres is created.

Selecting Neighboring Profiles

The success of a collaborative filtering system is highly dependent upon finding a neighborhood of profiles that are most similar to that of the active user. So only the best or closest profiles should be chosen and used to generate new recommendations for the user.

In an ideal world the entire database of profiles would be used to select the best possible profiles, but this is not feasible when the dataset is very large. Thus, most systems opt for random sampling, and this is what is done in this algorithm.

Once a set of profiles is selected, the distance or similarity between the selected profiles and the current user's profile must be computed. Most current recommender systems use standard algorithms that consider only the movie ratings when comparing 2 profiles. In real life, however, two people are said to be similar not only on the basis of their opinions on a particular subject but also on other factors such as their background and preferences.

We can apply the same idea here and consider demographic information such as the user's age and gender, together with preferences for movie genres. Each user places a different importance or priority on each feature. The current approach shows how the weights defining a user's priorities can be evolved by a genetic algorithm.

The comparison between two profiles can now be conducted using a modified Euclidean distance function that takes multiple features into account. Euclidean (A, B) is the similarity between active user A and user B:

$$\mathrm{euclidean}(A, B) = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{22} w_j \, \mathrm{diff}_{i,j}(A, B)^2}$$

Where: A is the active user; B is a user provided by the profile selection process, B ≠ A; n is the number of common movies that users A and B have rated; j is one of the 22 features; w_j is the active user's weight for feature j; i is a common movie item, for which profile (A, i) and profile (B, i) exist; diff_{i,j}(A, B) is the difference in profile value for feature j between users A and B on movie item i.

Before this calculation is made, the profile values are normalized to ensure that they lie between 0 and 1. When the weight for any feature is zero, that feature is ignored. In this way feature selection is made adaptive to each user's preferences. The difference in the profile values for occupation is either 0, if the 2 users have the same occupation, or 1 otherwise.

Making Recommendations

Once the Euclidean distances, Euclidean (A, B), have been found between profile (A) and profile (B), profile (C), profile (D) and so on, the best profiles are found. Each profile is ranked according to its similarity to profile (A). The system then simply selects the users whose Euclidean distance is above a threshold value as the neighborhood of A. This threshold is a system constant that can be changed; for the results presented in this article it was kept at 0.2. To make a recommendation for user A, it is necessary to find movie items seen and liked by the users in the neighborhood set that the active user has not seen. These recommendations are then presented to the active user through a user interface.
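The neighborhood selection and recommendation steps described above can be sketched as follows. This is an illustration only: the function names, the input dictionaries and the `liked_at_least` cutoff for what counts as "liked" are assumptions, while the 0.2 threshold and the "above the threshold" rule follow the text.

```python
from typing import Dict, List, Set

THRESHOLD = 0.2  # system constant used for the results reported in the article


def select_neighborhood(scores: Dict[int, float],
                        threshold: float = THRESHOLD) -> List[int]:
    """Rank sampled users by their Euclidean score for the active user
    and keep those whose score is above the threshold."""
    ranked = sorted(scores, key=lambda user: scores[user], reverse=True)
    return [user for user in ranked if scores[user] > threshold]


def recommend(active_seen: Set[int],
              neighbor_ratings: Dict[int, Dict[int, int]],
              neighborhood: List[int],
              liked_at_least: int = 4) -> List[int]:
    """Return movies that neighbors have seen and liked
    but that the active user has not seen."""
    candidates = set()
    for user in neighborhood:
        for movie, rating in neighbor_ratings[user].items():
            # "liked" is assumed to mean a rating of liked_at_least or more
            if rating >= liked_at_least and movie not in active_seen:
                candidates.add(movie)
    return sorted(candidates)


scores = {2: 0.5, 3: 0.1, 4: 0.3}          # hypothetical scores for sampled users
neighborhood = select_neighborhood(scores)
suggestions = recommend({10}, {2: {10: 5, 11: 4, 12: 2}, 4: {13: 5}}, neighborhood)
```

In this toy run user 3 falls below the threshold, so only the ratings of users 2 and 4 contribute candidate movies.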

Figure 2. Calculating the similarity between A and B
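The similarity calculation of Figure 2 can be sketched as follows, assuming the modified Euclidean function is a weighted sum of squared feature differences over the movies both users have rated. The function and argument names are hypothetical. Profiles are assumed to be already normalized to the range 0 to 1, and occupation (index 3 here) is compared by equality as the text describes.

```python
import math
from typing import Dict, Sequence

OCCUPATION = 3  # position of the occupation feature in the 22-feature vector


def feature_diff(a: Sequence[float], b: Sequence[float], j: int) -> float:
    """Difference in profile value for feature j between two users on one movie."""
    if j == OCCUPATION:
        return 0.0 if a[j] == b[j] else 1.0   # same occupation -> 0, otherwise 1
    return a[j] - b[j]


def euclidean(profile_a: Dict[int, Sequence[float]],
              profile_b: Dict[int, Sequence[float]],
              weights: Sequence[float]) -> float:
    """Weighted Euclidean score between users A and B over their common movies.

    profile_a and profile_b map movie id -> normalized 22-feature vector;
    weights holds the active user's 22 feature weights.
    """
    common = profile_a.keys() & profile_b.keys()
    total = 0.0
    for movie in common:
        a, b = profile_a[movie], profile_b[movie]
        for j, w in enumerate(weights):
            if w == 0.0:
                continue                       # zero weight: feature is ignored
            total += w * feature_diff(a, b, j) ** 2
    return math.sqrt(total)


user_a = {1: [1.0, 0.2, 0.0, 5] + [0] * 18}
user_b = {1: [0.5, 0.2, 0.0, 7] + [0] * 18, 2: [1.0] * 22}
rating_and_age_only = [0.5, 0.5] + [0.0] * 20
score = euclidean(user_a, user_b, rating_and_age_only)
```

Only movie 1 is shared by both users here, so movie 2 does not enter the sum, and the differing occupations have no effect because their weight is zero.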


Proposed Genetic Algorithm

Genetic Algorithms (GAs) are stochastic search methods inspired by the mechanisms of natural evolution and genetic inheritance. GAs work on a population of candidate solutions; each solution has a fitness value indicating its closeness to the optimal solution of the problem. The solutions with higher fitness values than others are selected and survive to the next generation. GAs then produce better offspring, i.e. new solutions, by combining the selected solutions. The methods can discover, preserve, and propagate promising sub-solutions.5,6

Some Basic Terminology of GAs:

• Population − A subset of all the possible (encoded) solutions to the given problem. The population for a GA is analogous to a population of human beings, except that candidate solutions take the place of human beings.
• Chromosome − One such solution to the given problem.
• Gene − One element position of a chromosome.
• Allele − The value a gene takes for a particular chromosome.
• Genotype − The population in the computation space. In the computation space, the solutions are represented in a way that can be easily understood and manipulated by a computing system.
• Phenotype − The population in the actual real-world solution space, in which solutions are represented as they are in real-world situations.

An elitist genetic algorithm was chosen for this task, where the best quarter of the individuals in the population are kept for the next generation. When creating a new generation, individuals are selected randomly from the top 40% of the whole population to be parents. Two offspring are produced from each pair of parents, using single-point crossover with probability 1.0. Mutation is applied to each locus in the genotype with probability 0.01. A simple unsigned binary genetic encoding is used in the implementation, using 8 bits for each of the 22 genes. The GA begins with random genotypes.

A genotype is mapped to a phenotype (a set of feature weights) by converting the alleles of the binary genes to decimal. The feature weights can then be calculated from these real values. First, a given factor reduces the importance of the 18 movie genres. This is done because the 18 genres are actually subcategories of a larger feature, genre. Reducing the effect of these weights is therefore intended to give the other, unrelated features (movie rating, age, gender and occupation) a more equal chance of being used. Second, the total value of the phenotype is calculated by summing the real values for all 22 features. Finally, the weighting value for each feature is found by dividing its real value by the total value. The weights then sum to unity.

Fitness Function

Generating good recommendations depends on finding a good set of weights for the 22 features. A poor set of weights would result in a poor neighborhood set of profiles for the active user and hence poor recommendations. A good set of weights would result in a good neighborhood set and so good recommendations.

It was decided to reformulate the problem as a supervised learning task. It is possible to predict what active user A might think of movies. For example, if a certain movie is suggested because similar users saw it but rated it only as 'average', then it is likely that the active user would also consider the movie 'average'. Hence, for the MovieLens database it is possible both to recommend new movies and to predict how the user would rate each movie after seeing it.

The predicted vote computation is taken from the literature and modified such that the Euclidean distance function replaces the weight in the original equation.7 The predicted vote, predict_vote (A, i), for user A on item i can be defined as:

$$\mathrm{predict\_vote}(A, i) = \mathrm{mean}_A + k \sum_{j=1}^{n} \mathrm{euclidean}(A, j) \, (\mathrm{vote}(j, i) - \mathrm{mean}_j)$$

Where: mean_j is the mean vote for user j; k is the normalizing factor such that the sum of the Euclidean distances is equal to 1; vote (j, i) is the actual vote of user j for item i; n is the size of the neighborhood.

All the movie items that the active user has seen are randomly partitioned into two datasets: a training set (1/3) and a test set (2/3). To calculate a fitness measure for an evolved set of weights, the recommender system finds a set of neighborhood profiles for the active user as described in Section II. The ratings of the users in the neighborhood set are then used to compute the predicted rating for the active user on each movie item in the training set. Because the active user has already rated these movies, it is possible to compare the actual rating with the predicted rating. So the fitness score is computed as the average of the differences between the actual and predicted ratings of all movies in the training set. This score is used to guide the future generations of weight evolution, see Fig. 3.
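The genotype-to-weight mapping and one elitist generation step from the Proposed Genetic Algorithm section can be sketched as follows. This is a hedged illustration, not the author's code. The genre scaling factor is not stated in the text, so `GENRE_SCALE = 18.0` is purely an assumption, as are the function names and the equal-weight fallback for an all-zero genotype. Because the fitness score is an average prediction error, the sketch treats a lower score as better.

```python
import random
from typing import Callable, List

N_GENES, BITS = 22, 8        # 22 features, 8 bits each
GENRE_SCALE = 18.0           # assumed value; the text only says "a given factor"
MUTATION_RATE = 0.01         # per-locus mutation probability


def decode(genotype: List[int]) -> List[float]:
    """Map a 176-bit genotype to 22 feature weights that sum to 1."""
    values = []
    for g in range(N_GENES):
        bits = genotype[g * BITS:(g + 1) * BITS]
        value = float(int("".join(map(str, bits)), 2))   # unsigned binary -> decimal
        if g >= 4:                                       # features 5-22 are the genres
            value /= GENRE_SCALE
        values.append(value)
    total = sum(values)
    if total == 0:                                       # assumed fallback: equal weights
        return [1.0 / N_GENES] * N_GENES
    return [v / total for v in values]


def next_generation(population: List[List[int]],
                    error: Callable[[List[int]], float],
                    rng: random.Random) -> List[List[int]]:
    """One elitist step: keep the best quarter, breed the rest from the top 40%.

    Assumes the population holds at least two individuals.
    """
    ranked = sorted(population, key=error)               # lowest error first
    elite = [ind[:] for ind in ranked[:len(ranked) // 4]]
    parents = ranked[:max(2, int(len(ranked) * 0.4))]
    children: List[List[int]] = []
    while len(elite) + len(children) < len(population):
        mum, dad = rng.sample(parents, 2)
        cut = rng.randrange(1, len(mum))                 # single-point crossover, p = 1.0
        for child in (mum[:cut] + dad[cut:], dad[:cut] + mum[cut:]):
            children.append([b ^ 1 if rng.random() < MUTATION_RATE else b
                             for b in child])
    return (elite + children)[:len(population)]
```

A caller would start from random genotypes, for example `[[rng.randint(0, 1) for _ in range(N_GENES * BITS)] for _ in range(population_size)]`, and pass an `error` function that decodes each genotype and scores the resulting weights on the training set.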
Figure 3. Finding the Fitness Score of an Individual (The Active User's Feature Weights); diagram block: Profile Selection and Matching
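The fitness computation of Figure 3 can be sketched as follows. The sketch follows the predicted-vote definition in the Fitness Function section, with `k` chosen so that the Euclidean scores of the contributing neighbors sum to 1. Several details are assumptions rather than statements from the article: only neighbors who have voted on the item contribute, the "difference" between actual and predicted ratings is taken as an absolute difference, and all names are hypothetical.

```python
from typing import Dict


def predict_vote(mean_active: float,
                 scores: Dict[int, float],
                 neighbor_votes: Dict[int, Dict[int, float]],
                 neighbor_means: Dict[int, float],
                 item: int) -> float:
    """mean_A + k * sum over neighbors j of euclidean(A, j) * (vote(j, i) - mean_j)."""
    voters = [j for j in scores if item in neighbor_votes[j]]
    norm = sum(scores[j] for j in voters)
    if norm == 0:
        return mean_active           # no usable neighbor: fall back to the user's mean
    k = 1.0 / norm                   # normalizing factor
    return mean_active + k * sum(
        scores[j] * (neighbor_votes[j][item] - neighbor_means[j]) for j in voters)


def fitness(training_votes: Dict[int, float],
            mean_active: float,
            scores: Dict[int, float],
            neighbor_votes: Dict[int, Dict[int, float]],
            neighbor_means: Dict[int, float]) -> float:
    """Average absolute difference between actual and predicted training votes."""
    errors = [abs(actual - predict_vote(mean_active, scores, neighbor_votes,
                                        neighbor_means, item))
              for item, actual in training_votes.items()]
    return sum(errors) / len(errors)


# Toy neighborhood of two users who both rated movie 7.
scores = {1: 0.6, 2: 0.4}
neighbor_votes = {1: {7: 5}, 2: {7: 3}}
neighbor_means = {1: 4.0, 2: 3.0}
predicted = predict_vote(3.0, scores, neighbor_votes, neighbor_means, 7)
```

Under these assumptions a lower fitness value means the evolved weights produced a neighborhood whose votes track the active user's own ratings more closely.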

Experiments and Result Analysis

Experiments

Four sets of experiments were designed to observe the difference in performance between the evolutionary recommender system and a standard, non-adaptive recommender system based on the Pearson algorithm.7 In each set of experiments, the predicted votes of all the movie items in the test set (the items that the active user has rated but that were not used in weight evolution) were computed using the final feature weights for that run. These votes were then compared against those produced by the simple Pearson algorithm.
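Scoring predicted votes against the active user's actual test-set ratings can be sketched as follows. The article does not specify when a prediction counts as correct, so this sketch assumes a predicted vote is correct when it rounds to the actual rating; the function names are hypothetical, and Python's built-in `round` rounds exact halves to the nearest even integer.

```python
from typing import Dict, List


def percent_correct(predicted: Dict[int, float], actual: Dict[int, int]) -> float:
    """Percentage of the active user's test ratings that were predicted correctly."""
    hits = sum(1 for item, vote in actual.items()
               if item in predicted and round(predicted[item]) == vote)
    return 100.0 * hits / len(actual)


def best_run(run_accuracies: List[float]) -> float:
    """Pick the accuracy of the best of several GA runs for one active user."""
    return max(run_accuracies)


actual = {1: 4, 2: 3, 3: 5}                 # hypothetical test-set ratings
predicted = {1: 3.6, 2: 2.2, 3: 4.9}        # hypothetical predicted votes
accuracy = percent_correct(predicted, actual)
```

The same `percent_correct` function can be applied to the votes of either predictor, so the GA recommender and the Pearson baseline are measured on identical test items.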

Figure 4. Result for Experiment 1

Figure 5. Result for Experiment 2

Figure 6. Result for Experiment 3

Figure 7. Result for Experiment 4

Experiment 1: Each of the first 10 users was picked as the active user in turn, and the first 10 users (fixed) were used to provide recommendations.

Experiment 2: Each of the first 10 users was picked as the active user in turn, and 10 users were picked randomly and used to provide recommendations.

Experiment 3: Each of the first 50 users was picked as the active user in turn, and the first 50 users (fixed) were used to provide recommendations.

Experiment 4: Each of the first 50 users was picked as the active user in turn, and 50 users were picked randomly and used to provide recommendations.

Each graph above shows the percentage of ratings that the system predicted correctly out of the total number of ratings available for the current active user. Whilst the predictions computed with the Pearson algorithm always remain the same given the same parameter values, those obtained from the GA vary with the feature weights of each run. Out of the 10 runs for each active user in each experiment, the run with the best feature weights (those that gave the highest percentage of correct predictions) was chosen and plotted against the result from the Pearson algorithm.

In the first experiment, the GA recommender performed as well as or better than the Pearson algorithm on 7 active users out of 10. In the third experiment, the accuracy of the GA recommender fell below that of the Pearson algorithm for 17 of the 50 active users. For the rest of the active users, the accuracy of the GA recommender was better; in some cases (user 16) the difference was as great as 32%. The random sampling in experiment 2 showed great improvement in the prediction accuracy of the GA recommender: for all 10 active users it performed better than the Pearson algorithm.

The results for the last experiment show that the accuracy of the GA recommender was significantly better for all but 15 active users.

Analysis of Results

Experiment 1 indicates that the prediction accuracy for active users 6, 8 and 9 on the GA recommender was worse than that obtained using the Pearson algorithm. But when the number of users was increased to 50 in experiment 3, the accuracy for these three active users rose and outperformed the other algorithm. This was expected: as the number of users goes up, the probability of finding a better matched profile should be higher, and hence the accuracy of the predictions should also increase.

The results suggest that random sampling is a good choice for the profile selection task of retrieving profiles from the database. Random sampling was expected to be better than fixing which users to select because it allowed the search to consider a greater variety of profiles (potentially 10 * 10 runs = 100 users in experiment 2 and 50 * 10 runs = 500 users in experiment 4) and hence find a better set of well-matched profiles.

As mentioned earlier, only the run(s) with the best feature weights for each active user were considered for this analysis.

Looking at the final feature weights obtained for each active user, many interesting observations can be made.

Let's focus on a couple of active users: 4 and 27.

Figure 8. Feature Weights for Active User 4

The weights for features 5-22 would be lower because of the scaling factor applied.

Active user 4 is a 24-year-old male who is a technician by occupation. This user gives maximum preference to the 2nd feature, which is age. So it is likely that other users of a similar age group would be found in this user's neighborhood. From the feature weights it can be seen that he gives more preference to war, thriller and horror movies, which one might expect from a 24-year-old man.

The other active user analyzed, user 27, is a 40-year-old female who is a librarian by profession.

Figure 9. Feature Weights for Active User 27

This user gives more weight to age and gender. She has interests in the western, sci-fi, romance, drama, crime and children's genres. She is a 40-year-old female and so might have small children, which may be why she has interests in the sci-fi and children's genres. As a woman of her age and profession, she might also be expected to like romance and drama movies, as many of her peers do.

Conclusion

This work has shown how evolutionary search can be employed to fine-tune a profile-matching algorithm within a recommender system, tailoring it to the preferences of individual users.

This was achieved by reformulating the problem of making recommendations as a supervised learning task, enabling fitness scores to be computed by comparing predicted votes with actual votes. Experiments demonstrated that, compared to a non-adaptive approach, the evolutionary recommender system was able to successfully fine-tune the profile-matching algorithm. This enabled the recommender system to make more accurate predictions, and hence better recommendations to users.

References

1. Schafer J, Konstan J, Riedl J. Recommender Systems in E-commerce. ACM Conference on Electronic Commerce, USA. 1999. pp. 158-166.
2. Balabanovic M, Shoham Y. FAB: content-based, collaborative recommendation. Communications of the ACM 1997; 40(3): 66-72.
3. Burke R. Hybrid web recommender systems. The Adaptive Web - Lecture Notes in Computer Science, 2007. pp. 377-408.
4. Pazzani MJ. A Framework for Collaborative, Content-based and Demographic Filtering. Artificial Intelligence Review 1999; 13(5-6): 394-408.
5. Mitchell M. An Introduction to Genetic Algorithms. MIT Press, 1998.
6. Goldberg DE, Holland JH. Genetic algorithms and machine learning. Machine Learning 1988; 3(2-3): 95-9.
7. Breese JS, Heckerman D, Kadie C. Empirical analysis of predictive algorithms for collaborative filtering. Conference on Uncertainty in Artificial Intelligence, 1998. pp. 43-52.
