Recommendation Systems

Htoo Latt
8 min read · Jul 16, 2021


The goal of a recommendation system is to help expose people to things that they might enjoy or like to purchase. Without recommendation systems, people would only be exposed to the most popular items and content and would have a hard time finding the niche things that they like. Implementing a recommendation system in a service can increase user satisfaction and could directly affect the business’s profit.

In this blog post, I would like to talk about a movie recommender project that I undertook. I will go through the different models that I considered and the approach that I took.

My approach was iterative: I worked through several models, comparing their metrics and weighing the pros and cons of each.

The data that I worked with is the MovieLens dataset provided by the GroupLens research lab at the University of Minnesota. I had the option of working with the whole dataset or a smaller subset. I chose to work with the smaller set to save computational time.

Unpersonalized Recommendations

24 Top Rated Movies, and their ratings and genres.

The first recommendation system that I looked at is unpersonalized recommendations. Unpersonalized recommendations show only the most popular content. An example is a chart showing the most listened-to songs on Spotify. Although such a chart introduces the user to the most popular content, it fails to expose the user to songs from smaller artists that the user might enjoy a lot more. These systems work well for some types of content, such as the news, where the biggest stories are the most read and are shown on the first page.

To build this system, I took the average rating of every movie in my dataset and sorted the movies by that average. Some of the movies in the dataset have only one or two ratings, so their averages are unreliable and can be considered outliers. To get rid of these, I required each movie to have at least 20 ratings.
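The ranking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `top_rated`, its parameters, and the toy ratings are all made up for the example, and the data is assumed to be a list of (userId, movieId, rating) tuples.

```python
from collections import defaultdict

def top_rated(ratings, min_ratings=20, n=24):
    """Rank movies by mean rating, skipping movies with fewer than `min_ratings` ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user_id, movie_id, rating in ratings:
        totals[movie_id] += rating
        counts[movie_id] += 1
    # Keep only movies that pass the minimum-count threshold.
    means = {m: totals[m] / counts[m] for m in counts if counts[m] >= min_ratings}
    # Highest mean first; ties are broken by the number of ratings.
    return sorted(means.items(), key=lambda kv: (kv[1], counts[kv[0]]), reverse=True)[:n]

# Toy data, invented for illustration: (userId, movieId, rating)
toy_ratings = (
    [(u, 1, 4.0) for u in range(3)]      # movie 1: three ratings of 4.0
    + [(0, 2, 5.0)]                      # movie 2: a single 5.0, filtered out
    + [(u, 3, 3.0) for u in range(3)]    # movie 3: three ratings of 3.0
)
ranking = top_rated(toy_ratings, min_ratings=3)
print(ranking)
```

With a threshold of three ratings, movie 2 is dropped even though its single rating is the highest, which is exactly the outlier behavior the threshold is meant to prevent.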

As you can see, the list contains critically acclaimed movies that many love. It is a great idea to include a system like this in a service so that users can take a look and see if they are interested in any of the movies listed.

Personalized Recommendations

Content-based method.

An example of a personalized recommender is a content-based method. The main idea behind content-based methods is that if you like an item, you will also like similar items. For example, if you have a favorite book, similar books will be recommended to you. The downside of this method is that it requires manual tagging of the items. People would have to go through books and movies and tag them not only by genre but also by common plotlines, themes, and settings. This is a monumental task that not all businesses can undertake. Since the MovieLens dataset already has genres tagged, I decided to use them in the final system to build a hybrid recommender.
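To make the "similar items" idea concrete, here is a minimal sketch that compares movies by the overlap of their genre tags (Jaccard similarity). It is a generic illustration of a content-based score rather than what the project does, since the project only uses genres as a filter. The function names and the toy catalogue are hypothetical, and the genre strings are assumed to be pipe-separated.

```python
def genre_set(genres):
    """Split a pipe-separated genre string into a set of tags."""
    return set(genres.split("|"))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(title, movies, n=2):
    """Return the n movies whose genre tags overlap most with `title`."""
    target = genre_set(movies[title])
    scores = [
        (other, jaccard(target, genre_set(genres)))
        for other, genres in movies.items()
        if other != title
    ]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:n]

# Toy catalogue, invented for illustration
toy_movies = {
    "A": "Adventure|Animation|Children|Comedy|Fantasy",
    "B": "Adventure|Children|Fantasy",
    "C": "Action|Crime|Thriller",
}
similar_to_a = most_similar("A", toy_movies)
print(similar_to_a)
```

Movies A and B share three of five distinct tags, so B scores 0.6, while C shares none and scores 0. Richer tags (themes, settings, plotlines) would slot into the same computation, which is why the tagging effort is the real cost of this approach.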

Collaborative Filtering Systems.

The idea behind collaborative filtering is that similar users share similar interests and that users tend to like items that are similar to one another. If user A enjoys items 1, 2, 3, and 4 and user B enjoys items 1, 2, and 3, then chances are that user B will also like item 4. This system calculates the similarity between items or between users. One of the problems with collaborative filtering is the “cold start problem”: with no information on a new user, it is impossible to make a recommendation for that person. Possible solutions include starting off by recommending the most popular items or asking the user to rate a few items.

Memory Based

There are two types of collaborative filtering systems. The first is a memory-based (or neighborhood-based) system. This can be either item-based or user-based, depending on whether items or users are compared. The difference between content-based and item-based collaborative filtering is that the item-based approach does not need tagging. Instead, the model looks at how many users two items have in common and how those users rated them.

In my case, I chose to go with the user-based, since the large number of items in our database would have made the computational time too large.
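The user-based idea can be sketched in plain Python as follows. This is a simplified illustration, not the Surprise implementation: the function names, the `k` parameter, and the toy ratings are assumptions made for the example. Similarity is computed only over movies both users have rated, and the prediction is a similarity-weighted average of the neighbors' ratings.

```python
from math import sqrt

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity over the items that both users have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_rating(user, item, ratings, k=2):
    """Similarity-weighted average of the k most similar users who rated `item`."""
    neighbours = [
        (cosine_similarity(ratings[user], other_ratings), other_ratings[item])
        for other, other_ratings in ratings.items()
        if other != user and item in other_ratings
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_weight = sum(sim for sim, _ in neighbours)
    if total_weight == 0:
        return None  # no usable neighbours: the cold start case
    return sum(sim * rating for sim, rating in neighbours) / total_weight

# Toy ratings, invented for illustration: user -> {movie id: rating}
toy = {
    "A": {1: 5, 2: 4, 3: 5, 4: 5},
    "B": {1: 5, 2: 4, 3: 5},
    "C": {1: 1, 2: 5, 3: 1, 4: 2},
}
prediction = predict_rating("B", 4, toy)
print(prediction)
```

User B agrees with user A on every shared movie, so A's rating of movie 4 dominates the prediction, while the less similar user C pulls it down somewhat. Comparing users this way needs one similarity per pair of users, which is why the choice between user-based and item-based depends on which of the two sets is smaller.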

For this project, I made extensive use of Surprise, a Python scikit for building recommender systems. It is installed as scikit-surprise and is a separate package from scikit-learn.

I built six memory-based models by pairing each of three algorithms with two similarity metrics, cosine similarity and Pearson correlation. The algorithms were the basic K-Nearest Neighbors (KNN) method; KNN with means, which accounts for each user's mean rating; and KNN baseline, which is more advanced because it adjusts for a baseline rating whose bias terms are fitted by minimizing a regularized cost function.
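For reference, the three predictors can be written out as follows. This is my reading of the formulas in the Surprise documentation for the user-based case, so treat the notation as an assumption: $N_i^k(u)$ is the set of the $k$ users most similar to $u$ who rated item $i$, $\mu_u$ is user $u$'s mean rating, and $b_{ui} = \mu + b_u + b_i$ is the baseline built from the global mean and the user and item biases.

```latex
% KNN basic: similarity-weighted average of the neighbours' ratings
\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\, r_{vi}}
                    {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}

% KNN with means: the same average, applied to deviations from each user's mean
\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - \mu_v)}
                            {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}

% KNN baseline: deviations are taken from the baseline estimate instead
\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)\,(r_{vi} - b_{vi})}
                             {\sum_{v \in N_i^k(u)} \mathrm{sim}(u, v)}
```

The only difference between the three is what gets subtracted from a neighbor's rating before averaging: nothing, the neighbor's own mean, or the baseline estimate.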

import numpy as np
from surprise import KNNBasic, KNNWithMeans, KNNBaseline
from surprise.model_selection import cross_validate
# `data` is the Surprise Dataset holding the MovieLens ratings, loaded earlier
# KNN basic using Pearson correlation as the similarity metric
knn_basic_pearson = KNNBasic(sim_options={'name': 'pearson', 'user_based': True})
cv_knn_basic_pearson = cross_validate(knn_basic_pearson, data, cv=5, n_jobs=-1)
print(np.mean(cv_knn_basic_pearson['test_rmse']))
# KNN with means using cosine similarity
knn_with_means_cosine = KNNWithMeans(sim_options={'name': 'cosine', 'user_based': True})
cv_knn_with_means_cosine = cross_validate(knn_with_means_cosine, data, cv=5, n_jobs=-1)
print(np.mean(cv_knn_with_means_cosine['test_rmse']))
# KNN baseline using cosine similarity
knn_baseline_cosine = KNNBaseline(sim_options={'name': 'cosine', 'user_based': True})
cv_knn_baseline_cosine = cross_validate(knn_baseline_cosine, data, cv=5, n_jobs=-1)
print(np.mean(cv_knn_baseline_cosine['test_rmse']))

The Results

As you can see on the left, KNN with means performed much better than basic KNN, and KNN baseline performed better still. KNN baseline with Pearson correlation appears to be the best memory-based model.

Model-Based Collaborative Filtering (Matrix Factorization Model)

Matrix factorization models build on the concept of a latent variable model. Latent variable models explain complex relationships between observed variables through simpler relationships with unobservable, underlying “latent” variables. These models cope well with the high sparsity of recommendation system datasets, which arises because not every user buys and rates every product.

The pros of model-based collaborative filtering are that it scales well to larger datasets and does not require complete input data.

The idea behind these models is that user preferences can be explained by hidden factors, called embeddings, that are obtained through matrix factorization. For a movie, these embeddings might capture how recently it was released, how action-driven it is, or how dialogue-heavy it is. For a user, they might capture how much the user likes romantic movies, and so on.
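A minimal sketch of how such embeddings can be learned is shown below. It uses plain stochastic gradient descent on the observed ratings, with the predicted rating being the dot product of a user vector and a movie vector. This is an illustration only: it omits the bias terms that library implementations typically add, and the function names, hyperparameter values, and toy ratings are all assumptions.

```python
import random
from math import sqrt

def predict(p_u, q_i):
    """Predicted rating: dot product of the user and item embeddings."""
    return sum(pf * qf for pf, qf in zip(p_u, q_i))

def rmse(ratings, p, q):
    """Root mean squared error over the observed ratings."""
    return sqrt(sum((r - predict(p[u], q[i])) ** 2 for u, i, r in ratings) / len(ratings))

def train_mf(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=300, seed=0):
    """Learn user embeddings p and item embeddings q by SGD on observed ratings only."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(p[u], q[i])
            for f in range(n_factors):
                p_uf, q_if = p[u][f], q[i][f]
                # Gradient step on the squared error plus L2 regularization.
                p[u][f] += lr * (err * q_if - reg * p_uf)
                q[i][f] += lr * (err * p_uf - reg * q_if)
    return p, q

# Toy ratings, invented for illustration: (user, movie, rating)
toy_ratings = [
    ("ann", "m1", 5.0), ("ann", "m2", 4.0), ("ann", "m3", 1.0),
    ("bob", "m1", 4.0), ("bob", "m2", 5.0), ("bob", "m4", 2.0),
    ("cat", "m3", 5.0), ("cat", "m4", 4.0), ("cat", "m1", 1.0),
]
p0, q0 = train_mf(toy_ratings, n_epochs=0)   # untrained embeddings
p, q = train_mf(toy_ratings)                 # trained embeddings
print(rmse(toy_ratings, p0, q0), rmse(toy_ratings, p, q))
```

Because the loop only visits ratings that exist, missing entries in the user-movie matrix never need to be filled in, which is why sparsity is not an obstacle. Once trained, `predict(p["ann"], q["m4"])` gives an estimate for a movie that user has not rated.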

I used two different matrix factorization techniques. The first is SVD (singular value decomposition) and the second is ALS (alternating least squares). SVD is a commonly used method that works with all matrices, which makes it very stable. ALS is less computationally efficient than SVD but can be more robust and effective when dealing with large sparse matrices. ALS allows us to set a regularization parameter when minimizing the loss function to find the best parameters.
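The regularized loss that ALS minimizes can be written as below. This is the standard textbook form, given here as a sketch: the notation ($K$ for the set of observed ratings, $p_u$ and $q_i$ for the user and movie vectors, $\lambda$ for the regularization strength) is an assumption, and library implementations may scale the penalty differently.

```latex
\min_{P, Q} \sum_{(u, i) \in K} \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \sum_u \lVert p_u \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)

% With the item vectors held fixed, each user vector has a closed-form solution,
% where Q_u stacks the vectors of the items rated by u and r_u holds those ratings:
p_u = \left( Q_u^{\top} Q_u + \lambda I \right)^{-1} Q_u^{\top} r_u
```

ALS alternates between this update for the user vectors and the symmetric update for the item vectors. Each step is an ordinary regularized least-squares problem, and the updates for different users are independent of one another, which is what makes the method easy to parallelize.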

I used the Surprise library for SVD and PySpark for ALS. In both cases, I used grid search with cross-validation to find the best parameters for the model.

SVD Code

from surprise import SVD
from surprise.model_selection import GridSearchCV
# The parameter grid I used
params = {'n_factors': [20, 50, 100],
          'lr_all': [0.002, 0.005],
          'n_epochs': [5, 10],
          'reg_all': [0.02, 0.05, 0.1]}
# Build the SVD system using grid search
g_s_svd = GridSearchCV(SVD, param_grid=params, n_jobs=-1, cv=5)
g_s_svd.fit(data)
print(g_s_svd.best_score)
print(g_s_svd.best_params)
# OUTPUT
# {'rmse': 0.8779933149853167, 'mae': 0.6768786842561674}
# {'rmse': {'n_factors': 20, 'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.05}, 'mae': {'n_factors': 20, 'lr_all': 0.005, 'n_epochs': 10, 'reg_all': 0.05}}

ALS Code

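from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
# Read the data into a Spark DataFrame
movie_ratings = spark.read.csv('ratings.csv', header=True, inferSchema=True)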
# Drop the unnecessary column
movie_ratings = movie_ratings.drop('timestamp')
#Split the dataset into training and testing sets
(training, test) = movie_ratings.randomSplit([0.8, 0.2], seed=1254)
# Initialize the ALS model
als_model = ALS(userCol='userId', itemCol='movieId', ratingCol='rating', coldStartStrategy='drop')
# Create the parameter grid
params = ParamGridBuilder() \
    .addGrid(als_model.rank, [4, 10, 50, 100, 150]) \
    .addGrid(als_model.regParam, [.01, .05, .1, .15]) \
    .build()
# Build the evaluator to be used
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
# Instantiate the cross-validator estimator and then fit it to the training data
cv = CrossValidator(estimator=als_model, estimatorParamMaps=params, evaluator=evaluator, parallelism=4, numFolds=5)
best_model = cv.fit(training)
# Score the held-out test set with the best model found by cross-validation
predictions = best_model.transform(test)
evaluator.evaluate(predictions)

The Final Results

The scores of the matrix factorization models and KNN baseline were so close that their ranking changed with the random draw for the test set. Since model-based methods scale better with larger datasets, I decided to go with the ALS method.

The Final Recommender (Hybrid Recommender)

After settling on a model, I built several functions that together act as my recommender. The first function gives the user random movies to rate, to better populate the sparse dataset. It returns a pandas dataframe.

New ratings being given for user 200

The second function takes in the dataframe from the movie_rater function and formats it into a form that Spark can read. This function is used in my final recommender function.

The third function returns movie titles when given the movie id. This function makes the recommendations interpretable without having to look up movie ids. This function is also used in the final recommender.

The final function serves as my recommender. It takes six parameters but only requires a user id to return recommendations. New user ratings can be added to the dataset before the model is built, a different dataset can be used, and a different dataframe containing movie titles can be used.

By default, the recommender returns the five movies with the highest predicted ratings for the user, but this can be changed using the “num_recs” parameter. A genre can also be specified, so the recommender ultimately serves as a hybrid that combines model-based and content-based methods.
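The hybrid step can be sketched as a simple filter applied on top of the model's predicted ratings. This is a hypothetical stand-in rather than the project's function: apart from `num_recs`, which the text above names, the function name, the other parameters, and the toy data are assumptions, and the predicted ratings are assumed to already be available as a dictionary.

```python
def recommend(predicted, genres_by_movie, already_rated, num_recs=5, genre=None):
    """Return the top `num_recs` unseen movies, optionally restricted to one genre."""
    candidates = []
    for movie_id, score in predicted.items():
        if movie_id in already_rated:
            continue  # never recommend something the user has already rated
        tags = genres_by_movie.get(movie_id, "").split("|")
        if genre is not None and genre not in tags:
            continue  # content-based filter on the genre tags
        candidates.append((movie_id, score))
    # Highest predicted rating first.
    return sorted(candidates, key=lambda kv: kv[1], reverse=True)[:num_recs]

# Toy inputs, invented for illustration
predicted = {1: 4.8, 2: 4.5, 3: 4.1, 4: 3.9}   # movie id -> predicted rating
genres_by_movie = {1: "Comedy|Romance", 2: "Action", 3: "Comedy", 4: "Comedy|Drama"}
already_rated = {1}

comedy_recs = recommend(predicted, genres_by_movie, already_rated, num_recs=2, genre="Comedy")
print(comedy_recs)
```

Movie 1 is skipped because the user has already rated it, and movie 2 is skipped because it is not a comedy, so the two remaining comedies are returned in order of predicted rating. Without the `genre` argument, the function falls back to a purely model-based ranking.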

Below is an example of the recommender in action for user_id 200.

The new ratings given above are added to the dataset before the recommendations are calculated.
