A Collaborative Filtering Recommendation System in Java

1. Introduction

In this tutorial, we’ll learn all about the Slope One algorithm in Java.

We’ll also show an example implementation for the problem of Collaborative Filtering (CF) – a machine learning technique used by recommendation systems.

This can be used, for example, to predict user interest in specific items.

2. Collaborative Filtering

The Slope One algorithm is an item-based collaborative filtering system. This means it is based entirely on user-item rankings: when we compute the similarity between objects, we only know the history of rankings, not the content itself. This similarity is then used to predict potential user rankings for user-item pairs not present in the dataset.

The image below shows the complete process of obtaining and calculating the rating for a specific user:

At first, users rate different items in the system. Next, the algorithm calculates the similarities. After that, the system makes predictions for the user-item pairs that the user hasn’t rated yet.

For more details on the topic of collaborative filtering, we can refer to the Wikipedia article.

3. The Slope One Algorithm

Slope One has been described as the simplest form of non-trivial item-based collaborative filtering based on ratings. To calculate the similarity matrix, it takes into account both the information from all users who rated the same item and from the other items rated by the same user.
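Before diving into the implementation, a tiny worked example may help illustrate the core idea (the item names and ratings below are made up for illustration): the average rating difference between two items becomes the offset for a prediction.

```java
// Hypothetical two-item example: users A and B rated both Candy and Drink,
// while user C rated only Candy. We predict C's rating for Drink.
public class SlopeOneIdea {

    // Average difference (deviation) of item2's ratings relative to item1's,
    // taken over the users who rated both items
    static double averageDeviation(double[] item1Ratings, double[] item2Ratings) {
        double sum = 0.0;
        for (int i = 0; i < item1Ratings.length; i++) {
            sum += item2Ratings[i] - item1Ratings[i];
        }
        return sum / item1Ratings.length;
    }

    public static void main(String[] args) {
        double[] candy = {1.0, 0.5};  // ratings by users A and B
        double[] drink = {0.5, 0.0};  // ratings by the same two users
        double dev = averageDeviation(candy, drink); // ((0.5-1.0) + (0.0-0.5)) / 2 = -0.5
        double cCandy = 1.0;                         // user C rated only Candy
        double predictedDrink = cCandy + dev;        // 1.0 + (-0.5) = 0.5
        System.out.println("Predicted Drink rating for C: " + predictedDrink);
    }
}
```

On average, Drink is rated 0.5 lower than Candy, so C’s known Candy rating of 1.0 yields a predicted Drink rating of 0.5. The rest of the tutorial generalizes exactly this idea to many items and many users.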

In our simple example, we are going to predict user rankings on the
items in the store.

Let’s start with a simple Java model for our problem and domain.

3.1. Java Model

In our model we have two main objects – items and users. The Item
class contains the name of the item:

private String itemName;

On the other hand, the User class contains the username:

private String username;

Finally, we have an InputData class that will be used to initialize the data. Let’s assume that we’ll create five different products in the store:

List<Item> items = Arrays.asList(
  new Item("Candy"),
  new Item("Drink"),
  new Item("Soda"),
  new Item("Popcorn"),
  new Item("Snacks")
);

Moreover, we’ll create three users who randomly rated some of the above-mentioned items on a scale from 0.0 to 1.0, where 0 means no interest, 0.5 means somewhat interested, and 1.0 means totally interested. As a result of the data initialization, we’ll get a Map with user item-ranking data:

Map<User, HashMap<Item, Double>> data;
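The initialization code itself isn’t listed here; the sketch below shows what InputData could roughly look like. The initializeData method name, the minimal Item/User classes, and the random rating scheme are assumptions for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal model classes, reduced to the fields mentioned above
class Item {
    private String itemName;
    Item(String itemName) { this.itemName = itemName; }
}

class User {
    private String username;
    User(String username) { this.username = username; }
}

public class InputData {

    static List<Item> items = Arrays.asList(
      new Item("Candy"), new Item("Drink"), new Item("Soda"),
      new Item("Popcorn"), new Item("Snacks"));

    // Each user rates a random subset of the items with 0.0, 0.5 or 1.0
    static Map<User, HashMap<Item, Double>> initializeData(int numberOfUsers) {
        Map<User, HashMap<Item, Double>> data = new HashMap<>();
        Random random = new Random();
        for (int i = 0; i < numberOfUsers; i++) {
            HashMap<Item, Double> newUser = new HashMap<>();
            for (Item item : items) {
                if (random.nextBoolean()) {                     // rate roughly half the items
                    newUser.put(item, random.nextInt(3) / 2.0); // 0.0, 0.5 or 1.0
                }
            }
            data.put(new User("User " + i), newUser);
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(initializeData(3).size() + " users created");
    }
}
```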

3.2. Differences and Frequencies Matrices

Based on the available data, we’ll calculate the relationships between the items, as well as the number of item occurrences. For each user, we check his/her ratings of the items:

for (HashMap<Item, Double> user : data.values()) {
    for (Entry<Item, Double> e : user.entrySet()) {
        // ...
    }
}

In the next step, we check whether the item already exists in our matrices. If this is its first occurrence, we create a new entry in the maps:

if (!diff.containsKey(e.getKey())) {
    diff.put(e.getKey(), new HashMap<Item, Double>());
    freq.put(e.getKey(), new HashMap<Item, Integer>());
}

The first matrix is used to calculate the differences between the user ratings. Its values might be positive or negative (since the difference between ratings might be negative) and are stored as Double. On the other hand, the frequencies are stored as Integer values.

In the next step we are going to compare the ratings of all items:

for (Entry<Item, Double> e2 : user.entrySet()) {
    int oldCount = 0;
    if (freq.get(e.getKey()).containsKey(e2.getKey())){
        oldCount = freq.get(e.getKey()).get(e2.getKey()).intValue();
    }

    double oldDiff = 0.0;
    if (diff.get(e.getKey()).containsKey(e2.getKey())){
        oldDiff = diff.get(e.getKey()).get(e2.getKey()).doubleValue();
    }

    double observedDiff = e.getValue() - e2.getValue();
    freq.get(e.getKey()).put(e2.getKey(), oldCount + 1);
    diff.get(e.getKey()).put(e2.getKey(), oldDiff + observedDiff);
}

If somebody has rated the item before, we increase the frequency count by one. Moreover, we calculate the new observedDiff between the two items’ ratings and add it to the running total.

Please note that we put the sum of oldDiff and observedDiff as the new value of an item.

Finally, we calculate the similarity scores inside the matrices:

for (Item j : diff.keySet()) {
    for (Item i : diff.get(j).keySet()) {
        double oldValue = diff.get(j).get(i).doubleValue();
        int count = freq.get(j).get(i).intValue();
        diff.get(j).put(i, oldValue / count);
    }
}

The main logic is to divide each accumulated rating difference by the number of its occurrences. After that step, we can print out our final differences matrix.
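As a sanity check, here is a self-contained variant of the loops above run on a hand-made two-user dataset. It is not the tutorial’s exact code: it uses plain String keys instead of the Item model, and getOrDefault in place of the containsKey checks, but the logic is the same:

```java
import java.util.HashMap;
import java.util.Map;

public class MatrixDemo {

    static Map<String, HashMap<String, Double>> diff = new HashMap<>();
    static Map<String, HashMap<String, Integer>> freq = new HashMap<>();

    // Builds the differences (summed, then averaged) and frequencies matrices
    static void buildMatrices(Map<String, HashMap<String, Double>> data) {
        for (HashMap<String, Double> user : data.values()) {
            for (Map.Entry<String, Double> e : user.entrySet()) {
                diff.putIfAbsent(e.getKey(), new HashMap<>());
                freq.putIfAbsent(e.getKey(), new HashMap<>());
                for (Map.Entry<String, Double> e2 : user.entrySet()) {
                    int oldCount = freq.get(e.getKey()).getOrDefault(e2.getKey(), 0);
                    double oldDiff = diff.get(e.getKey()).getOrDefault(e2.getKey(), 0.0);
                    freq.get(e.getKey()).put(e2.getKey(), oldCount + 1);
                    diff.get(e.getKey()).put(e2.getKey(),
                      oldDiff + e.getValue() - e2.getValue());
                }
            }
        }
        // average each summed difference by its frequency
        for (String j : diff.keySet()) {
            for (String i : diff.get(j).keySet()) {
                diff.get(j).put(i, diff.get(j).get(i) / freq.get(j).get(i));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, HashMap<String, Double>> data = new HashMap<>();
        HashMap<String, Double> alice = new HashMap<>();
        alice.put("Candy", 1.0);
        alice.put("Drink", 0.5);
        HashMap<String, Double> bob = new HashMap<>();
        bob.put("Candy", 0.5);
        bob.put("Drink", 0.0);
        data.put("Alice", alice);
        data.put("Bob", bob);

        buildMatrices(data);
        // Both users rated Candy 0.5 higher than Drink, so the average
        // difference Candy -> Drink is 0.5, based on 2 co-ratings:
        System.out.println(diff.get("Candy").get("Drink")); // 0.5
    }
}
```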

3.3. Predictions

As the main part of Slope One, we are going to predict all missing ratings based on the existing data. In order to do that, we need to compare the user-item ratings with the differences matrix calculated in the previous step:

for (Entry<User, HashMap<Item, Double>> e : data.entrySet()) {
    for (Item j : e.getValue().keySet()) {
        for (Item k : diff.keySet()) {
            double predictedValue =
              diff.get(k).get(j).doubleValue() + e.getValue().get(j).doubleValue();
            double finalValue = predictedValue * freq.get(k).get(j).intValue();
            uPred.put(k, uPred.get(k) + finalValue);
            uFreq.put(k, uFreq.get(k) + freq.get(k).get(j).intValue());
        }
    }
    // ...
}

After that, we need to prepare the “clean” predictions using the code
below:

HashMap<Item, Double> clean = new HashMap<Item, Double>();
for (Item j : uPred.keySet()) {
    if (uFreq.get(j) > 0) {
        clean.put(j, uPred.get(j).doubleValue() / uFreq.get(j).intValue());
    }
}
for (Item j : InputData.items) {
    if (e.getValue().containsKey(j)) {
        clean.put(j, e.getValue().get(j));
    } else {
        clean.put(j, -1.0);
    }
}

The trick to consider with larger data sets is to use only the item entries that have a large frequency value (for example, > 1). Please note that if a prediction is not possible, its value will be equal to -1.

Finally, a very important note: if our algorithm works correctly, we should receive not only the predictions for items the user didn’t rate, but also the repeated ratings for the items that he did rate. Those repeated ratings should not change; otherwise, it means there is a bug in the algorithm implementation.
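To see the weighted prediction step in isolation, here is a simplified, self-contained sketch. The String keys and hand-filled matrices are assumptions for illustration; the real implementation uses the Item model and the matrices built in the previous step:

```java
import java.util.HashMap;
import java.util.Map;

public class PredictionDemo {

    // Weighted Slope One prediction for a single unrated item: each known
    // rating is shifted by the average difference and weighted by how many
    // co-ratings that difference is based on.
    static double predict(String target,
                          Map<String, Double> userRatings,
                          Map<String, Map<String, Double>> diff,
                          Map<String, Map<String, Integer>> freq) {
        double weightedSum = 0.0;
        int totalFreq = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            Double deviation = diff.get(target).get(rated.getKey());
            Integer count = freq.get(target).get(rated.getKey());
            if (deviation != null && count != null) {
                weightedSum += (deviation + rated.getValue()) * count;
                totalFreq += count;
            }
        }
        return totalFreq > 0 ? weightedSum / totalFreq : -1.0; // -1.0: no prediction possible
    }

    static double demo() {
        // Hand-filled matrices: Drink is rated 0.5 lower than Candy on
        // average, based on two co-ratings.
        Map<String, Map<String, Double>> diff = new HashMap<>();
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        diff.put("Drink", new HashMap<>());
        freq.put("Drink", new HashMap<>());
        diff.get("Drink").put("Candy", -0.5);
        freq.get("Drink").put("Candy", 2);

        Map<String, Double> carol = new HashMap<>();
        carol.put("Candy", 1.0); // Carol rated only Candy

        return predict("Drink", carol, diff, freq);
    }

    public static void main(String[] args) {
        System.out.println("Predicted Drink rating: " + demo()); // 0.5
    }
}
```

Carol’s Candy rating of 1.0, shifted by the -0.5 deviation, gives a predicted Drink rating of 0.5 – the same arithmetic the full implementation performs across all item pairs.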

3.4. Tips

There are a few major factors that affect the Slope One algorithm. Here are a few tips on how to increase accuracy and reduce processing time:

  • consider obtaining the user-item ratings on the DB side for large data
    sets

  • set a time frame for fetching ratings, as people’s interests might
    change over time – this will also reduce the time required to process
    the input data

  • split large data sets into smaller ones – you don’t need to calculate
    predictions for all users every day; you can check whether a user
    interacted with a predicted item and then add/remove him/her from the
    processing queue for the next day

4. Conclusion

In this tutorial, we learned about the Slope One algorithm. Moreover, we introduced the collaborative filtering problem for item recommendation systems.

The full implementation of this tutorial can be found in the GitHub project.
