UPDATE 15th July 2022
I recently presented this project at the Neo4j meetup on 12th July. I’ve uploaded the slides to the GitHub repository here.
Intro
For anyone still reading, well done for making it this far. We’ll try to tie all the previous posts together in this one and create the recommendations we’ll use for the Kaggle entry.
- Part 1 - loading the data and EDA
- Part 2 - image based recommendations
- Part 3 - collaborative filtering with implicit feedback datasets
The Recommendation Process
So just to recap the data we have available for making our recommendations:
- A graph model of all users and purchases including product metadata.
- Image embeddings for almost all of the products; from these we previously calculated the 10 most similar items for each product and wrote them back to our Neo4j database.
- A matrix factorization based recommendation system built from implicit feedback using Alternating Least Squares, courtesy of the Python package implicit.
The task now is to combine these pieces to create the most insightful and useful recommendations possible. Some of the challenges with each of them are:
- Image embeddings are powerful, however we may often end up selecting items that are almost exactly the same as an item someone has previously purchased, or an item that looks similar but is actually not suitable for the customer (e.g. a men's shirt when somebody is looking for a blouse).
- The ALS method is also very useful, however one of the challenges is that when training the model we decrease the sparsity of the dataset by filtering out customers and items that fall below a certain threshold, so not every customer or item is included in the model.
So here are the steps we’ll go through when creating the recommendations with the features we’ve created, each of which I’ll go into in more detail in the following sections:
- For each customer, find all of their purchases and then retrieve the most similar products based on their images.
- Use a Cypher query to filter this list of similar items to only include products from a department and index that the customer has previously purchased from - this avoids recommending men’s jeans to women and vice versa, although it does reduce the number of candidate items.
- Now we have a list of potential items for each customer, use the ALS model to rank this list and take the top 12 items from the ranked list.
- For those who have fewer than 12 recommended items (at this point it’s most likely because the items were filtered out during the training of the ALS model), take a random sample of the items returned in step 2.
- Finally, for those who have no recommendations at all through the above process, use a Cypher query to return the 12 most popular products purchased by other users who also purchased the same product as the customer.
So let’s get started…
Getting the process started
Let’s begin with some of the groundwork.
First we need to create an empty sparse matrix of all users and products, and mappings from the customer_id and product_id to their relevant indices in the sparse matrix.
import pandas as pd
import numpy as np
from neo4j import GraphDatabase
from neo4j.exceptions import ServiceUnavailable
from scipy import sparse
import datetime as dt
import logging
import sys

uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "password"))

# Get all customer IDs and product IDs to create the sparse matrix
with driver.session() as session:
    query = """
        MATCH (c:Customer)
        RETURN c.id as id
    """
    results = session.run(query)
    customers = [record['id'] for record in results]

    query = """
        MATCH (p:Product)
        RETURN p.code as id
    """
    results = session.run(query)
    products = [record['id'] for record in results]

# Create mappings for the sparse matrix
all_product_to_idx = {}
all_idx_to_product = {}
for (idx, prod_id) in enumerate(products):
    all_product_to_idx[prod_id] = idx
    all_idx_to_product[idx] = prod_id

all_customer_to_idx = {}
all_idx_to_customer = {}
for (idx, cust_id) in enumerate(customers):
    all_customer_to_idx[cust_id] = idx
    all_idx_to_customer[idx] = cust_id

# Create lil matrix to incrementally build sparse matrix of recs
n_users = len(customers)
n_products = len(products)
rec_matrix = sparse.lil_matrix((n_users, n_products))
I’ve found sparse.lil_matrix incredibly useful in this process for assigning values to a sparse matrix incrementally. It can then be converted to a csr matrix later in the process for more efficient operations.
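As a quick toy illustration of that pattern (standalone values, nothing to do with the real data):

from scipy import sparse

# Cheap incremental writes with lil_matrix...
m = sparse.lil_matrix((3, 5))
m[0, [1, 3]] = 1
m[2, 4] = 1

# ...then a one-off conversion to csr for fast row slicing and arithmetic
m = m.tocsr()
print(m.toarray())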
Creating a Neo4j Class
Next we need to create a class to connect to our Neo4j database and retrieve our recommendations.
class NeoUserRecommender:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        # Don't forget to close the driver connection when you are finished with it
        self.driver.close()

    @staticmethod
    def enable_log(level, output_stream):
        handler = logging.StreamHandler(output_stream)
        handler.setLevel(level)
        logging.getLogger("neo4j").addHandler(handler)
        logging.getLogger("neo4j").setLevel(level)

    def get_recommended_items_for_user(self, customer_id):
        with self.driver.session(database='fabric') as session:
            result = session.read_transaction(
                self._get_recommended_items_for_user, customer_id)
            return result[0]

    @staticmethod
    def _get_recommended_items_for_user(tx, customer_id):
        query = ("""
            CALL {
                // Return all purchases for a single customer
                USE fabric.neo4j
                MATCH (c:Customer)-[pur:PURCHASED]->(p:Product)
                WHERE c.id = $customer_id
                RETURN p.code as prod_code
            }
            CALL {
                // For each of those purchases, return the 10 most similar
                // products by image embedding
                USE fabric.products
                WITH prod_code
                MATCH (p1:Product)-[:SIMILAR]->(rec:Product)
                WHERE p1.code = prod_code
                RETURN rec.code as rec_code
            }
            CALL {
                // For each of these recommendations, filter out the ones
                // that don't come from a department or index previously
                // purchased from by the customer
                USE fabric.neo4j
                WITH rec_code
                MATCH (c:Customer)
                WHERE c.id = $customer_id
                MATCH (ind:Index)<-[:HAS_INDEX]-(rec:Product)-[:FROM_DEPARTMENT]->(d:Department)
                WHERE rec.code = rec_code
                AND EXISTS ((c)-[:PURCHASED]->(:Product)-[:FROM_DEPARTMENT]->(d))
                AND EXISTS ((c)-[:PURCHASED]->(:Product)-[:HAS_INDEX]->(ind))
                RETURN rec.code AS recommended_item
            }
            RETURN collect(recommended_item) as recommended_items
        """)
        result = tx.run(query, customer_id=customer_id)
        try:
            return [row['recommended_items'] for row in result]
        # Capture any errors along with the query and data for traceability
        except ServiceUnavailable as exception:
            logging.error(f"{query} raised an error: \n {exception}")
            raise

    def get_cypher_recs_for_user(self, customer_id):
        with self.driver.session(database='neo4j') as session:
            result = session.read_transaction(
                self._get_cypher_recs_for_user, customer_id)
            return result[0]

    @staticmethod
    def _get_cypher_recs_for_user(tx, customer_id):
        query = """
            MATCH (c:Customer {id: $customer_id})-[:PURCHASED]->(p:Product)<-[:PURCHASED]-(:Customer)-[:PURCHASED]->(rec:Product)
            WHERE id(p) <> id(rec)
            AND NOT EXISTS ((c)-[:PURCHASED]->(rec))
            WITH c.id as customer_id, rec, COUNT(rec) as score ORDER BY COUNT(rec) DESC LIMIT 12
            RETURN collect(rec.code) as recommended_items
        """
        result = tx.run(query, customer_id=customer_id)
        try:
            return [row['recommended_items'] for row in result]
        # Capture any errors along with the query and data for traceability
        except ServiceUnavailable as exception:
            logging.error(f"{query} raised an error: \n {exception}")
            raise
The first transaction function in the class, _get_recommended_items_for_user, finds all similar items for a user based on their purchases, filters out items which aren’t from departments or product indexes the customer has previously purchased from, and returns a list of these items. _get_cypher_recs_for_user is the fallback used when no similar items survive the above criteria; it finds the most popular purchases of other users who purchased a product that our original customer purchased.
Now we’ve done some of the groundwork, we’ll write all these items to our currently empty user/item sparse matrix. This took a while…
neo_recs = NeoUserRecommender(uri, 'neo4j', 'password')

# For each customer, get recommendations and map them to the sparse matrix
for i, cust in enumerate(customers):
    customer_index = all_customer_to_idx[cust]
    recs = neo_recs.get_recommended_items_for_user(cust)
    rec_indices = [all_product_to_idx[rec] for rec in recs]
    rec_matrix[customer_index, rec_indices] = 1
    if i % 10000 == 0:
        print(f'{i} customers processed')

rec_matrix = rec_matrix.tocsr()
sparse.save_npz('rec_matrix.npz', rec_matrix)
Train the ALS model
We’ll train the ALS model exactly as we did in the previous post, using the parameters that scored best on the test set, i.e. giving more weight to more recent purchases.
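The training itself isn’t repeated here, but roughly it looks like the sketch below. The hyperparameters are placeholders, and purchases_with_confidence is assumed to be the confidence-weighted user/item matrix from the previous post (with recent purchases up-weighted):

from implicit.als import AlternatingLeastSquares

# Placeholder hyperparameters - use whichever combination scored best previously
als = AlternatingLeastSquares(factors=64, regularization=0.01, iterations=15)

# purchases_with_confidence: confidence-weighted user/item csr matrix from the
# previous post. Older versions of implicit (which expose rank_items, used below)
# expect an item/user matrix in fit(), hence the transpose here.
als.fit(purchases_with_confidence.T.tocsr())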
So we now have two sets of mappings from a customer_id and product_id to an index in a sparse matrix:
- Mappings starting with all_ reference the overall user/item matrix
- Mappings starting with als_ reference the ALS user/item matrix, which doesn’t contain all of the customers and products (a sketch of how these might look follows below)
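The als_ mappings were built in the previous post, but for context they might be derived something like this, assuming the filtered purchases used for training sit in a dataframe called als_purchases with customer_id and product_code columns (both placeholder names). The ALS data held product codes as integers, which is why there’s some int/str juggling later on.

# Hypothetical dataframe of the filtered purchases the ALS model was trained on
als_customers = als_purchases['customer_id'].unique()
als_products = als_purchases['product_code'].unique()

als_customer_to_idx = {cust_id: idx for idx, cust_id in enumerate(als_customers)}
als_idx_to_customer = {idx: cust_id for idx, cust_id in enumerate(als_customers)}
als_product_to_idx = {prod_id: idx for idx, prod_id in enumerate(als_products)}
als_idx_to_product = {idx: prod_id for idx, prod_id in enumerate(als_products)}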
Generating The Recommendations
The code below goes over the customers one by one and follows the process outlined above; most of the steps are explained in the comments.
# Create a matrix to hold the recommendations
final_recommendations = sparse.lil_matrix((n_users, n_products))

for i, cust in enumerate(customers):
    # Convert the customer_id to its index in the overall user/item matrix
    all_user_id = all_customer_to_idx[cust]
    # Get all the candidate items identified by image similarity and previously written to the user/item matrix
    neo_user_recs = rec_matrix[all_user_id].indices
    # Convert these matrix indices back to their product codes
    neo_recs_product_id = [all_idx_to_product[rec] for rec in neo_user_recs]
    # The dataset used to train the ALS model was filtered by number of purchases and sales,
    # so first check that the customer was part of the training set
    if cust in als_customer_to_idx and len(neo_recs_product_id) > 0:
        # Identify the customer's index in the ALS user/item matrix
        als_user_id = als_customer_to_idx[cust]
        # Identify the product indices in the ALS user/item matrix
        als_prod_ids = [als_product_to_idx[int(rec)] for rec in neo_recs_product_id
                        if int(rec) in als_product_to_idx]
        # Providing at least one of the items was part of the ALS training data,
        # rank the items and take the top 12
        if len(als_prod_ids) > 0:
            ranked_recs = als.rank_items(als_user_id, purchases_with_confidence, selected_items=als_prod_ids)
            ranked_recs = [x[0] for x in ranked_recs[:12]]
            ranked_recs_prod_ids = [als_idx_to_product[x] for x in ranked_recs]
            # Slight issue with str/int conversion and losing the leading 0
            ranked_recs_all_prod_ids = [all_product_to_idx[f'0{x}'] for x in ranked_recs_prod_ids]
            final_recommendations[all_user_id, ranked_recs_all_prod_ids] = 1
    # If we didn't find at least 12 recs, find the extra recs to add
    if final_recommendations.getrow(all_user_id).count_nonzero() < 12:
        num_to_insert = 12 - final_recommendations.getrow(all_user_id).count_nonzero()
        if len(rec_matrix[all_user_id].indices) == 0:
            continue
        # Sample without replacement so the same item isn't picked twice
        candidate_pool = rec_matrix[all_user_id].indices
        extra_recs = np.random.choice(candidate_pool,
                                      size=min(num_to_insert, len(candidate_pool)),
                                      replace=False)
        final_recommendations[all_user_id, extra_recs] = 1
    if i % 10000 == 0:
        print(f'{i} customers processed')
        print(f'{n_users - i} customers remaining')
For those who didn’t have 12 recommendations from the ALS method, I think taking a random sample of the recommendations based on image similarity is a reasonable way to create the extra recommendations, given that these recs have already been filtered down to more suitable candidates by the Cypher query we set up earlier. So although it’s not perfect, it’s a reasonable mechanism in this instance.
Finally, for any customers who had no recommendations at all after the previous steps, our last method is to use a pure Cypher query to get the recs.
First though we need to find out which customers these were. The best way I could find when dealing with a sparse matrix was following this explanation. Essentially we need to find any rows where the indptr doesn’t advance, i.e. rows with no stored entries (check out the scipy.sparse docs for a deeper explanation).
final_recommendations = final_recommendations.tocsr()
still_to_recommend = np.where(np.diff(final_recommendations.indptr) == 0)[0]
To break it down: if we look at final_recommendations.indptr, we’ll see something like [0,12,24,36,36,48...]. If we then use np.diff, we get the difference between consecutive points in this array. Wherever this difference equals zero, the row has no entries in our recommendation matrix, so these are the users for which we still need to create some recommendations.
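A tiny standalone example (a toy matrix, nothing to do with the real data) shows the same trick:

import numpy as np
from scipy import sparse

toy = sparse.csr_matrix(np.array([
    [1, 0, 1],   # row 0: two entries
    [0, 0, 0],   # row 1: empty
    [0, 1, 0],   # row 2: one entry
]))

print(toy.indptr)                             # [0 2 2 3]
print(np.diff(toy.indptr))                    # [2 0 1]
print(np.where(np.diff(toy.indptr) == 0)[0])  # [1] - the empty row

With those customers identified, the Cypher-based recommendations were generated as follows: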
final_recommendations = final_recommendations.tolil()

for i, cust in enumerate(still_to_recommend):
    cust_id = all_idx_to_customer[cust]
    cypher_recs = neo_recs.get_cypher_recs_for_user(cust_id)
    recs = [all_product_to_idx[x] for x in cypher_recs]
    final_recommendations[cust, recs] = 1
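To actually use these, the product codes can be pulled back out of the final matrix with the mappings defined earlier. A minimal sketch (customer_recs is just a placeholder name):

# Convert back to csr for efficient row access
final_recommendations = final_recommendations.tocsr()

# Map each row back to a customer_id and its recommended product codes
customer_recs = {
    all_idx_to_customer[row]: [all_idx_to_product[idx] for idx in final_recommendations[row].indices]
    for row in range(n_users)
}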
Conclusion
So there we have it - recommendations based on image embeddings, Cypher queries and collaborative filtering. I’ll be munging the recommendations into a format that can be entered into the Kaggle competition and will be waiting patiently for the cheque for the prize money to come through the door…