A deep dive into how we leverage neural networks to create relevant cross-sell recommendations
Have you ever noticed how, as you shop online and add items to your cart, the website suggests products that perfectly complement what you have already bought or added? This is known as cross-sell: recommending additional items that complement the products you are currently considering.
Traditionally, cross-sell has been driven by collaborative filtering or statistical “wisdom of the crowd” models. These methods are attractive because they are relatively simple and fast. Unfortunately, they do not work well for new products or for product categories that get little traffic, i.e. the long tail, because they cannot make recommendations for products without sufficient view or purchase data. Methods that use more abundant data (like views or clicks), or content-based unsupervised methods, can reliably provide similar-product recommendations, but they fail to capture the complex associations between products that make for good cross-sell.
Natural Language Processing (NLP) for Cross-Sell
NLP is a branch of linguistics and artificial intelligence that uses machine learning to analyze and convert natural language into useful representations or insights.
At Algonomy, we use NLP in many ways. One of them is extracting structured information from unstructured text, for instance gathering product features or metadata from a product's description (a topic for another blog post). Here, though, we will focus on our deep learning NLP cross-sell models, hereafter shortened to “DeepRecs NLP.”
What Data Do NLP Models Use?
Rather than using Product IDs as input, as a collaborative filtering or statistics-based method would, DeepRecs NLP uses natural language and structured metadata, such as product descriptions, reviews, brand, or categories. This allows the model to uncover patterns in the language used to describe a product and use them to inform cross-sell. More importantly, it allows the model to reach products that have rarely been viewed or bought, e.g. new or long-tail products. These are products with relatively little purchase (or view) data, as shown in Figure 1.
Figure 1. Plot illustrating product long tail. The majority of products have relatively little purchase data, making it difficult to use traditional recommendation strategies. Purchase count was capped at 1000 for plot clarity, though in some cases the actual purchase count was much higher.
We leverage shopper data to build neural networks (a specific type of machine learning model, broadly encompassed by the field of deep learning). These networks consider larger patterns of shopper behavior by learning from pairs of products that co-occur. For example, products that were purchased by the same user, within a given time window, are “connected.” Based on the number of times a product pair co-occurs across all users, each pair is assigned a score between 0 and 1, with a higher score denoting better cross-sell fit. Specifically,
the score for a pair is its co-occurrence count c(p1, p2), i.e. the number of times product p2 co-occurs with product p1, normalized into the range [0, 1] (Equation 1).
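For illustration, a simplified sketch of this pair counting and normalization could look like the following. The time window, the per-seed max normalization, and the function name are choices made purely for this sketch, not a description of our production pipeline.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_scores(purchases_by_user, window_seconds=30 * 24 * 3600):
    """Count product pairs purchased by the same user within a time window,
    then normalize each seed product's counts into [0, 1].

    purchases_by_user: dict mapping user_id -> list of (timestamp, product_id)
    Returns a dict mapping (seed, rec) -> score in [0, 1].
    """
    counts = defaultdict(int)
    for events in purchases_by_user.values():
        events = sorted(events)  # order each user's purchases by timestamp
        for (t1, p1), (t2, p2) in combinations(events, 2):
            if p1 != p2 and (t2 - t1) <= window_seconds:
                counts[(p1, p2)] += 1  # p1 as seed, p2 as rec
                counts[(p2, p1)] += 1  # and vice versa

    # Normalize per seed so that each seed's strongest rec scores 1.0
    max_per_seed = defaultdict(int)
    for (seed, _), c in counts.items():
        max_per_seed[seed] = max(max_per_seed[seed], c)
    return {pair: c / max_per_seed[pair[0]] for pair, c in counts.items()}
```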
From these product pairs, a seed product and a recommendation (rec) product are defined. Each has various information associated with it, such as its name, brand, description(s), or reviews. We normalize the text data, and from there, depending on the input type, we might process the input in different ways; more on that after we look at the results our clients are seeing.
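As a concrete, simplified illustration, text normalization and a single training example might look like this. The normalization steps, field names, and values are placeholders rather than our exact pipeline.

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """Basic text normalization: unicode cleanup, lowercasing, and
    punctuation/whitespace squashing. Illustrative only; the production
    pipeline may apply different or language-specific rules."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

# A training example pairs a seed product with a rec product, each carrying
# whatever fields the client's catalog provides (field names are hypothetical).
example = {
    "seed": {"name": normalize_text("Unicorn Blouse, Girls"), "brand": "acme kids"},
    "rec": {"name": normalize_text("Rainbow Leggings (2-Pack)"), "brand": "acme kids"},
    "target": 0.87,  # co-occurrence score from Equation 1
}
```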
Impact on eCommerce Metrics
DeepRecs NLP has consistently shown increases in key metrics our clients care about, such as click-through rate (CTR), average order value (AOV), and revenue per visit (RPV). Of the five clients for which we ran A/B tests comparing slates of recommendation strategies with and without DeepRecs NLP, the treatment side (with DeepRecs NLP) showed a statistically significant increase in CTR for all five. In most cases, DeepRecs NLP also showed an increase in AOV and RPV, with one site obtaining a 7.296% increase in RPV!
Many clients were so convinced by the initial results from DeepRecs NLP that they skipped the A/B test and immediately made it live to all traffic in key places like the Product Detail and Add-to-Cart pages. In one month alone, DeepRecs NLP served over 14.5 million viewed recommendations across 20 sites worldwide.
DeepRecs NLP Model Architecture
In our DeepRecs NLP system, we use two identical “stacks” of neural networks, one for seed products and another for rec products. In this section, we detail some of the specifics of our model architecture.
Figure 2. DeepRecs NLP model architecture. Orange boxes represent inputs and grey ovals represent model layers or interchangeable components. The green and gold rectangles represent intermediate seed and recommendation “stack” vectors, respectively.
Each input feature gets encoded in a different way, based on its type. For discrete categorical data, e.g. brand, we learn a dense “embedding” vector representation for each unique value, or a “multi-hot” embedded vector representation if there are multiple categories (a vector is basically a series of numbers like [0.52, 0.02, 0.97, 0.08, 0.55]). For unstructured text input, we might “embed” each word and run the words through an Attention layer [1][2], a mechanism for combining vectors that focuses on important values to get a feature-level representation. Alternatively, we might run the raw text through a pre-trained BERT model [3], followed by one or more fully-connected neural network layers that capture cross-sell specific relationships on top of the pre-trained model.
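The sketch below shows what two such encoders could look like, written in PyTorch purely for illustration: a lookup embedding for categorical values and an attention-pooled word embedding for text. The dimensions, layer choices, and class names are illustrative, and the BERT variant is omitted for brevity.

```python
import torch
import torch.nn as nn

EMB_DIM = 64  # illustrative; every feature encoder outputs a vector of this size

class CategoricalEncoder(nn.Module):
    """Learns a dense embedding per unique value (e.g. brand). For features
    with several values (e.g. multiple categories), the value embeddings are
    simply averaged into one "multi-hot" style vector."""
    def __init__(self, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM)

    def forward(self, ids):               # ids: (batch, n_values)
        return self.emb(ids).mean(dim=1)  # -> (batch, EMB_DIM)

class AttentionTextEncoder(nn.Module):
    """Embeds each word and pools the word vectors with a learned attention
    layer, so that important words contribute more to the feature vector."""
    def __init__(self, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM, padding_idx=0)
        self.score = nn.Linear(EMB_DIM, 1)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.emb(token_ids)                  # (batch, seq_len, EMB_DIM)
        pad = (token_ids == 0).unsqueeze(-1)
        scores = self.score(x).masked_fill(pad, float("-inf"))
        weights = torch.softmax(scores, dim=1)   # attention over words
        return (weights * x).sum(dim=1)          # -> (batch, EMB_DIM)
```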
The result is that we have vectors of the same size for each input. A benefit of processing inputs in this way is that we can ingest an arbitrary number of inputs, determined by the client. We then combine them all (usually with a weighted sum or another Attention layer) and optionally add more fully-connected layers to get a final product vector.
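Continuing the sketch, combining the equally sized feature vectors into a final product vector with a learned weighted sum followed by fully-connected layers could look like this (again, sizes and names are illustrative):

```python
import torch
import torch.nn as nn

EMB_DIM = 64  # must match the output size of the feature encoders above

class ProductStack(nn.Module):
    """Combines per-feature vectors (all of size EMB_DIM) with a learned
    weighted sum, then applies fully-connected layers to produce the final
    product vector. One such stack encodes seed products and an identical
    one encodes rec products."""
    def __init__(self, n_features, out_dim=128):
        super().__init__()
        self.feature_weights = nn.Parameter(torch.ones(n_features))
        self.mlp = nn.Sequential(
            nn.Linear(EMB_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, feature_vectors):  # list of (batch, EMB_DIM) tensors
        stacked = torch.stack(feature_vectors, dim=1)       # (batch, n_features, EMB_DIM)
        weights = torch.softmax(self.feature_weights, dim=0)
        combined = (weights.view(1, -1, 1) * stacked).sum(dim=1)  # weighted sum
        return self.mlp(combined)                            # -> (batch, out_dim)
```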
To produce a score between seed and rec products, we take the cosine similarity of the normalized final product vectors. For training, this similarity score is compared to the score from Equation 1 using a hinge loss. At prediction time, the similarity score can be used to rank pairs of products for cross-sell.
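A minimal sketch of the scoring and training objective is shown below. The cosine similarity follows the description above; the exact hinge formulation is not spelled out in this post, so the margin penalty here is only one plausible variant.

```python
import torch
import torch.nn.functional as F

def similarity(seed_vec, rec_vec):
    """Cosine similarity between the final seed and rec product vectors."""
    return F.cosine_similarity(seed_vec, rec_vec, dim=-1)  # (batch,)

def hinge_loss(sim, target, margin=0.1):
    """Illustrative margin penalty: the model is pushed to predict a
    similarity at least close to the Equation 1 co-occurrence score.
    The exact hinge formulation used in production may differ."""
    return torch.clamp(target - sim - margin, min=0.0).mean()

# At prediction time, similarity() alone is enough to rank candidate rec
# products for a given seed.
```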
Each of the “usual” model options mentioned above is automatically determined during hyper-parameter optimization, based on the complexity of the site catalog and loss metrics on a validation dataset. This allows each of our clients to have a custom, personalized model that is best suited to their catalog.
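Conceptually, this automated selection can be thought of as searching over a space of model options and keeping the configuration with the best validation loss. The option names and values below are hypothetical, purely to illustrate the idea:

```python
# Hypothetical search space; the actual options and tuning framework are not
# spelled out in this post.
SEARCH_SPACE = {
    "text_encoder": ["attention", "bert"],   # per-word attention vs. pre-trained BERT
    "embedding_dim": [32, 64, 128],
    "combiner": ["weighted_sum", "attention"],
    "extra_dense_layers": [0, 1, 2],
}

def pick_best(candidate_configs, validation_loss):
    """Train and evaluate each candidate configuration, then keep the one
    with the lowest loss on the validation dataset."""
    return min(candidate_configs, key=validation_loss)
```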
Mitigating Inherent Model Bias
BERT has been a breakthrough in the world of NLP for many tasks, including our cross-sell recommendation systems. Because they were trained on a lot of data from the internet, pre-trained BERT models can process just about any text in over 100 languages into a meaningful, dense representation that can be used in downstream tasks without the costs of training from scratch.
Unfortunately, training on internet text also means that the models implicitly encode potentially harmful language bias. Research suggests that BERT introduces language bias into systems that use it [4][5][6]. For example, when used in a Natural Language Generation system, it might always associate “Doctor” with male pronouns and “Nurse” with female pronouns.
In the context of cross-sell, consider children's clothing and toys. It is a highly gendered catalog space because that is what society and shoppers reinforce through norms and macro-level purchase patterns. Add BERT's gender bias into those interactions, and it is not hard to imagine a scenario where NLP cross-sell disproportionately recommends a kitchen play-set or Barbie doll for a unicorn blouse seed, but swim trunks or a backyard science kit for a shark graphic tee seed. Although contrived, this example demonstrates how training data can inject unfair bias into recommendations, ultimately impacting the interests or opportunities of different groups of people. At Algonomy, we are actively researching how large an issue this is in our models and how such bias can be mitigated.
Conclusion
Cross-sell models that understand industry context (in this case retail) and address domain-specific challenges like the catalog long tail can create high business value. At Algonomy, we leverage deep learning techniques together with shopper history, patterns, and insights to improve the relevance of cross-sell, enhance the shopping experience, and grow metrics such as AOV and RPV.
References
1. Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. “Effective Approaches to Attention-based Neural Machine Translation.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
2. Nikolas Adaloglou. 2020. “How Attention Works in Deep Learning: Understanding the Attention Mechanism in Sequence Models.” The AI Summer. https://theaisummer.com/attention/
3. Jacob Devlin et al. 2018. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.
4. Y. Tan and L. E. Celis. 2019. “Assessing Social and Intersectional Biases in Contextualized Word Representations.” In Advances in Neural Information Processing Systems (NeurIPS).
5. Keita Kurita et al. 2019. “Measuring Bias in Contextualized Word Representations.” In Proceedings of the First Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics.
6. Emily M. Bender et al. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery.