[Header image: people standing in front of a large screen displaying product options]

Challenges in AI-Driven Product Discovery in a ‘Cross-Retail’ Setting

David Wood & Marcel Marais

Jun 14, 2023

When it comes to online shopping, search and filtering are the primary tools customers use to find products they are actually interested in buying among the millions of options available. Customers broadly fall into two categories when researching what to buy online:

  1. Looking for a specific item, e.g. “iPhone 15 Pro Max”

  2. Discovery, e.g. “a new swimsuit for my holiday to Bali”

Traditional keyword algorithms (BM25, TF-IDF) work well in the first case, as the user knows which words will likely surface their results. The second case, however, relies heavily on conceptual understanding and context.
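To make the gap concrete, here is a minimal, illustrative sketch: the product titles and the simple word-overlap scoring are our own toy example, a stand-in for a real BM25/TF-IDF ranker rather than an implementation of one. The literal query matches strongly, while the discovery-style query barely touches a perfectly relevant product.

```python
# Toy keyword matcher: scores a product by how many query words appear in its title.
# A stand-in for BM25/TF-IDF style matching, stripped down for illustration.
products = [
    "Apple iPhone 15 Pro Max 256GB",
    "Floral one-piece quick-dry swimsuit",
    "Straw wide-brim summer hat",
]

def keyword_score(query: str, title: str) -> int:
    return len(set(query.lower().split()) & set(title.lower().split()))

# Case 1: the user knows the exact words -> strong match on the phone.
print([keyword_score("iPhone 15 Pro Max", p) for p in products])                      # [4, 0, 0]

# Case 2: a discovery-style query -> only the literal word "swimsuit" matches,
# and an equally relevant product described as a "bikini" would score zero.
print([keyword_score("a new swimsuit for my holiday to Bali", p) for p in products])  # [0, 1, 0]
```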

We’ve discussed before that we believe an important part of the solution to the second case is modelling semantic relationships between products. In this article we’d like to take a deeper look at some of the details and challenges faced when applying AI to ecommerce. We’ll also discuss why feeding our machine learning models with products from all retailers on the internet makes the problem significantly more complex.

Which AI model architecture is best suited to embedding product data?

In order to apply any modern machine learning techniques to our product data we first need to embed it, that is, to convert it to a standardised representation that uses a small amount of computer memory. This embedding representation, which is just a series of numbers (aka a vector), is the output of the first machine learning model we need to choose.

To understand a product completely, it’s important to look at both its image and its textual information, so we need a model that can recognise the connection between images and their descriptions. One such model is CLIP, developed by OpenAI, which stands for “Contrastive Language–Image Pre-training”. It is a multi-modal model, meaning it can take as input a product’s textual description, its image, or both, and create embeddings that are directly comparable to one another (they are embedded in the same shared “semantic space”).
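As a rough illustration of what creating these embeddings looks like in practice, here is a short sketch using an openly released CLIP checkpoint available through Hugging Face. The checkpoint name, file name, and product text are illustrative assumptions, not a description of our production pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly released CLIP checkpoint; an illustrative choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("straw_hat.jpg")                   # product photo (hypothetical file)
text = "Wide-brim straw summer hat with ribbon trim"  # product description

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=[text], return_tensors="pt", padding=True)
    image_embedding = model.get_image_features(**image_inputs)  # shape: (1, 512)
    text_embedding = model.get_text_features(**text_inputs)     # shape: (1, 512)

# Both embeddings live in the same space, so their similarity is directly meaningful.
similarity = torch.nn.functional.cosine_similarity(image_embedding, text_embedding)
print(similarity.item())
```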

How does CLIP work?

CLIP has a neural network model architecture and is trained on a vast dataset of pairs of images with their corresponding textual descriptions.  When an input, whether textual or visual, is provided, CLIP passes it through the neural network to create the embedding – we say it has been projected into the shared embedding space.

When the CLIP neural network was trained, its objective was for each training image and its paired text description to end up close together in the embedding space, while mismatched image and text pairs were pushed apart (this is the “contrastive” part of the name). Remember that an embedding is just a series of numbers, and it’s easy for computers to calculate the similarity between two such series. What remains is to make sure the training procedure and model parameters actually produce this behaviour. That’s taken care of by using encoder neural networks that have been proven to extract meaning from text and images, and by the experience of the machine learning engineers who set the training hyperparameters.
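For readers who want to see what “pulled together” and “pushed apart” mean mechanically, here is a compact sketch of a CLIP-style symmetric contrastive loss. It is a simplification (fixed temperature, random example embeddings), not the actual CLIP training code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embs: torch.Tensor, text_embs: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matching image/text pairs."""
    # Normalise so that dot products are cosine similarities.
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    # N x N similarity matrix; the diagonal holds the true (matching) pairs.
    logits = image_embs @ text_embs.T / temperature
    targets = torch.arange(len(image_embs))

    # Cross-entropy pulls each diagonal entry up (matching pair together)
    # and pushes off-diagonal entries down (mismatched pairs apart).
    loss_images = F.cross_entropy(logits, targets)    # image -> correct text
    loss_texts = F.cross_entropy(logits.T, targets)   # text -> correct image
    return (loss_images + loss_texts) / 2

# Example with random embeddings for a batch of 4 pairs, 512 dimensions each.
print(clip_style_loss(torch.randn(4, 512), torch.randn(4, 512)))
```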

It’s important to note that CLIP was trained on a huge dataset of natural language textual descriptions of images, rather than images manually labelled with single-word attributes for a specific classification task. This allows the embeddings it creates to be applied to many tasks without specific fine-tuning, including tasks that deal with more natural, free-form language.

Crucially for us, if we embed many product images and descriptions using CLIP, we can then query them easily for relevant products. For example, if we want to find a ‘straw summer hat’, all we need to do is embed that query and then look for products with similar embeddings. Similarly, we could start with an image as the input to find similar straw hats.
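Concretely, once a catalogue of products has been embedded (for example with the CLIP sketch above), text-to-product search reduces to a similarity lookup. The random embedding matrix below is a placeholder for real product embeddings, and the brute-force helper is an illustration rather than our production search stack.

```python
import numpy as np

# Pretend we have already embedded 10,000 products with CLIP:
# one 512-dimensional, L2-normalised row per product.
product_embeddings = np.random.randn(10_000, 512).astype("float32")
product_embeddings /= np.linalg.norm(product_embeddings, axis=1, keepdims=True)

def search(query_embedding: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Return the indices of the top_k products most similar to the query."""
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    scores = product_embeddings @ query_embedding   # cosine similarity per product
    return np.argsort(-scores)[:top_k]

# In practice this vector would come from embedding the text "straw summer hat".
query = np.random.randn(512).astype("float32")
print(search(query))   # indices of the most relevant products
```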

Which challenges arise when using multi-modal embedding models?

Hopefully it’s clear that multi-modal models are powerful tools for enabling search, and hence product discovery.  But they are only just beginning to be deployed commercially, and there are some good reasons for that:

  • Resource intensive: Multi-modal embedding models like CLIP require vast amounts of data and computational resources to train. The original CLIP from OpenAI was trained on 400 million image-text pairs, and much larger models have been released since, notably LAION’s ViT-bigG/14, which was trained on 2 billion image-text pairs.

  • Overly general pre-training: CLIP and similar models are generally pre-trained on very diverse datasets, and while this is good for building a general understanding of the world, it may result in a model that lacks domain-specific nuance.

    • Specifically in fashion ecommerce, the model may not have seen enough specific cases of styles and trends to truly understand them or recognise them in new contexts.

  • Data Alignment: Ensuring that the textual and visual data are correctly aligned is crucial. Misalignment can lead to poor model performance and inaccurate embeddings. 

    • For example, Doc Martens’ codenames for their shoes, like “1460”, are poor descriptions of the actual products they represent.

  • Latency Issues: Real-time search and recommendations are crucial, but these large models can take a long time to perform inference because of the multiplication of huge parameter matrices inside the neural networks. One common mitigation is sketched just after this list.
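A common way to keep query-time latency down (a general pattern, not a description of Moonsift’s specific infrastructure) is to embed the product catalogue offline and serve queries from a vector index such as FAISS, so that only the short query text needs a forward pass through the model at request time:

```python
import faiss
import numpy as np

dim = 512

# Offline: embed the whole catalogue once (slow, done ahead of time)
# and build an index over the L2-normalised vectors.
catalogue = np.random.randn(100_000, dim).astype("float32")
catalogue /= np.linalg.norm(catalogue, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)   # exact inner-product search; ANN indexes scale further
index.add(catalogue)

# Online: embed only the query, then look up the nearest products in milliseconds.
query = np.random.randn(1, dim).astype("float32")
query /= np.linalg.norm(query)
scores, product_ids = index.search(query, 5)
print(product_ids)
```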

'Cross-retail' data

By 'cross-retail', we mean product data from across many retailers (tens of thousands), each with their own set of product offerings, data formats, and customer flows. This multi-source approach presents several challenges for modelling.

Why is using AI with cross-retail product data so challenging?

At a high level, AI / Machine Learning models are trained to recognise patterns in data. When you’re focusing on a single retailer, with their own method of presenting their data / product offering, you can invest time into creating a custom preprocessing solution that will help the model understand that retailer's specific case. However, when trying to build a model that understands all products on the internet you’re not afforded that luxury. Here’s just the tip of the iceberg of challenges that we’ve encountered when dealing with cross-retail data at scale:

  • Different data formats across retailers: Each retailer has its own way of categorising, describing, and presenting products. And while there are common patterns, standardising this data for AI analysis can be a daunting task (a simplified view of what that standardisation involves is sketched after this list).

    • At Moonsift we’ve done years of R&D and analysed thousands of retailers to better understand this problem.  

  • Inconsistent Data Formats within Retailers: Even within a single retailer, data might be presented in varied formats, making collection and standardisation even more challenging.

  • Dynamic Nature of Retail: Prices, stock availability, images, and product descriptions can change frequently. Keeping AI models updated with this dynamic data is a continuous challenge.

  • Volume and Scalability: The sheer amount of product data across tens of thousands of retailers can be immense. Handling and processing this data in real-time requires scalable AI solutions.

  • Low Quality Data and Spam: The quality of product images and descriptions can vary widely among retailers. This inconsistency can affect the accuracy and quality of the embeddings created by AI. Malicious keyword-stuffing techniques (typically found on marketplaces like Amazon) also negatively impact embedding quality.
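To give a flavour of what “standardising” involves, here is a deliberately simplified sketch of mapping two differently shaped retailer feeds onto a single common product schema. The field names and example payloads are invented for illustration; real feeds are far messier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:
    """A minimal common schema that downstream embedding models can rely on."""
    title: str
    description: str
    price_pence: Optional[int]
    image_url: Optional[str]

def normalise_retailer_a(item: dict) -> Product:
    # Retailer A nests pricing and quotes prices in pounds as a float.
    return Product(
        title=item["name"],
        description=item.get("long_description", ""),
        price_pence=int(item["pricing"]["gbp"] * 100),
        image_url=item.get("main_image"),
    )

def normalise_retailer_b(item: dict) -> Product:
    # Retailer B flattens everything and stores prices in pence as strings.
    return Product(
        title=item["product_title"],
        description=item.get("desc", ""),
        price_pence=int(item["price"]) if item.get("price") else None,
        image_url=(item.get("images") or [None])[0],
    )

print(normalise_retailer_a({"name": "Straw summer hat", "pricing": {"gbp": 24.0}}))
print(normalise_retailer_b({"product_title": "1460 boot", "price": "15900", "images": []}))
```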

Final thoughts

The path to flawless AI-driven product discovery is paved with many hurdles. Nevertheless, with dedicated research and continuous refinement, the potential for a seamless AI-aided shopping experience across multiple retailers is on the horizon. At Moonsift we’re diving deep into these challenges as we build your Shopping Copilot for the entire internet.