Fashion Search using Composed Image Retrieval (CIR)

Grad-CAM image of a shoe

With fashion trends changing in the blink of an eye and online inventories scattered across countless retailers, finding the perfect item feels like searching for a needle in a haystack. This challenge, amplified by the rise of "fast fashion", leaves shoppers scrolling endlessly, struggling to match their evolving tastes with the right products.

Composed Image Retrieval (CIR) is an innovative approach that redefines how we search for fashion online. Traditional search tools often fall short, relying solely on keywords or tagged images, leaving little room for personalization or creativity. CIR expands on this by blending visual and textual inputs: Imagine showing an image of a black dress and saying, "Make it blue with a V-neck". CIR understands this combination, retrieving precisely what you envisioned.

By unifying visual and textual data, CIR bridges the gap between the shopper's imagination and available inventory, turning the "messy middle" of discovery into a seamless, intuitive experience. As Moonsift explores smarter, AI-driven ways to tackle product discovery challenges, CIR is one of these transformative AI tools, promising a future where the perfect match is within easy reach.

The Tech Behind CIR: Multimodal Fusion

At the heart of CIR is multimodal fusion, a process that unifies visual and textual data into a shared "language" that machines can understand. Powered by Machine Learning models such as CLIP (Contrastive Language-Image Pretraining), CIR maps images and descriptive text into a common embedding space, representing both as lists of numbers so they can be compared and combined seamlessly. Curious about how CLIP works? Check out our article Challenges in AI-Driven Product Discovery in a ‘Cross-Retail’ Setting.

Example of Composed Image Retrieval (CIR) process

Using the diagram above, here's how CIR works: a source image is paired with a user-provided modification — such as "have ankle straps" — and both are processed through specialized encoders. The visual and textual features are then fused, either through linear or non-linear methods, creating a unified query that guides the system to retrieve the ideal match. This technique ensures that CIR doesn't just retrieve visually similar items but can also adapt to intricate, user-specified changes.
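To make this pipeline concrete, here is a minimal sketch of encoding and fusing a composed query with an off-the-shelf CLIP checkpoint from Hugging Face and a simple weighted-sum fusion. The checkpoint name, the 0.5 mixing weight, and the gallery variables are illustrative assumptions, not the configuration used in this project.

```python
# Minimal sketch: encode a source image and a modification text with CLIP,
# fuse them into one query vector, and retrieve the nearest gallery items.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def compose_query(source_image: Image.Image, modification_text: str,
                  lam: float = 0.5) -> torch.Tensor:
    """Encode both modalities with CLIP and mix them into a single query."""
    inputs = processor(text=[modification_text], images=[source_image],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalise each modality, then take a weighted sum (linear fusion).
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    query = lam * text_emb + (1 - lam) * image_emb
    return query / query.norm(dim=-1, keepdim=True)

def retrieve(query: torch.Tensor, gallery_embeddings: torch.Tensor,
             k: int = 5) -> torch.Tensor:
    """Indices of the k most similar gallery items (gallery assumed L2-normalised)."""
    sims = query @ gallery_embeddings.T
    return sims.topk(k, dim=-1).indices
```

In practice the gallery embeddings would be precomputed with the same CLIP image encoder, so that query and catalogue live in the same embedding space.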

Moonsift has been exploring similar multimodal capabilities in its AI-powered Shopping Copilot, leveraging CLIP-like models to enhance search precision across diverse inventories from different retailers. By bridging the gap between images and words, CIR helps ensure that every query is met with results that truly resonate.

The Potential of CIR

The Composed Image Retrieval (CIR) system underwent rigorous evaluation on the standardised FashionIQ and Shoes datasets, providing key insights into its ability to revolutionize fashion search.
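On these benchmarks, retrieval quality is commonly reported as Recall@K: the fraction of queries whose ground-truth target appears among the top-K retrieved items. Below is a minimal sketch of that metric, assuming L2-normalised query and gallery embeddings; the function and variable names are illustrative.

```python
import torch

def recall_at_k(query_embs: torch.Tensor, gallery_embs: torch.Tensor,
                target_indices: torch.Tensor, k: int = 10) -> float:
    """Fraction of queries whose ground-truth target is in the top-k retrievals."""
    sims = query_embs @ gallery_embs.T                 # cosine similarity (normalised inputs)
    topk = sims.topk(k, dim=-1).indices                # (num_queries, k)
    hits = (topk == target_indices.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()
```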

Raw-Data Level Unification

Different combinations for raw-data level multimodal (visual and textual) query unification

Raw-Data Level Unification is one of CIR's innovative techniques for exploring the vast potential of embedding spaces: experiments combine textual and visual data at the raw-data level, as shown in the diagram above, creating unified queries that seamlessly blend user intent with source image attributes. Using the BLIP-2 image captioning model, textual unification merges the modification text with a description of the source image, retaining full semantic context, which is ideal for complex search requests. Meanwhile, visual unification integrates key modification words, extracted via a Large Language Model, directly into the source image, preserving its visual essence while embedding crucial target attributes by leveraging CLIP's transformer-based OCR capabilities.
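As a rough illustration of the textual-unification step, the sketch below captions the source image with a BLIP-2 checkpoint from Hugging Face and merges the caption with the modification text. The checkpoint name and the merge template are assumptions for illustration, not necessarily the exact prompt format used in the experiments.

```python
# Rough sketch of raw-data level *textual* unification: caption the source
# image with BLIP-2, then merge the caption with the user's modification text.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def unify_textually(source_image: Image.Image, modification_text: str) -> str:
    """Build a unified textual query from a source image and a modification."""
    inputs = processor(images=source_image, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=30)
    caption = processor.batch_decode(generated, skip_special_tokens=True)[0].strip()
    # Merge the source description and the requested change into one query
    # string, which can then be embedded with the CLIP text encoder downstream.
    return f"{caption}, but {modification_text}"

# Example (hypothetical): a photo of red high heels plus "have ankle straps"
# might yield "a pair of red high heel shoes, but have ankle straps".
```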

The result? A shared query language that forms the foundation for CIR’s precision. This unification not only enhances accuracy but serves as a launchpad for advanced visualizations such as Grad-CAM and embedding navigation (see below).

Grad-CAM Heatmaps: Visualizing Attention

Below is a triplet example of Grad-CAM heatmap visualisation for different unification combinations on source and target images from the Shoes dataset, computed from the gradients of the CLIP model with respect to its final retrieval embedding output:

Triplet example: Grad-CAM heatmaps for different unification combinations on source and target images from the Shoes dataset

With Grad-CAM, CIR's decision-making process is brought to life. The heatmaps reveal how CIR pinpoints relevant features in source and target images based on user queries. Here, Grad-CAM highlights areas such as ankle straps or open-toe features when refining queries for footwear.

Integrating both unified visual and textual queries significantly improved the model's ability to capture and focus on the necessary features of the target image, as evidenced by the more targeted attention in the Grad-CAM visualizations and higher recall scores compared to other models. This visualization shows that CIR doesn't just search: it understands and adapts to the user's intent.
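For readers curious how such heatmaps can be produced, here is a minimal Grad-CAM sketch for the retrieval setting: the "score" is the similarity between the image embedding and the composed query, and the heatmap comes from gradients at a chosen convolutional layer. It assumes a ResNet-style CLIP visual tower (so a spatial feature map exists to hook into); the layer choice and helper names are illustrative assumptions.

```python
import torch

def grad_cam(image_encoder, target_layer, pixel_values, query_embedding):
    """Return a (B, H, W) heatmap of regions driving similarity to the query."""
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["value"] = output                  # feature maps at target_layer
    def bwd_hook(module, grad_inputs, grad_outputs):
        gradients["value"] = grad_outputs[0]           # gradients flowing into them

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        image_emb = image_encoder(pixel_values)
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        score = (image_emb * query_embedding).sum()    # similarity to the retrieval query
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    acts = activations["value"]                        # (B, C, H, W)
    grads = gradients["value"]
    weights = grads.mean(dim=(2, 3), keepdim=True)     # channel importance (avg-pooled grads)
    cam = torch.relu((weights * acts).sum(dim=1))      # weighted sum over channels
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # normalise to [0, 1]
```

For a ResNet-based CLIP, a reasonable target layer would be the final residual stage of the visual tower; the resulting low-resolution map is then upsampled and overlaid on the input image to produce heatmaps like those shown above.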

Linear vs. Non-Linear Fusion

Below is an example of Grad-CAM heatmap visualization for the linear and non-linear fusion models on source and target images, computed from the gradients of the CLIP model with respect to its final retrieval embedding output.

Grad-CAM heatmaps for linear and non-linear fusion models on source and target images

This demonstrates how CIR employs linear and non-linear fusion to integrate textual and visual features, adapting to varying levels of query complexity.

Linear fusion, as defined below, combines the textual and visual features using a weighted sum:

Linear Fusion equation
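The equation itself is shown as an image above; in LaTeX form, and assuming the standard CLIP-style convex combination (the exact notation in the figure may differ), it reads:

$$
f_q = \lambda\,\psi_T(t) + (1-\lambda)\,\psi_I(x_s), \qquad \lambda \in [0, 1],
$$

where $\psi_I$ and $\psi_T$ are the CLIP image and text encoders, $x_s$ is the source image, $t$ is the modification text, and $\lambda$ is the mixing weight.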

The above example illustrates how linear fusion handles the query "have ankle straps" for a red high heel. The model efficiently aligns the query with the visual data, focusing on the ankle strap region without excessive computation.

For more intricate queries, non-linear fusion introduces additional complexity with transformations through a Combiner network:

Non-Linear Fusion equation

Here, the additional term represents non-linear outputs from activation and dropout layers, allowing the model to capture richer relationships. The example shows how the model adapts to "add ankle straps" by homing in on the straps while balancing other shoe features, demonstrating its ability to navigate nuanced modifications.
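To make the Combiner idea concrete, here is a minimal PyTorch-style sketch of a non-linear fusion module: a learned mixing gate plus a non-linear residual branch built from linear, activation, and dropout layers. The layer sizes, dropout rate, and exact architecture are assumptions for illustration, not the project's actual configuration.

```python
import torch
import torch.nn as nn

class Combiner(nn.Module):
    """Non-linear fusion of CLIP image and text embeddings (illustrative sketch)."""

    def __init__(self, dim: int = 512, hidden: int = 1024, p_drop: float = 0.5):
        super().__init__()
        # Non-linear branch over the concatenated image/text features.
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, dim),
        )
        # Scalar gate deciding how much to weight text vs. image.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([image_emb, text_emb], dim=-1)
        lam = self.gate(joint)                                   # learned mixing weight
        fused = lam * text_emb + (1 - lam) * image_emb + self.mlp(joint)
        return fused / fused.norm(dim=-1, keepdim=True)
```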

However, this added depth can sometimes introduce noise, as seen in the heatmaps, where attention may slightly shift away from the user's primary intent. Grad-CAM visualizations help identify these focus areas, offering insights into the model's reasoning and ensuring explainability.

By combining the efficiency of linear fusion with the precision of non-linear fusion, CIR adapts to a wide range of queries. These methods ensure CIR can handle both straightforward requests like colour adjustments and complex queries involving subtle design changes. This aligns with Moonsift’s goal of creating flexible, user-centric discovery tools, blending speed and depth for a seamless shopping experience.

Navigating the Embedding Space

t-SNE Visualization of Embeddings: This figure illustrates the embedding space where different elements of the CIR process are projected onto a 2D space, showing how the system navigates from the source image and modification text to the target image.

To unlock the full potential of CIR, navigating the embedding space is key. Using the two-dimensional visualization in the diagram above, we demonstrate how raw-data level query unification transforms embeddings to bring them closer to the target image. By merging the modification text and source image into unified textual and visual queries, the system effectively narrows the gap in the embedding space.
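A sketch of how such a two-dimensional view can be produced from precomputed embeddings (for example, the source image, the unified queries, and the target image) is shown below; the variable names and t-SNE settings are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_space(embeddings: np.ndarray, labels: list[str]) -> None:
    """Project a small set of embeddings to 2-D with t-SNE and label each point."""
    coords = TSNE(n_components=2,
                  perplexity=min(5, len(embeddings) - 1),  # must be < number of points
                  init="pca", random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, labels):
        plt.annotate(label, (x, y))
    plt.title("CIR embedding space (t-SNE)")
    plt.show()

# Example usage with hypothetical precomputed vectors:
# plot_embedding_space(np.stack([source_emb, text_query_emb, visual_query_emb, target_emb]),
#                      ["source", "unified text query", "unified visual query", "target"])
```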

Pushing the Boundaries of AI in Fashion

Composed Image Retrieval (CIR) is a reimagination of how users interact with fashion online. By combining visual and textual inputs, CIR transforms the shopping journey, enabling users to find exactly what they’re looking for. Whether it’s modifying a dress to “make it blue with a higher neckline” or searching for shoes with “ankle straps and a glossy finish”, CIR delivers a personalized, intuitive experience. This innovation holds immense potential for e-commerce platforms such as Moonsift, bridging the gap between user imagination and actual product discovery.

Like any cutting-edge innovation, CIR faces challenges in its adoption. The computational demands of its models are significant, and aligning datasets to account for varying product descriptions and visuals is challenging. These limitations highlight opportunities for future research—optimizing model efficiency, enhancing dataset standardization, and exploring novel architectures to improve scalability and accuracy.

Moonsift’s cross-retailer discovery tools unify product inventories from diverse retailers into a single, seamless search experience. CIR’s multimodal capabilities are already allowing shoppers to effortlessly navigate across brands and retailers, adjusting the style or attributes on items as they desire. We believe CIR will not only make shopping more intuitive but also set a new standard for how users interact with technology in the fashion space, one query at a time.

This work was undertaken as part of a Master's project by Toby Wong at the University of Edinburgh, co-supervised by Dr. Pavlos Andreadis (Lecturer, Informatics) and Dr. David Wood (CTO, Moonsift).
