arXiv 2507.17080

VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings

By Ramin Giahi, Kehui Yao, et al.

Published 2025-07-22

Multimodal learning plays a critical role in today's e-commerce recommendation platforms, enabling accurate recommendations and product understanding. However, existing vision-language models such as CLIP face key challenges in this setting: 1) weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance;…
