arXiv 2507.17080
VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings
By Ramin Giahi, Kehui Yao, et al.
Published 2025-07-22
Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance;…