TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Arizona State University; University of Maryland, Baltimore County

NeurIPS 2024

Dataset

We generate a synthetic dataset to counter the lack of compositional diversity in CC3M and CC12M by complementing these datasets with hard negative captions and corresponding negative images.
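
For intuition, the sketch below shows one way such triplets (image, positive caption, negative image, negative caption) could feed a CLIP-style contrastive objective, with the synthetic negatives acting as extra in-batch candidates. This is an illustrative PyTorch example under assumed shapes and names, not the paper's exact loss; triplet_contrastive_loss and the temperature value are hypothetical.

import torch
import torch.nn.functional as F

def triplet_contrastive_loss(img_emb, txt_emb, neg_img_emb, neg_txt_emb,
                             temperature=0.07):
    # All inputs are (batch, dim) embeddings from the image/text encoders.
    # Hypothetical setup: each row i holds a positive image-caption pair plus
    # a synthetic hard-negative image and caption generated for that pair.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    neg_img_emb = F.normalize(neg_img_emb, dim=-1)
    neg_txt_emb = F.normalize(neg_txt_emb, dim=-1)

    # Candidate pools: in-batch positives plus the synthetic hard negatives,
    # so each image must pick its caption out of 2B candidates (and vice versa).
    all_txt = torch.cat([txt_emb, neg_txt_emb], dim=0)  # (2B, dim)
    all_img = torch.cat([img_emb, neg_img_emb], dim=0)  # (2B, dim)

    logits_i2t = img_emb @ all_txt.t() / temperature    # (B, 2B)
    logits_t2i = txt_emb @ all_img.t() / temperature    # (B, 2B)

    # The positive caption/image for row i sits at column i of each logit matrix.
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits_i2t, targets) +
            F.cross_entropy(logits_t2i, targets)) / 2

In this framing, a hard-negative caption is penalized like any other wrong caption, but because it differs from the positive only compositionally (e.g. swapped attributes or relations), it pushes the encoders to represent word order and structure rather than bag-of-words content.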

Performance

Compositional evaluation of methods on the SugarCrepe benchmark.



Zero-shot image-text retrieval and classification results.



Ablation on filtering high-quality image-text pairs from TripletData.



What's holding back CLIP models? Ablation with respect to frozen modality encoders.

Relevant Projects

ECLIPSE (CVPR'24)

A Resource-Efficient Text-to-Image Prior for Image Generations

WOUAF (CVPR'24)

Weight Modulation for User Attribution and Fingerprinting in T2I Models.