TL;DR

Multi-modal training aligns vision and language representations in a shared embedding space. CLIP-style contrastive learning matches images with their captions, which enables zero-shot classification, image search, and other vision-language tasks.

Training approaches

Contrastive (CLIP-style):

  • Pair images with captions
  • Learn shared embedding space
  • Pull an image and its matching caption close together in that space; push mismatched pairs apart (loss sketch below)
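
A minimal sketch of the symmetric contrastive loss behind CLIP-style training, in PyTorch. The embedding tensors, batch size, and temperature value here are illustrative assumptions, not CLIP's actual code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities: row i is image i vs. every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Random embeddings standing in for vision and text encoder outputs.
images = torch.randn(8, 512)
captions = torch.randn(8, 512)
print(clip_contrastive_loss(images, captions))
```

Because every mismatched image-caption pair in the batch acts as a negative, larger batches generally give a stronger contrastive signal.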

Captioning (encoder-decoder):

  • Vision encoder + language decoder
  • Generate natural-language descriptions of images (see the sketch below)
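
A rough sketch of the encoder-decoder setup, assuming patch features from some vision encoder and a small Transformer decoder that cross-attends to them; module names, sizes, and the vocabulary are made up for illustration.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=512, num_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, image_features):
        # caption_tokens: (batch, seq_len) token ids of the partial caption
        # image_features: (batch, num_patches, d_model) from the vision encoder
        x = self.token_emb(caption_tokens)
        # Causal mask so each position only attends to earlier caption tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        # Cross-attention to the image features happens inside the decoder layers.
        x = self.decoder(tgt=x, memory=image_features, tgt_mask=mask)
        return self.lm_head(x)  # next-token logits

# Stand-in for real encoder outputs: 8 images, 196 patch features each.
image_features = torch.randn(8, 196, 512)
caption_tokens = torch.randint(0, 30000, (8, 12))
logits = CaptionDecoder()(caption_tokens, image_features)
print(logits.shape)  # (8, 12, 30000)
```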

Visual question answering:

  • Combine vision and language understanding
  • Answer natural-language questions about images (see the sketch below)
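
A simplified VQA sketch under similar assumptions: precomputed image patch features and question token embeddings are fused by self-attention and classified over a fixed answer vocabulary. The shapes and vocabulary size are placeholders, not a specific published model.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, d_model=512, num_answers=3000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, image_features, question_embeddings):
        # Concatenate the modalities along the sequence dimension so
        # self-attention can mix image and question information.
        fused = self.fusion(torch.cat([image_features, question_embeddings], dim=1))
        # Pool the fused sequence and score each candidate answer.
        return self.classifier(fused.mean(dim=1))

image_features = torch.randn(4, 196, 512)       # e.g. ViT patch features
question_embeddings = torch.randn(4, 16, 512)   # e.g. BERT token embeddings
print(SimpleVQA()(image_features, question_embeddings).shape)  # (4, 3000)
```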

Data requirements

  • Image-text pairs (millions needed)
  • Sources: Web scraping, curated datasets
  • Quality matters more than quantity

Architectures

  • Vision encoders: ViT, ResNet, ConvNeXt
  • Text encoders: BERT, GPT, T5
  • Fusion: cross-attention, adapter layers (adapter sketch below)
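
For the adapter-layer option, a minimal sketch: a small bottleneck MLP with a residual connection that could sit after a frozen Transformer block, so only the adapter weights are trained for the multi-modal task. Dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen backbone's features intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

hidden = torch.randn(2, 196, 768)  # e.g. output of a frozen ViT block
print(Adapter()(hidden).shape)     # shape unchanged: (2, 196, 768)
```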

Training challenges

  • Computational cost (large datasets and models)
  • Alignment difficulty: noisy or loosely related captions weaken the learned correspondence
  • Modality imbalance: one modality can dominate the shared representation

Applications

  • Zero-shot image classification (see the sketch after this list)
  • Image search
  • Visual chatbots
  • Content moderation
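
For zero-shot classification, a short sketch using the openai/clip-vit-base-patch32 checkpoint through Hugging Face transformers. The image path, the label prompts, and the availability of the checkpoint and libraries are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("photo.jpg")  # placeholder image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The label whose prompt embedding is most similar to the image embedding wins, which is why no task-specific training is needed.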