
YouTube

Fine-tuning Multimodal Embeddings for Custom Text-Image Pairs Using CLIP

Shaw Talebi via YouTube

Overview

Learn how to fine-tune CLIP (Contrastive Language-Image Pre-training) models on custom text-image pairs through a detailed video tutorial that walks through the process using YouTube titles and thumbnails. Master the implementation of multimodal embeddings using the Sentence Transformers Python library through a step-by-step approach covering data gathering, preprocessing, evaluation definition, model fine-tuning, and performance assessment. Access comprehensive resources including a detailed blog post, GitHub repository with code examples, pre-trained model on Hugging Face, and the complete dataset used for training. Explore practical applications of zero-shot learning, understand CLIP's limitations, and discover how to overcome them through custom fine-tuning for specific use cases.
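The tutorial's actual implementation uses the Sentence Transformers library, but the contrastive objective that CLIP (and its fine-tuning) optimizes can be sketched in plain NumPy. The sketch below is an illustrative, self-contained version of the symmetric contrastive (InfoNCE) loss over a batch of matched image/text embeddings; the function name and temperature value are illustrative choices, not code from the video.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings.

    Row i of img_emb is assumed to pair with row i of txt_emb.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # logits[i, j] = similarity of image i to caption j, sharpened by temperature
    logits = img @ txt.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal (true pair) as target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions, as in CLIP
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Fine-tuning on custom pairs (e.g. YouTube titles and thumbnails) amounts to minimizing this loss so that each thumbnail embedding lands closest to its own title embedding within the batch.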

Syllabus

Intro - 0:00
Multimodal Embeddings - 0:44
0-shot Use Cases - 2:30
Limitations of CLIP - 3:50
Fine-tuning CLIP - 5:14
Step 1: Gather training data - 6:46
Step 2: Preprocess data - 15:20
Step 3: Define evals - 17:20
Step 4: Fine-tune model - 19:22
Step 5: Evaluate model - 26:04
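Steps 3 and 5 above center on defining and running evaluations. The video's exact metric is not specified here, but a common and minimal choice for text-image retrieval is Recall@k, sketched below in NumPy as an assumption-labeled illustration (function name and matched-ordering convention are hypothetical, not from the tutorial).

```python
import numpy as np

def recall_at_k(img_emb, txt_emb, k=1):
    """Fraction of images whose true caption ranks in the top-k by cosine similarity.

    Assumes matched ordering: image i's true caption is row i of txt_emb.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T  # sims[i, j]: similarity of image i to caption j
    # Indices of the k most similar captions for each image
    top_k = np.argsort(-sims, axis=1)[:, :k]
    # A hit means the true caption (index i) appears among image i's top-k
    hits = (top_k == np.arange(len(img_emb))[:, None]).any(axis=1)
    return hits.mean()
```

Computing this metric before and after fine-tuning gives a concrete before/after comparison for Step 5.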

Taught by

Shaw Talebi

