

Fine-tuning Pixtral - Multi-modal Vision and Text Model

Trelis Research via YouTube

Overview

Explore the process of fine-tuning Pixtral, a multi-modal vision and text model, in this comprehensive tutorial video. Learn about Pixtral's architecture, including its custom image encoder trained from scratch, and follow step-by-step instructions for fine-tuning in a Jupyter notebook. Discover GPU setup requirements, dataset preparation techniques, and advanced chat templating. Gain insights into evaluating baseline performance, setting up LoRA fine-tuning, and optimizing training arguments. Explore methods for merging LoRA adapters, measuring OCR performance, and setting up an API endpoint using vLLM for inference. Access additional resources, including slides, datasets, and code repositories, to enhance your understanding of Pixtral fine-tuning techniques.
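As a rough illustration of what merging LoRA adapters means, the sketch below folds a low-rank update into a base weight matrix using the standard LoRA formula, merged = W + (alpha / r) * B A. It is not taken from the video: the function names `matmul` and `merge_lora` and the toy matrices are hypothetical, and plain Python lists stand in for real tensors. In practice a library such as PEFT would typically perform this step.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a).

    weight: (out_features, in_features)
    lora_a: (r, in_features)
    lora_b: (out_features, r)
    """
    rank = len(lora_a)              # r is the number of rows in lora_a
    scale = alpha / rank            # standard LoRA scaling factor
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example (hypothetical values): a 2x2 base weight and a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = [[1.0, 2.0]]     # shape (1, 2)
adapter_b = [[1.0], [0.0]]   # shape (2, 1)
merged = merge_lora(base, adapter_a, adapter_b, alpha=2.0)
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged model can be pushed to a hub and served like any ordinary checkpoint.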

Syllabus

How to fine-tune Pixtral
Video overview
Pixtral architecture and design choices
Mistral’s custom image encoder - trained from scratch
Fine-tuning Pixtral in a Jupyter notebook
GPU setup for notebook fine-tuning and VRAM requirements
Getting a “transformers” version of Pixtral for fine-tuning
Loading Pixtral
Dataset loading and preparation
Chat templating (somewhat advanced, but recommended)
Inspecting and evaluating baseline performance on the custom data
Setting up data collation, including for multi-turn training
Training on completions only (tricky, but improves performance)
Setting up LoRA fine-tuning
Setting up training arguments (batch size, learning rate, gradient checkpointing)
Setting up TensorBoard
Evaluating the trained model
Merging LoRA adapters and pushing the model to the hub
Measuring performance on OCR (optical character recognition)
Running inference on Pixtral with vLLM and setting up an API endpoint
Video resources
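
The "training on completions only" item in the syllabus refers to computing the loss on assistant responses while ignoring prompt tokens. A minimal sketch of the idea follows, assuming the common convention of marking ignored positions with the label -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The helper `mask_prompt_tokens`, the token IDs, and the span positions are hypothetical and not taken from the video.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_prompt_tokens(input_ids, completion_spans):
    """Build labels that keep only completion tokens.

    input_ids: list of token IDs for the full conversation.
    completion_spans: list of (start, end) index pairs, end exclusive,
        one per assistant turn, so multi-turn conversations are supported.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in completion_spans:
        # Copy the real token IDs back in for assistant tokens only.
        labels[start:end] = input_ids[start:end]
    return labels


# Toy two-turn conversation (hypothetical token IDs and spans).
tokens = [5, 6, 7, 8, 9, 10]
labels = mask_prompt_tokens(tokens, [(2, 4), (5, 6)])
```

The tricky part in a real pipeline is locating the span boundaries, which depend on the chat template and tokenizer; the sketch takes them as given.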

Taught by

Trelis Research
