Fine-tuning Pixtral - Multi-modal Vision and Text Model

This tutorial video walks through fine-tuning Pixtral, a multi-modal vision and text model. It covers Pixtral's architecture, including its custom image encoder trained from scratch, and gives step-by-step instructions for fine-tuning in a Jupyter notebook. Topics include GPU setup requirements, dataset preparation, and advanced chat templating, followed by evaluating baseline performance, setting up LoRA fine-tuning, and choosing training arguments. The video also shows how to merge LoRA adapters, measure OCR performance, and set up an API endpoint with vLLM for inference. Additional resources, including slides, datasets, and code repositories, accompany the video.

Syllabus

How to fine-tune Pixtral
Video Overview
Pixtral architecture and design choices
Mistral’s custom image encoder - trained from scratch
Fine-tuning Pixtral in a Jupyter notebook
GPU setup for notebook fine-tuning and VRAM requirements
Getting a “transformers” version of Pixtral for fine-tuning
Loading Pixtral
Dataset loading and preparation
Chat templating (somewhat advanced, but recommended)
Inspecting and evaluating baseline performance on the custom data
Setting up data collation, including for multi-turn training
Training on completions only (tricky, but improves performance)
Setting up LoRA fine-tuning
Setting up training arguments (batch size, learning rate, gradient checkpointing)
Setting up TensorBoard
Evaluating the trained model
Merging LoRA adapters and pushing the model to the Hub
Measuring performance on OCR (optical character recognition)
Running inference on Pixtral with vLLM and setting up an API endpoint
Video resources
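
The "training on completions only" step in the syllabus can be illustrated with a minimal sketch. It is not the code used in the video. It assumes the common convention of setting the labels for prompt tokens to -100, the default ignore_index of PyTorch's CrossEntropyLoss, so that only assistant tokens contribute to the loss. The helper name mask_prompt_tokens, the token ids, and the span positions are all hypothetical; in practice the spans would be derived from the chat template.

```python
# Label value that PyTorch's CrossEntropyLoss skips by default.
IGNORE_INDEX = -100


def mask_prompt_tokens(input_ids, assistant_spans):
    """Build labels that keep only the tokens inside assistant spans.

    input_ids: list of token ids for one (possibly multi-turn) conversation.
    assistant_spans: list of (start, end) half-open index pairs marking
        where each assistant reply sits in input_ids (hypothetical input).
    """
    # Start with every position ignored, then copy back the assistant tokens.
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Made-up token ids for a two-turn chat: user, assistant, user, assistant.
input_ids = [1, 10, 11, 12, 20, 21, 2, 13, 14, 22, 23, 2]
assistant_spans = [(4, 7), (9, 12)]

labels = mask_prompt_tokens(input_ids, assistant_spans)
print(labels)
```

Because both assistant spans are kept, the same masking covers the multi-turn case: every assistant reply in the conversation is trained on, while user turns and any image placeholder tokens in the prompt are ignored.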