Build an AI Document Processing Pipeline for RAG - OCR, Image to Text, VLM, Chunking

This 26-minute tutorial walks you through building a complete AI document processing pipeline that runs locally and prepares documents for RAG applications. Learn to convert PDFs to Markdown, perform OCR, extract text from images using Vision Language Models, implement semantic chunking with LLMs, and add contextual enrichment to each chunk. The step-by-step guide uses Docling for document processing and Ollama with Gemma 3 for LLM-based chunking. Follow along as the instructor processes a sample NVIDIA financial report, visually inspects the results, handles image annotations, and tests the pipeline with a simple RAG implementation. The full-text tutorial and source code are available through MLExpert Pro, along with additional resources including the GitHub repository and evaluation methods for chunking strategies.
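As a rough idea of what the PDF conversion step can look like, here is a minimal sketch using Docling's default converter. It is not the tutorial's own code: the helper names `pdf_to_markdown` and `save_markdown` are made up for illustration, and the sketch assumes Docling is installed and that its default settings are good enough for the input document.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown with Docling's default converter settings.

    Hypothetical helper; the tutorial may configure OCR and image
    handling differently.
    """
    # Imported lazily so this module still loads where Docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


def save_markdown(markdown: str, out_path: str) -> Path:
    """Write the converted Markdown to disk so it can be inspected visually."""
    path = Path(out_path)
    path.write_text(markdown, encoding="utf-8")
    return path
```

Saving the Markdown to a file makes it easy to open next to the original PDF and check headings, tables, and image placeholders by eye.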

Syllabus

00:00 - Welcome
01:01 - Document processing pipeline
02:07 - Full-text tutorial and source code on MLExpert.io
02:41 - Docling
03:53 - PDF document sample
04:38 - Notebook setup
05:45 - PDF to Markdown: OCR, layout analysis, image to text
08:45 - Visual inspection
11:02 - Image annotations
14:37 - Chunking with Ollama and Gemma 3
19:58 - Contextual enrichment for retrieval
21:50 - Test the pipeline with simple RAG
24:42 - Conclusion
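
The chunking, contextual enrichment, and simple RAG steps in the syllabus can be sketched in plain Python. This is an illustration under stated assumptions, not the tutorial's implementation: heading-based splitting stands in for LLM-driven semantic chunking, `describe` is a hypothetical callable that would wrap a local model call (for example Gemma 3 served by Ollama), and word-overlap cosine similarity stands in for an embedding-based retriever.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def split_by_headings(markdown: str) -> List[str]:
    """Start a new chunk at each Markdown heading (stand-in for LLM chunking)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def enrich(chunks: List[str], document: str,
           describe: Callable[[str, str], str]) -> List[str]:
    """Prefix each chunk with a short context line produced by `describe`.

    `describe(document, chunk)` is hypothetical; in practice it would
    prompt a local LLM to situate the chunk within the whole document.
    """
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks whose word counts are most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks,
                    key=lambda c: cosine(q, Counter(tokenize(c))),
                    reverse=True)
    return ranked[:k]
```

To try it, pass a stub such as `lambda doc, chunk: "Context: sample report."` as `describe`, then call `retrieve` on the enriched chunks. Because the context line is part of the text being matched, it affects which chunk ranks first, which is the point of enriching chunks before indexing them.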