AI Inference Workloads - Solving MLOps Challenges in Production
Toronto Machine Learning Series (TMLS) via YouTube
Overview
Syllabus
Intro
Agenda
The Machine Learning Process
Deployment Types for Inference Workloads
Machine Learning is Different from Traditional Software Engineering
Low Latency
High Throughput
Maximize GPU Utilization
Embedding ML Models into Web Servers
Decouple Web Serving and Model Serving
Model Serving System on Kubernetes
Multi-Instance GPU (MIG)
Run:AI's Dynamic MIG Allocations
Run 3 instances of type 2g.10gb
Valid Profiles & Configurations
Serving on Fractional GPUs
A Game Changer for Model Inferencing
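The MIG sections above mention running three instances of profile 2g.10gb and the notion of valid profile configurations. As a rough illustration only (assuming an NVIDIA A100 40GB, which exposes 7 compute slices and roughly 40 GB of memory; the helper names here are hypothetical, not from the talk), a small check of whether a requested MIG layout fits the GPU might look like:

```python
def parse_profile(name):
    """Split a MIG profile name like '2g.10gb' into (slices, memory_gb)."""
    slices_part, mem_part = name.split(".")
    return int(slices_part.rstrip("g")), int(mem_part.rstrip("gb"))

def fits_on_a100_40gb(profiles):
    """Return True if the requested instances fit within 7 slices / 40 GB.

    Assumption: A100 40GB capacity; real MIG placement also has
    per-profile placement constraints that this sketch ignores.
    """
    total_slices = sum(parse_profile(p)[0] for p in profiles)
    total_mem = sum(parse_profile(p)[1] for p in profiles)
    return total_slices <= 7 and total_mem <= 40

# Three 2g.10gb instances use 6 slices and 30 GB, so they fit:
print(fits_on_a100_40gb(["2g.10gb"] * 3))  # True
# Three 3g.20gb instances would need 9 slices, so they do not:
print(fits_on_a100_40gb(["3g.20gb"] * 3))  # False
```

This captures only the coarse capacity arithmetic; in practice, valid configurations are defined by NVIDIA's per-GPU placement tables, which is presumably what the "Valid Profiles & Configurations" chapter covers.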
Taught by
Toronto Machine Learning Series (TMLS)