

Deploying Deep Learning: Quantization, Serving, and Edge AI

Board Infinity via Coursera

Overview

Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.

Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff. Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns. Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices. Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.

By the end of this course, you will:

  • Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
  • Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
  • Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
  • Build, benchmark, and containerize an end-to-end production-ready inference API

Disclaimer: This is an independent educational resource created by Board Infinity for informational and educational purposes only. This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program.
All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identification and comparison purposes.
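To give a flavor of the quantization topics above: the sketch below shows symmetric per-tensor INT8 round-to-nearest quantization, the basic idea underlying the INT8 schemes the course covers. AWQ and GPTQ add activation-aware scaling and error-compensating weight updates on top of this, none of which is shown here; this is a minimal illustration in pure Python, not course material.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization (round-to-nearest).
# AWQ/GPTQ refine this basic scheme; they are NOT implemented here.

def quantize_int8(weights):
    """Map float weights to int8 codes plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0  # symmetric range [-127, 127]
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [x * scale for x in q]

weights = [0.82, -1.27, 0.003, 0.5, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-to-nearest error is bounded by half a quantization step (scale / 2),
# which is the accuracy side of the accuracy-latency tradeoff the course discusses.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-12
```

Real INT4 schemes use the same idea with only 16 levels, which is why calibration tricks like AWQ's per-channel scaling matter much more there.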

Syllabus

  • Model Compression, Quantization & Latency Optimization
    • Learn model compression fundamentals, memory profiling, and modern INT8/INT4 quantization techniques including AWQ and GPTQ to optimize models for production inference.
  • High-Throughput Serving - vLLM, PagedAttention & Triton
    • Master production-grade serving engines including vLLM with PagedAttention and NVIDIA Triton for scaling inference across GPUs and nodes.
  • ONNX, Llama.cpp & Edge / CPU Deployment
    • Export models to ONNX for interoperability, deploy LLMs on CPU and edge devices with Llama.cpp and GGUF, and build multimodal pipelines with CLIP and LLaVA.
  • Final Project - The Edge-Ready API (Quantize to Serve to Benchmark)
    • Apply all course concepts in a final project to quantize a fine-tuned model, serve it via vLLM, benchmark it, and package it for cloud and edge deployment.
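The final project's "benchmark" step can be sketched as a small latency/throughput harness. The `generate()` stub below is a placeholder of my own, not part of the course; in the capstone you would replace it with a real call to your vLLM or ONNX Runtime endpoint.

```python
# Hedged sketch of a benchmarking harness: per-request latency percentiles
# plus overall throughput. generate() is a stand-in for a real inference call.
import time
import statistics

def generate(prompt: str) -> str:
    """Placeholder for a real inference call (e.g. an HTTP request to vLLM)."""
    time.sleep(0.001)  # simulate ~1 ms of model latency
    return prompt.upper()

def benchmark(n_requests: int = 50):
    latencies = []
    start = time.perf_counter()
    for i in range(n_requests):
        t0 = time.perf_counter()
        generate(f"request {i}")
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": statistics.quantiles(latencies, n=20)[-1] * 1000,  # 95th percentile
        "throughput_rps": n_requests / elapsed,
    }

stats = benchmark()
assert stats["p50_ms"] >= 1.0  # each stub call sleeps at least 1 ms
```

For a serving engine with continuous batching such as vLLM, you would additionally issue requests concurrently, since sequential requests understate achievable throughput.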

Taught by

Board Infinity

