Overview
Learn to build a production-ready LLM deployment stack that goes beyond a basic Python script wrapped in a Docker container. Explore the challenges of deploying Large Language Models to production, including high latency, security vulnerabilities, and a lack of monitoring visibility, and see why simple containerized Python scripts fail under those conditions.

Construct a complete inference stack on consumer GPUs: vLLM for efficient model serving, nginx as load balancer and reverse proxy, and Prometheus with Grafana for monitoring and observability. Configure Docker Compose to orchestrate the services, set up nginx to handle production traffic, and track performance metrics and system health. Follow along with a practical virtual instance setup and a live load test using a LangChain client to validate the deployment's performance under realistic conditions, and understand the architectural decisions behind scalable, secure, and observable LLM deployments.
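To give a feel for the load-testing step, here is a minimal sketch of a concurrent load-test harness. It is not the course's code: the names `run_load_test` and `fake_send` are hypothetical, and `fake_send` is a stand-in that only sleeps. In a real run, `send` would wrap a client call (for example a LangChain chat model) aimed at the endpoint exposed through nginx.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_load_test(
    send: Callable[[str], str],
    prompts: List[str],
    concurrency: int = 4,
) -> Dict[str, float]:
    """Send every prompt through `send` on a thread pool and summarize latency."""

    def timed_call(prompt: str) -> float:
        # Measure wall-clock time for a single request.
        start = time.perf_counter()
        send(prompt)
        return time.perf_counter() - start

    # The pool caps the number of in-flight requests at `concurrency`.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, prompts))

    return {
        "requests": float(len(latencies)),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }


def fake_send(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; assumption for illustration only.
    time.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(run_load_test(fake_send, ["hello"] * 8, concurrency=4))
```

Passing the request function in as a parameter keeps the harness independent of any particular client library, so the same loop can be pointed at a local stub first and at the deployed stack afterwards.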
Syllabus
00:00 - Why Python scripts fail in production
01:47 - The stack architecture: vLLM, nginx, Grafana
04:42 - Docker Compose definition
08:35 - Nginx config
09:08 - Monitoring with Prometheus and Grafana config
10:13 - Virtual instance setup
13:54 - Live load test with LangChain client
Taught by
Venelin Valkov