Overview
Learn practical techniques for deploying large language models on memory-constrained edge devices in this conference talk from DevConf.IN 2026. Discover how to overcome the fundamental challenge of running generative AI on affordable ARM boards that typically have less than 2GB of RAM, where traditional cloud inference introduces latency, privacy concerns, and connectivity issues.

Explore aggressive quantization methods that go beyond standard 8-bit or 4-bit approaches, including operator fusion, KV-cache trimming, and runtime memory pooling techniques designed specifically for sub-2GB RAM environments. Master the use of open-weight models, offline quantization processes, and lightweight inference runtimes optimized for ARM CPUs to achieve dramatic memory reduction while maintaining usable model accuracy.

Watch a live demonstration showing how to load and run a quantized 4GB model on a basic 1GB device, proving the viability of privacy-friendly, low-cost AI deployments at the edge. Gain insights valuable for embedded engineers, makers, AI practitioners, and cloud-edge architects looking to implement practical solutions for memory-constrained AI applications without relying on server-class hardware or cloud dependencies.
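To make the core idea concrete, here is a minimal sketch of blockwise 4-bit weight quantization, the kind of scheme (one shared scale per small block of weights, int4 codes in [-8, 7]) that edge inference runtimes such as llama.cpp build on. This is an illustrative NumPy toy, not the talk's actual implementation; all function names are hypothetical, and real runtimes pack two 4-bit codes per byte and use more refined formats.

```python
import numpy as np

def quantize_4bit(weights, block_size=32):
    """Hypothetical sketch: blockwise 4-bit quantization.
    Each block of `block_size` floats is reduced to int4 codes plus one
    float32 scale, cutting memory roughly 8x versus float32 storage."""
    flat = weights.astype(np.float32).ravel()
    pad = (-len(flat)) % block_size            # pad so blocks divide evenly
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    # One scale per block maps its max magnitude onto the int4 range [-8, 7].
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                  # avoid division by zero
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_4bit(codes, scales, shape):
    """Reconstruct approximate float32 weights from codes and scales."""
    out = (codes.astype(np.float32) * scales).ravel()
    return out[: np.prod(shape)].reshape(shape)

# Example: quantize a random weight matrix and measure reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

The worst-case per-weight error is half a quantization step (about the block's max magnitude divided by 14), which is why small block sizes keep accuracy usable even at 4 bits.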
Syllabus
Quantization at the Edge: Making a 4GB Model Run on 1GB RAM - DevConf.IN 2026
Taught by
DevConf