
Quantization at the Edge - Making a 4GB Model Run on 1GB RAM

DevConf via YouTube

Overview

Learn practical techniques for deploying large language models on memory-constrained edge devices in this conference talk from DevConf.IN 2026. Discover how to overcome the fundamental challenge of running generative AI on affordable ARM boards that typically have less than 2GB of RAM, where traditional cloud inference introduces latency, privacy concerns, and connectivity issues.

Explore aggressive quantization methods that go beyond standard 8-bit or 4-bit approaches, including operator fusion, KV-cache trimming, and runtime memory pooling techniques designed for sub-2GB RAM environments. Master the use of open-weight models, offline quantization processes, and lightweight inference runtimes optimized for ARM CPUs to achieve dramatic memory reduction while maintaining usable model accuracy. Watch a live demonstration showing how to load and run a quantized 4GB model on a basic 1GB device, proving the viability of privacy-friendly, low-cost AI deployments at the edge.

Gain insights valuable for embedded engineers, makers, AI practitioners, and cloud-edge architects looking to implement practical solutions for memory-constrained AI applications without relying on server-class hardware or cloud dependencies.
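The talk's exact pipeline isn't reproduced here, but the core memory arithmetic behind the title is easy to illustrate: quantizing weights from 32-bit floats down to 4-bit integers shrinks them 8x (a 4GB fp16 checkpoint shrinks 4x, landing near 1GB). Below is a minimal, hypothetical sketch of symmetric 4-bit quantization with two values packed per byte, using NumPy; real runtimes such as llama.cpp use per-block scales and fused kernels on top of this idea.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Symmetric 4-bit quantization: map floats to integers in [-8, 7]."""
    scale = float(np.abs(weights).max()) / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    # Pack two 4-bit values per byte to realize the memory saving.
    q_u = (q + 8).astype(np.uint8)            # shift to [0, 15]
    packed = (q_u[0::2] << 4) | q_u[1::2]     # assumes an even element count
    return packed, scale

def dequantize_4bit(packed: np.ndarray, scale: float) -> np.ndarray:
    """Unpack nibbles and rescale back to float32 approximations."""
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2], q[1::2] = hi, lo
    return q.astype(np.float32) * scale

# Toy "weight tensor": 1M float32 values (4 MB).
w = np.random.randn(1_000_000).astype(np.float32)
packed, scale = quantize_4bit(w)
print(w.nbytes / packed.nbytes)   # → 8.0
err = np.abs(dequantize_4bit(packed, scale) - w).max()
```

The reconstruction error is bounded by half a quantization step (`scale / 2`); production quantizers reduce it further by computing a separate scale per small block of weights rather than one global scale as in this sketch.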

Syllabus

Quantization at the Edge: Making a 4GB Model Run on 1GB RAM - DevConf.IN 2026

Taught by

DevConf

