Learn how to dramatically reduce model startup latency by implementing Triton kernel caching with OCI container images in this conference talk from DevConf.US 2025. Discover a novel approach to solving a persistent bottleneck in modern inference workloads: just-in-time (JIT) compilation of custom Triton kernels. Explore the technical implementation of wrapping Triton kernel caches in OCI container images, creating reusable, portable container layers that package Triton-generated LLVM kernels. Watch a live demonstration of a working prototype that enables "hot start" containers deployable directly to Kubernetes, bypassing costly JIT compilation at startup. Gain practical techniques for optimizing cold starts in ML infrastructure, understand the benefits for OSS compiler workflows, and learn strategies for deploying models at scale using Triton-lang optimizations.
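To illustrate the general idea (not the talk's actual prototype), the approach could be sketched as a multi-stage container build: a build stage warms Triton's on-disk kernel cache by compiling the kernels once, and the final image copies that cache in as its own reusable layer. `TRITON_CACHE_DIR` is Triton's real environment variable for relocating its cache; the base image tag and the `warmup.py` script are illustrative assumptions.

```dockerfile
# Sketch only: bake a pre-warmed Triton kernel cache into an OCI image layer
# so runtime containers can "hot start" without JIT compilation.

# --- Build stage: compile kernels once and populate the cache ---
FROM nvcr.io/nvidia/pytorch:24.04-py3 AS warmup
# TRITON_CACHE_DIR redirects Triton's compiled-kernel cache to a known path.
ENV TRITON_CACHE_DIR=/opt/triton-cache
# warmup.py (hypothetical) invokes each custom Triton kernel so its
# compiled artifacts land in the cache directory.
COPY warmup.py /tmp/warmup.py
RUN python /tmp/warmup.py

# --- Runtime stage: reuse the cache as a portable layer ---
FROM nvcr.io/nvidia/pytorch:24.04-py3
ENV TRITON_CACHE_DIR=/opt/triton-cache
# This COPY produces the reusable container layer holding the
# Triton-generated kernels; at startup, Triton finds cache hits
# instead of recompiling.
COPY --from=warmup /opt/triton-cache /opt/triton-cache
```

One caveat worth noting for any real deployment: cached kernels are specific to the Triton version and target GPU architecture, so the baked layer is only a valid cache hit on matching hardware and software.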