Explore a conference talk that tackles model startup latency in modern inference workloads by caching Triton kernels inside OCI container images. Learn how to overcome just-in-time (JIT) compilation delays when using custom kernels written in Triton: rather than recompiling on every boot, the approach wraps the Triton kernel cache in an OCI container image so the compiled kernels ship alongside the model.

Discover, through a working prototype demonstration, how Triton-generated LLVM kernels can be packaged into reusable, portable container layers that produce "hot start" containers deployable directly to Kubernetes. Understand how this bypasses costly JIT compilation and significantly reduces model startup time, making it particularly valuable for ML infrastructure builders, OSS compiler developers, and anyone deploying models at scale. Gain practical techniques for optimizing cold starts in models using Triton-lang, with insights applicable to containerized deployment environments and Kubernetes-based ML workflows.
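To make the warm-then-package idea concrete, here is a minimal sketch (not the speakers' exact tooling) of the first half of the workflow: at image-build time, a script compiles a Triton kernel once so the JIT artifacts land in a known cache directory, which can then be copied into an OCI image layer. `TRITON_CACHE_DIR` is Triton's cache-location environment variable; the cache path and the vector-add kernel are illustrative assumptions.

```python
import os

# Assumption: pin Triton's kernel cache to a known path so the compiled
# artifacts can later be copied into an OCI image layer. Must be set
# before triton is imported.
os.environ["TRITON_CACHE_DIR"] = "/opt/triton-cache"

import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Canonical Triton vector-add kernel, standing in for a model's
    # real custom kernels.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def warm_cache() -> None:
    # Launching the kernel once forces JIT compilation; the generated
    # artifacts (IR, compiled binaries, metadata) land in TRITON_CACHE_DIR.
    n = 1024
    x = torch.rand(n, device="cuda")
    y = torch.rand(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 256),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=256)


if __name__ == "__main__":
    warm_cache()
```

The second half is plain container plumbing: the populated `/opt/triton-cache` directory is added as a layer in the serving image (e.g., a `COPY` step in a container build), and the runtime is pointed at the same `TRITON_CACHE_DIR`, so pods scheduled on Kubernetes find a warm cache and skip JIT compilation at startup.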