Overview
Learn to deploy and serve the Orpheus Text-to-Speech model using vLLM with continuous batching capabilities in this technical tutorial. Set up a demonstration environment using a one-click template from Runpod, then explore running inference on both fine-tuned and default Orpheus models. Discover the technical implementation details of how vLLM integrates with Orpheus, including the process of decoding audio tokens from text input. Compare inference results between different model configurations, including considerations for fp8 precision and fine-tuning quality. Access the accompanying one-click-llms repository to follow along with the practical implementation steps for serving text-to-speech models efficiently.
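The serving setup described above exposes the model through vLLM's OpenAI-compatible HTTP API. A minimal sketch of building a request payload for such a server is shown below; the voice-prefix prompt format ("tara: ...") and the model identifier are assumptions for illustration based on the default voice mentioned in the video, not a confirmed API contract.

```python
import json

def build_tts_request(text: str, voice: str = "tara",
                      model: str = "canopylabs/orpheus-3b-0.1-ft") -> str:
    """Build a JSON payload for vLLM's /v1/completions endpoint.

    Assumptions: the served Orpheus model expects the target voice as a
    prefix in the prompt, and the model name matches what the server
    was launched with. Adjust both to your deployment.
    """
    payload = {
        "model": model,
        "prompt": f"{voice}: {text}",   # voice-prefix format (assumed)
        "max_tokens": 2048,             # audio tokens are verbose; leave headroom
        "temperature": 0.6,
    }
    return json.dumps(payload)
```

You would POST this body to `http://<host>:8000/v1/completions` on the Runpod instance; with continuous batching, many such requests can be in flight concurrently.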
Syllabus
0:00 Serving Orpheus Text-to-Speech model with continuous batching
0:44 Setup Demo with a one-click template from Runpod
4:12 Running inference on a fine-tuned model (poor quality; consider avoiding fp8 and fine-tuning further)
5:25 Inference on the default Orpheus model, “tara”
7:37 How vLLM works with Orpheus and how to decode audio tokens
12:38 Conclusion and Resources
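The 7:37 segment covers how audio tokens are decoded from vLLM's text output. A rough sketch of the first step is below: pulling integer audio-token IDs out of the generated text so they can be passed to an audio decoder. The `<custom_token_N>` format and helper name are illustrative assumptions, not the confirmed Orpheus token scheme.

```python
import re

# Assumed format: the model emits audio as special tokens like
# "<custom_token_N>" in its text output; the integer IDs are later
# converted into codec codes and decoded into a waveform.
AUDIO_TOKEN = re.compile(r"<custom_token_(\d+)>")

def extract_audio_codes(generated_text: str) -> list[int]:
    """Extract the integer audio-token IDs from the model's text output."""
    return [int(m) for m in AUDIO_TOKEN.findall(generated_text)]

sample = "<custom_token_1024><custom_token_57><custom_token_888>"
codes = extract_audio_codes(sample)  # [1024, 57, 888]
```

In the full pipeline these IDs would then be offset-corrected and grouped into frames for the neural audio codec; see the video and the one-click-llms repository for the actual implementation.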
Taught by
Trelis Research