Overview
Syllabus
- Introduction to serving multiple models on a single GPU
- Overview of using LoRA adapters as clip-ons
- Video structure overview
- Theory of LoRA for inference
- Explanation of LoRA low-rank adapters
- Benefits of using LoRA for training
- Practical implementation of LoRA loading
- Explanation of GPU VRAM usage and model loading
- Managing adapter downloads and storage
- Basic LoRAX Implementation
- Setting up the environment
- Running inference with LoRAX
- Setting up an SSH connection to RunPod
- Advanced vLLM Implementation
- Building the proxy server
- Redis implementation for adapter management
- Starting the server
- Testing the service
- FineTuneHost.com service demonstration
- Conclusion and resource overview
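The core idea behind the syllabus topics on LoRA theory and adapter loading can be sketched in a few lines. A LoRA adapter replaces a full weight update with a low-rank product B·A, so the adapter is tiny relative to the base weight and can be applied either merged into the weight or on the fly at inference time. This is a minimal numpy sketch under assumed, hypothetical dimensions (d=1024, rank r=8), not code from the video:

```python
import numpy as np

# Hypothetical sizes for illustration: a 1024x1024 base weight, rank-8 adapter.
d, r = 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # LoRA down-projection (r x d)
B = np.zeros((d, r))                     # LoRA up-projection (d x r), zero-init so the adapter starts as a no-op
x = rng.standard_normal(d)

# Merged and unmerged application give the same output,
# so a server can keep W loaded once and swap adapters per request.
y_merged = (W + B @ A) @ x
y_unmerged = W @ x + B @ (A @ x)
assert np.allclose(y_merged, y_unmerged)

# Storage cost: the adapter holds 2*d*r parameters vs d*d for the full weight.
adapter_params = A.size + B.size
full_params = W.size
print(adapter_params / full_params)  # → 0.015625, i.e. ~1.6% of the base layer
```

This size ratio is what makes the "clip-on" framing work: many adapters fit in VRAM alongside one copy of the base model.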
Taught by
Trelis Research