AI Orchestration: From local models to cloud

Pragmatic AI Labs via Coursera

Overview

Learn to orchestrate AI systems across local and cloud environments through hands-on infrastructure setup, model deployment, and workflow integration. You will build a prompt engineering pyramid, from basic prompts to chain-of-thought reasoning implemented in Rust, then evaluate six decision factors for choosing between local and cloud models, including latency, throughput, cost, and privacy.

The course covers local AI infrastructure in depth: running Ollama with custom Modelfiles for task-specific assistants, deploying llamafile for zero-dependency portable inference, compiling Rust Candle with CUDA for GPU-accelerated local inference, and optimizing local RAG with caching strategies. You will also configure a complete AI workstation, using tmux for session management and nvidia-smi and Zenith for GPU monitoring, with an emphasis on NVIDIA GPU optimization.

The final module covers cloud workflows, including AWS Spot Instances for cost-effective GPU compute, Hugging Face model discovery and download, and GitHub AI model integration. By completing this course, you will be able to set up local AI infrastructure, deploy models across local and cloud environments, and design orchestration workflows that balance cost, privacy, and performance.
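
As a taste of the local-first workflow, here is a minimal sketch of querying a locally running Ollama server from Rust over its REST API. It assumes `ollama serve` is running on its default port (11434) and that a model has already been pulled (the model name `llama3` here is an assumption), and it uses the `reqwest` and `serde_json` crates; this is an illustration, not code from the course.

```rust
// Minimal sketch: one-shot completion against a local Ollama server.
// Assumed Cargo.toml deps:
//   reqwest = { version = "0.12", features = ["blocking", "json"] }
//   serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // Ollama's /api/generate endpoint takes a model name and a prompt;
    // stream=false returns the whole completion as a single JSON object.
    let body = json!({
        "model": "llama3", // assumption: any locally pulled model works here
        "prompt": "Explain chain-of-thought prompting in one sentence.",
        "stream": false
    });

    let resp: serde_json::Value = client
        .post("http://localhost:11434/api/generate")
        .json(&body)
        .send()?
        .json()?;

    // The completion text comes back in the "response" field.
    println!("{}", resp["response"].as_str().unwrap_or(""));
    Ok(())
}
```

Keeping the request shape identical while swapping the endpoint for a cloud-hosted model is, in miniature, the local-versus-cloud tradeoff the course explores.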

Syllabus

  • Orchestration Fundamentals
    • A comprehensive course covering prompt engineering with chain-of-thought reasoning, local inference runtimes (Ollama, llamafile, Candle), GPU workstation configuration, and cost-optimized cloud deployment with AWS Spot instances.
  • Local AI Infrastructure
    • Covers local vs cloud model tradeoffs, caching strategies, local RAG optimization, Ollama with custom Modelfiles, llamafile portable deployment, and Candle GPU-accelerated Rust inference.
  • Workstation and Cloud Workflows
    • Covers tmux session management, nvidia-smi and Zenith GPU monitoring, local workstation orchestration, AWS Spot instance deployment, Hugging Face and GitHub AI model workflows, and Rust project structure.
  • Capstone
    • Head-to-head comparison of Ollama vs `apr` ([paiml/aprender](https://github.com/paiml/aprender)) running Qwen2.5-Coder-1.5B on the same prompt suite, same hardware. Build a chain-of-thought routing engine that selects runtimes based on task complexity and validation requirements, with cost analysis spanning local workstations, Spot Instances, and Bedrock. (A rough sketch of the routing idea appears after this syllabus.)
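
As a rough illustration of the capstone's routing idea, the sketch below picks a runtime from task complexity, validation needs, and privacy constraints. Every name and threshold here is hypothetical and not taken from the course materials.

```rust
// Toy routing engine: pick the cheapest runtime that satisfies a task's
// constraints. All types, variants, and thresholds are illustrative only.
#[derive(Debug)]
enum Runtime {
    OllamaLocal,  // local workstation inference
    SpotInstance, // AWS Spot GPU for heavier batch work
    Bedrock,      // managed endpoint for output that must be validated
}

struct Task {
    complexity: u8,         // 0-10, rough estimate of reasoning depth
    needs_validation: bool, // output must pass an external check
    privacy_sensitive: bool,
}

fn route(task: &Task) -> Runtime {
    if task.privacy_sensitive {
        return Runtime::OllamaLocal; // sensitive data never leaves the box
    }
    match (task.complexity, task.needs_validation) {
        (0..=3, false) => Runtime::OllamaLocal, // simple, unvalidated: stay local
        (_, true) => Runtime::Bedrock,          // validated output: managed cloud
        _ => Runtime::SpotInstance,             // complex batch work: Spot GPU
    }
}

fn main() {
    let task = Task { complexity: 7, needs_validation: false, privacy_sensitive: false };
    println!("{:?}", route(&task)); // -> SpotInstance
}
```

A full version would also fold in the cost model spanning local workstations, Spot Instances, and Bedrock that the capstone calls for.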

Taught by

Alfredo Deza and Noah Gift
