NVFP4 with CUDA 13 Full Tutorial - 100%+ Speed Gain, Quality Comparison and New Cheap Cloud SimplePod
Software Engineering Courses - SE Courses via YouTube
Syllabus
New ComfyUI installer: CUDA 13, Torch 2.9.1, Triton + attention libs
NVFP4 speedup claims vs real tests; why CUDA 13 enables new models
Prebuilt FlashAttention/SageAttention/xFormers for many GPUs on Windows + Linux
Quality roadmap: FLUX2 Dev, Z Image Turbo, FLUX Dev BF16/FP8/GGUF/NVFP4
Downloader adds NVFP4: FLUX2 Dev, FLUX Kontext/Dev, Z Image Turbo
SimplePod AI intro: RunPod-style pods, cheaper rates, permanent storage
Musubi Tuner FP8 Scaled: quality myths vs GGUF + why scaled matters
Quantization & precision FP32/BF16/FP8/GGUF + Qwen3 low-VRAM encoders
ComfyUI v73 zip: CUDA 13 included; update NVIDIA drivers only (v72 deprecated)
Update steps: overwrite zip, delete venv, run install/update .bat
Python: 3.10 recommended (3.10-3.13 supported); fresh install vs update
New installer flow: uv for speed, standalone use, backend libs auto-detected
Stability flags: --cache-none vs --disable-smart-memory for OOM/stuck fixes
SwarmUI presets: 32 presets supported; drag/drop + auto model downloader
Update SwarmUI model downloader: extract zip + overwrite
Download bundles/models: Z Image Turbo Core + NVFP4 options
Update/launch SwarmUI; point to updated ComfyUI backend + set args
Live gen test: Z Image Turbo BF16 @1536x1536
Switch to NVFP4: VRAM cache behavior; 1024x1024 test
FLUX2 Dev quality: FP8 Scaled vs NVFP4 side-by-side comparisons
Speed chart: FLUX2 NVFP4 about 193% faster than FP8 Scaled
Z Image Turbo quality: BF16 vs NVFP4 vs FP8 Scaled quantization
FLUX Dev: FP8 Scaled approx GGUF Q8; NVFP4 currently shows degradation
What precision means + model size examples: FP32/BF16/FP8 Scaled/NVFP4
Practical recommendations: BF16 best; avoid FP16; raw FP8 vs FP8 Scaled
GGUF explained: block quant, slower runtime; use only when RAM is too low
Precision hierarchy recap + when to pick FP8 mixed/scaled over GGUF
SimplePod setup: register, add credits, open template link
Template config + RunPod price comparison: disk, ports, GPU selection
Persistent volume: create + mount to /workspace
Launch RTX Pro 6000 pod; SimplePod vs RunPod pricing differences
Temp vs persistent disk: deleting instance wipes temp data - backup!
JupyterLab: upload zips, apt install zip, unzip ComfyUI into /workspace
Run install script; unzip SwarmUI; start the model downloader
Downloader path for ComfyUI + folder structure; download Z Image Turbo bundle
Start ComfyUI; confirm CUDA 13 + Torch 2.9.1; connect via port 3000 Direct
Preset demo: Z Image Turbo Quality 1; fix VAE path; monitor VRAM
File Browser Direct: download outputs/models fast; upload files back
Restart server; install/start SwarmUI; open Cloudflared URL
SwarmUI backend: /workspace/ComfyUI/main.py + args; import presets
Download FLUX2 Core + NVFP4; share model paths between SwarmUI & ComfyUI
FLUX2 NVFP4 generation @2048x2048; VRAM usage + step speed
Cloud GPU pitfall: diagnosing a power-capped GPU
Resume: re-run template with the volume attached; reconnect fast
Wrap-up: SimplePod pros include direct/secure connections and cheaper storage
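The precision ladder covered in the syllabus (FP32 → BF16 → FP8 Scaled → NVFP4) comes down to bits stored per weight, which is why each step roughly halves the checkpoint size. A minimal sketch of that size arithmetic, assuming a hypothetical 12-billion-parameter model and ignoring the small per-block scaling metadata that quantized formats add:

```python
# Rough checkpoint-size arithmetic for the precision formats the course compares.
# Real files run slightly larger: quantized formats (FP8 Scaled, NVFP4) also
# store per-block scale factors, which this sketch ignores.
BITS_PER_WEIGHT = {"FP32": 32, "BF16": 16, "FP8": 8, "NVFP4": 4}

def approx_size_gb(n_params: float, fmt: str) -> float:
    """Approximate weight-file size in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

# Example: a hypothetical 12B-parameter diffusion transformer
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>6}: {approx_size_gb(12e9, fmt):.1f} GB")
```

This is why NVFP4 checkpoints fit on much smaller GPUs: a model that needs ~48 GB in FP32 drops to ~6 GB of weights at 4 bits, before any scale-factor overhead.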
Taught by
Software Engineering Courses - SE Courses