Learn about a novel approach to verifiable machine learning training that addresses hardware nondeterminism challenges in this Google TechTalk. Explore how increasing compute demands have led to commercial training services where clients outsource model training, creating new security concerns around training correctness and potential attacks such as data poisoning and backdoors.

Discover the limitations of existing verifiable training methods: cryptographic proof-based systems that struggle with scalability, and optimistic methods that fail because GPU nondeterminism prevents exact training replication. Understand the proposed solution, which combines higher-precision training, strategic rounding after intermediate computations, and an adaptive thresholding procedure to control nondeterminism across different hardware configurations.

Examine experimental results demonstrating exact training replication at FP32 precision across three NVIDIA GPU types (A40, Titan XP, RTX 2080 Ti) for both full training and fine-tuning of ResNet-50 and GPT-2 models. Analyze the significant performance improvements this method achieves, including up to a 140x reduction in storage costs and up to a 220x reduction in time costs compared to proof-based systems, making verifiable training more practical for real-world applications.
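The core "compute at higher precision, then round to the target precision" idea can be illustrated with a toy NumPy sketch. This is not the talk's actual algorithm (which also includes an adaptive thresholding procedure not shown here), and the function and parameter names are our own; it only demonstrates why higher-precision accumulation followed by rounding can absorb the order-dependent low-order bits that make plain FP32 reductions differ across devices:

```python
import numpy as np

def sum_with_rounding(values, chunk=256):
    """Accumulate chunk sums in float64 (higher precision than the
    float32 training precision), then round the final result back to
    float32. Hypothetical sketch of the 'compute high, round down'
    idea; the talk's real procedure operates on training computations
    and adds adaptive thresholding."""
    acc = np.float64(0.0)
    for i in range(0, len(values), chunk):
        acc += np.sum(values[i:i + chunk], dtype=np.float64)
    return np.float32(acc)

rng = np.random.default_rng(0)
v = rng.standard_normal(8192).astype(np.float32)

# Simulate two devices that reduce the same data in different orders.
fwd = sum_with_rounding(v)
rev = sum_with_rounding(v[::-1])
# fwd == rev: the float64 headroom absorbs the reordering error,
# so rounding to float32 yields bit-identical results.

# A plain sequential float32 reduction, by contrast, is order-sensitive:
naive_fwd = np.float32(0.0)
for x in v:
    naive_fwd += x
naive_rev = np.float32(0.0)
for x in v[::-1]:
    naive_rev += x
# naive_fwd and naive_rev typically disagree in their low-order bits.
```

The design point is that reordering a float64 sum perturbs the result by far less than one float32 ulp, so rounding to float32 after the high-precision accumulation makes the outcome independent of the device's reduction order.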