Overview
Explore a mathematical framework for cost-effective dataset labeling that combines expert annotations with AI predictions, presented in this conference talk from Harvard's Center of Mathematical Sciences and Applications. Learn how to construct high-quality labeled datasets by supplementing expensive human annotations or experimental data with predictions from pre-trained AI models, while maintaining rigorous statistical guarantees. Discover the theoretical foundations behind "probably approximately correct labels" - a method that ensures, with high probability, that the overall labeling error stays small. Examine practical applications across three domains: text annotation with large language models, image classification with pre-trained vision models, and protein structure analysis with AlphaFold. Understand how this approach enables efficient dataset curation while preserving the reliability needed for machine learning applications. The talk was presented as part of the Workshop on Mathematical Foundations of AI by Stanford researcher Tijana Zrnic, in collaboration with Emmanuel Candès and Andrew Ilyas.
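The core idea, certifying when model predictions are trustworthy enough to stand in for human labels, can be illustrated with a simple confidence-threshold rule calibrated on a small human-labeled sample. This is a minimal sketch under assumptions of our own, not the algorithm from the talk: `pac_threshold` and the synthetic well-calibrated model are hypothetical, and a one-sided Hoeffding bound is applied per threshold without the multiple-testing correction a rigorous method would need.

```python
import numpy as np

def pac_threshold(cal_conf, cal_correct, eps=0.1, delta=0.1):
    """Smallest confidence threshold t such that, by a one-sided Hoeffding
    bound on the calibration sample, predictions with confidence >= t have
    error at most eps with probability at least 1 - delta."""
    for t in np.sort(np.unique(cal_conf)):
        mask = cal_conf >= t
        n = int(mask.sum())
        emp_err = 1.0 - cal_correct[mask].mean()
        margin = np.sqrt(np.log(1.0 / delta) / (2 * n))
        if emp_err + margin <= eps:
            return t
    return None  # nothing certifiable: route everything to human annotators

# Synthetic demo: confidences in [0.7, 1.0]; each prediction is correct
# with probability equal to its reported confidence (a well-calibrated model).
rng = np.random.default_rng(0)
conf = rng.uniform(0.7, 1.0, size=10_000)
correct = rng.uniform(size=10_000) < conf

# Pretend the first 2,000 items were labeled by human experts.
cal_conf, cal_correct = conf[:2_000], correct[:2_000]
t = pac_threshold(cal_conf, cal_correct, eps=0.1, delta=0.1)

auto = conf[2_000:] >= t            # auto-label these with the model
auto_error = 1.0 - correct[2_000:][auto].mean()
```

On this synthetic data the rule auto-labels the high-confidence portion of the remaining items while keeping their realized error well under the target `eps`, which is the flavor of guarantee the talk develops far more carefully.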
Syllabus
Tijana Zrnic | Probably Approximately Correct Labels
Taught by
Harvard CMSA