Interrogating the Quality of a Language Model's Clinical Knowledge

Explore the critical challenges of evaluating large language models for healthcare applications in this 59-minute conference presentation by Dr. Danielle Bitterman of Brigham and Women's Hospital. Examine the gap between performance on standard knowledge benchmarks and real-world clinical safety and utility, focusing on methods to identify where language models may exhibit unfavorable or unsafe behavior in medical contexts. Learn about research approaches for understanding and predicting model risks, including potential biases that could impact patient care. Discover how analysis of pretraining data can help anticipate problematic model behaviors, and gain insight into standardized methodologies for robustly evaluating AI systems in healthcare settings. Understand the intersection of natural language processing and clinical medicine through the lens of a physician-scientist who specializes in AI for cancer care, with particular emphasis on transforming medical records into data-driven care systems and ensuring equitable implementation of healthcare AI.