Merlin - A Vision Language Foundation Model for 3D Computed Tomography

Learn about recent research in medical imaging through a Stanford University lecture in which PhD candidate Ashwin Kumar presents Merlin, a 3D vision language model designed for computed tomography interpretation. Discover how this resource-efficient model is trained on over 6 million CT images from 15,331 scans, paired with electronic health records and radiology reports, to perform a range of diagnostic and prognostic tasks. Explore the model's capabilities across six task types and 752 individual tasks, including zero-shot findings classification, phenotype classification, disease prediction, and 3D semantic segmentation. Understand how Merlin addresses the growing need for automated medical image interpretation amid radiologist shortages while achieving strong results with modest computational resources: it requires only a single GPU for training, whereas conventional models need hundreds.