GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Massachusetts Institute of Technology via YouTube
Overview
In this 31-minute talk from the Massachusetts Institute of Technology, explore GLOV, a framework that reduces the manual effort of crafting effective prompts for vision-language models (VLMs). Learn how large language models (LLMs) can act as implicit optimizers that iteratively refine VLM prompts based on downstream task performance, without human intervention. Discover how embedding-space steering vectors guide the LLM's generation during optimization, biasing it toward more effective prompts. See evaluations across multiple downstream tasks and VLM architectures that demonstrate GLOV's strong generalization. Presented by Jehanzeb Mirza, a postdoc in MIT CSAIL's Spoken Language Systems group whose research focuses on multi-modal learning and fine-grained understanding.
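The optimization loop described above can be sketched as follows. This is a minimal toy illustration, not the GLOV implementation: `evaluate_prompt` stands in for scoring a prompt with a real VLM on task data, and `llm_propose` stands in for querying an LLM conditioned on the best-scoring prompts so far (in GLOV, that LLM call is additionally biased via embedding-space steering vectors; here the "proposal" is a simple recombination of top prompts). All function names and the keyword-matching scoring rule are hypothetical.

```python
import random

def evaluate_prompt(prompt, task_data):
    """Toy stand-in for measuring VLM accuracy on a downstream task:
    score a prompt by the fraction of task keywords it mentions."""
    keywords = task_data["keywords"]
    return sum(1 for k in keywords if k in prompt) / len(keywords)

def llm_propose(ranked_history, rng):
    """Toy stand-in for the LLM proposal step: recombine words from the
    two best-scoring prompts into a new candidate. The real method
    steers an actual LLM's generation toward high-scoring prompts."""
    top = [p for p, _ in ranked_history[:2]]
    words = list(dict.fromkeys(" ".join(top).split()))  # dedupe, keep order
    rng.shuffle(words)
    return " ".join(words)

def glov_style_optimize(seed_prompts, task_data, iterations=5, seed=0):
    """Iteratively propose and score prompts; return the best (prompt, score)."""
    rng = random.Random(seed)
    history = [(p, evaluate_prompt(p, task_data)) for p in seed_prompts]
    for _ in range(iterations):
        history.sort(key=lambda x: x[1], reverse=True)  # rank by task score
        candidate = llm_propose(history, rng)
        history.append((candidate, evaluate_prompt(candidate, task_data)))
    history.sort(key=lambda x: x[1], reverse=True)
    return history[0]

# Hypothetical task: prompts scored by coverage of three keywords.
task = {"keywords": ["photo", "bird", "species"]}
seeds = ["a photo of a bird", "an image of a species"]
best_prompt, best_score = glov_style_optimize(seeds, task)
```

The key idea the sketch preserves is the feedback loop: each proposal is conditioned on which earlier prompts scored well, so the search is biased toward effective prompts rather than sampling blindly.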
Syllabus
Jehanzeb Mirza, GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Taught by
MIT Embodied Intelligence