Overview
Learn how Shell researchers tackle the challenge of preserving institutional knowledge by adapting large language models for domain-specific applications in this 18-minute conference talk from Fully Connected London. Discover Shell's approach to fine-tuning off-the-shelf LLMs that lack understanding of domain-specific language, as NLP Researcher Injy Sarhan and Senior Researcher Avanindra Singh detail their development of a research assistant designed to make research more efficient.

Explore their comprehensive domain ingestion pipeline built with NVIDIA NeMo Curator and W&B Weave, covering essential processes including data preprocessing, domain adaptation, instruction tuning, and evaluation methodologies. Understand how domain-adapted LLMs achieve superior domain-specific reasoning and improved factual accuracy, and examine how W&B Weave's LLM-as-judge functionality and feedback loops aligned manual and auto-generated benchmarks to ensure model performance meets organizational standards.
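To make the LLM-as-judge idea concrete, here is a minimal sketch of such an evaluation loop. All names here are hypothetical stand-ins: the talk uses W&B Weave's judge functionality, whose actual API is not reproduced, and the toy `judge` below substitutes a simple term-overlap rubric for a real judge model's scoring prompt.

```python
# Hypothetical LLM-as-judge sketch: score model answers against a benchmark.
# In practice the judge would be an LLM prompted with a grading rubric;
# here a deterministic term-overlap check stands in for illustration.

def judge(question, answer, reference):
    """Return 1.0 if the answer covers most key terms of the reference, else 0.0."""
    key_terms = set(reference.lower().split())
    answer_terms = set(answer.lower().split())
    overlap = len(key_terms & answer_terms) / max(len(key_terms), 1)
    return 1.0 if overlap >= 0.5 else 0.0

def evaluate(benchmark, model_fn):
    """Run a model over (question, reference) pairs and average the judge scores."""
    scores = [judge(q, model_fn(q), ref) for q, ref in benchmark]
    return sum(scores) / len(scores)

# Toy benchmark pairs (invented for this sketch).
benchmark = [
    ("What does the ingestion pipeline do?", "preprocess domain data"),
    ("What improves factual accuracy?", "domain adaptation"),
]

# Toy "model" that returns a fixed answer, standing in for a fine-tuned LLM.
model = lambda q: "It helps preprocess domain data and adapt models"

print(evaluate(benchmark, model))  # average judge score across the benchmark
```

A feedback loop of the kind described in the talk would compare these automatic scores against human-labeled scores on the same examples and adjust the judge's rubric until the two agree.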
Syllabus
Multi-domain large language model adaptation using synthetic data generation - Shell @ FC London '25
Taught by
Weights & Biases