YouTube

Automated Evaluation of LLM Apps with Azure AI-Generative SDK

Visual Studio Code via YouTube

Overview

Explore automated evaluation techniques for Large Language Model (LLM) applications using the azure-ai-generative SDK in this 27-minute conference talk from Python Data Science Day. Learn about different types of LLM apps, including prompt-only and Retrieval Augmented Generation (RAG) models. Discover how to assess answer quality, implement LLM Ops, and experiment with quality factors. Dive into the AI RAG Chat Evaluator tool, understand the importance of ground truth data, and explore evaluation approaches. Gain insights on improving data sets and future steps in LLM app development. Access valuable resources, including slides, demos, and repositories to enhance your understanding of automated LLM app evaluation.
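The workflow the talk describes (a ground truth data set of questions plus automated scoring of the app's answers) can be sketched in plain Python. This is a minimal, hypothetical illustration of the idea, not the azure-ai-generative SDK's actual API; all function and variable names below are invented for the example:

```python
def keyword_overlap_score(answer, expected_keywords):
    """Fraction of expected keywords found in the answer (0.0 to 1.0)."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords) if expected_keywords else 0.0

# Hypothetical ground truth: questions paired with keywords a good answer
# should contain. Real evaluators typically use an LLM judge instead.
ground_truth = [
    {"question": "What does RAG stand for?",
     "expected_keywords": ["retrieval", "augmented", "generation"]},
]

def evaluate_app(app, ground_truth, threshold=0.5):
    """Run the app on each ground-truth question and score the answers."""
    results = []
    for item in ground_truth:
        answer = app(item["question"])
        score = keyword_overlap_score(answer, item["expected_keywords"])
        results.append({"question": item["question"],
                        "score": score,
                        "passed": score >= threshold})
    return results

# Example run with a stubbed-out app standing in for a real RAG chat app:
fake_app = lambda q: "RAG means Retrieval Augmented Generation."
for r in evaluate_app(fake_app, ground_truth):
    print(r)
```

A real pipeline would replace the keyword metric with LLM-judged metrics (relevance, groundedness, coherence) and run over a much larger ground truth set, which is the approach the AI RAG Chat Evaluator tool takes.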

Syllabus

Automated evaluation of LLM apps with the azure-ai-generative SDK
Types of LLM apps
Prompt-only LLM app
Retrieval Augmented Generation (RAG) LLM app
RAG flow
Are the answers high quality?
LLM Ops for LLM Apps
Experimenting with quality factors
AI RAG Chat Evaluator: https://aka.ms/rag/eval
Ground truth data
Evaluation
Evaluation approach
Improving ground truth data sets
Next steps

Taught by

Visual Studio Code
