Attribution Graphs - Edge Weights and Pruning

Explore advanced techniques in large language model interpretability in this university lecture on attribution graphs, edge weights, and automated circuit extraction. Learn how edge weights in an attribution graph are used to analyze the internal mechanisms of LLMs, and how automated methods extract extremely sparse feature circuits that reveal how these models process information. Study the mathematical foundations of attribution graphs and their role in making black-box language models more transparent. See how pruning keeps the extracted circuits sparse while preserving the behavior they explain, and how vignettes (short case studies) help visualize and interpret the resulting circuits.