Explore the intersection of AI alignment and social choice theory in this seminar, which addresses the critical challenge of aligning AI systems with diverse human values. Learn how traditional reinforcement learning from human feedback (RLHF) methods often assume a single ground-truth preference, overlooking the heterogeneous nature of human preferences and potentially producing reward functions that fail to capture the full spectrum of human values.

Discover an innovative axiomatic approach to reward design that reveals how widely used methods such as the Bradley-Terry-Luce (BTL) model fail to meet basic axiomatic guarantees. Examine the novel linear social choice framework, which leverages the linear structure inherent in RLHF to derive aggregation rules with strong theoretical guarantees.

Understand the development of ensemble-based reward functions designed to preserve the diversity of human preferences rather than collapsing them into a monolithic reward function. Investigate pairwise calibrated rewards, where a distribution over multiple reward functions is calibrated directly to observed pairwise preferences, together with a theoretical proof that even small, outlier-free ensembles can accurately represent diverse preference distributions. Review empirical validation of practical training heuristics for learning such ensembles and their effectiveness in improving calibration for a more faithful representation of pluralistic values. Illustrative sketches of these ideas follow below.
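
To make the single-reward assumption concrete: under the BTL model, one scalar reward function r induces the preference probability P(a ≻ b) = sigmoid(r(a) − r(b)). The minimal Python sketch below is invented for illustration (the function name and toy values are not from the seminar); it shows why a 50/50 annotator split can only be matched by setting r(a) = r(b), which models the population as uniformly indifferent rather than genuinely divided.

```python
import math

def btl_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry-Luce probability that item a beats item b under a
    single scalar reward function: P(a > b) = sigmoid(r(a) - r(b))."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

# Toy scenario (invented): half of the annotators strongly prefer
# response a, the other half strongly prefer response b.
observed_pref_a = 0.5

# A single BTL reward can only reproduce the 50/50 split by assigning
# equal rewards -- indifference, not disagreement.
print(btl_preference_prob(1.0, 1.0), observed_pref_a)  # both 0.5
```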
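One way to read "pairwise calibrated": a distribution over reward functions is calibrated when the probability that a sampled reward ranks a above b matches the observed fraction of annotators preferring a to b. The sketch below is a hypothetical illustration under that reading, not the seminar's implementation; the two linear reward functions, the mixture weights, and the 70/30 split are all invented for the example.

```python
import numpy as np

# Hypothetical ensemble: each member is a linear reward w . x over features.
ensemble_weights = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ensemble_probs = [0.7, 0.3]  # mixture weights over the ensemble members

def ensemble_pref_rate(x_a: np.ndarray, x_b: np.ndarray) -> float:
    """Probability that a reward function drawn from the ensemble scores
    x_a strictly above x_b (ties counted as half)."""
    rate = 0.0
    for w, p in zip(ensemble_weights, ensemble_probs):
        r_a, r_b = w @ x_a, w @ x_b
        rate += p * (1.0 if r_a > r_b else 0.5 if r_a == r_b else 0.0)
    return rate

# Toy pair where annotators split 70/30: the first member prefers a,
# the second prefers b, and the mixture reproduces the split exactly.
x_a, x_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(ensemble_pref_rate(x_a, x_b))  # 0.7 -- matches the observed rate
```

A single BTL reward could only approximate this 70/30 split, whereas the two-member ensemble matches it exactly, which is the intuition behind preserving preference diversity rather than averaging it away.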
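The "improved calibration" claim also suggests a natural evaluation metric. One plausible choice, assumed here rather than taken from the seminar, is the mean absolute gap between predicted and observed preference rates over a set of comparisons:

```python
from typing import Callable, List, Tuple

def pairwise_calibration_error(
    predict_pref: Callable[[str, str], float],
    dataset: List[Tuple[str, str, float]],
) -> float:
    """Mean absolute gap between the model's predicted preference rate
    P(a preferred to b) and the empirically observed annotator rate.
    Lower is better calibrated; 0 means the (possibly ensemble-based)
    reward model reproduces every observed preference split exactly."""
    gaps = [abs(predict_pref(a, b) - observed) for a, b, observed in dataset]
    return sum(gaps) / len(gaps)

# Toy usage with an invented predictor and dataset:
def constant_predictor(a: str, b: str) -> float:
    return 0.6

toy_dataset = [("resp_a", "resp_b", 0.7), ("resp_c", "resp_d", 0.5)]
print(pairwise_calibration_error(constant_predictor, toy_dataset))  # 0.1
```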