
Date: February 13, 2026
Speaker: Paul Gölz
Title: Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
Abstract: After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy.
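For concreteness, here is a minimal sketch of the shape of these two objects, written out from the abstract alone (the notation is mine and the paper's exact formalization may differ): a user with utility function u compares responses y and y' through an individual BT model with temperature β, and distortion compares the best average utility achievable by any permissible policy against the average utility of the policy the alignment method learns.

\[
P_u(y \succ y') = \frac{e^{\beta u(y)}}{e^{\beta u(y)} + e^{\beta u(y')}},
\qquad
\mathrm{distortion}(\mathrm{ALG}) = \sup_{\text{instances}} \;
\frac{\max_{\pi \in \Pi} \; \mathbb{E}_{u \sim D}\, \mathbb{E}_{y \sim \pi}[u(y)]}
     {\mathbb{E}_{u \sim D}\, \mathbb{E}_{y \sim \pi_{\mathrm{ALG}}}[u(y)]}
\]

Here D stands for the population's distribution over utility functions and Π for the permissible policies, e.g. those within the allowed KL divergence of the reference policy.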
The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax-optimal distortion of (1/2+o(1))⋅β (for the BT temperature β), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer ≥(1−o(1))⋅β distortion already without a KL constraint, and e^Ω(β) or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
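To see the mechanism behind such gaps, consider a toy numerical sketch (hypothetical numbers chosen for illustration; it mimics the intuition, not the paper's constructions or bounds): a majority of users mildly prefers one response while a minority strongly prefers the other, so aggregated comparison data, and hence any single fitted BT/reward model, favors the majority's choice even though the other response has higher average utility. The bt_prob helper and all figures below are my own illustration, using numpy.

import numpy as np

# Toy instance: two candidate responses, two user types with different utilities.
responses = ["y1", "y2"]
utilities = np.array([
    [0.6, 0.4],   # majority type (60% of users): mildly prefers y1
    [0.0, 1.0],   # minority type (40% of users): strongly prefers y2
])
weights = np.array([0.6, 0.4])
beta = 20.0  # BT temperature; larger beta makes comparisons near-deterministic

def bt_prob(u, i, j, beta):
    """P(response i beats response j) under one user's Bradley-Terry model."""
    return 1.0 / (1.0 + np.exp(-beta * (u[i] - u[j])))

# Population-level probability that y1 beats y2 in pairwise comparison data.
p_y1_beats_y2 = np.dot(weights, [bt_prob(u, 0, 1, beta) for u in utilities])

# A method that fits a single BT/reward model to this data ranks y1 above y2
# whenever p_y1_beats_y2 > 1/2, and without a KL constraint concentrates on it.
single_model_choice = 0 if p_y1_beats_y2 > 0.5 else 1

avg_utility = weights @ utilities          # average utility of each response
utilitarian_choice = int(np.argmax(avg_utility))

print(f"P(y1 > y2) in comparison data: {p_y1_beats_y2:.3f}")
print(f"single-preference-model choice: {responses[single_model_choice]}, "
      f"avg utility {avg_utility[single_model_choice]:.3f}")
print(f"utilitarian-optimal choice:     {responses[utilitarian_choice]}, "
      f"avg utility {avg_utility[utilitarian_choice]:.3f}")
print(f"toy distortion (ratio):         "
      f"{avg_utility[utilitarian_choice] / avg_utility[single_model_choice]:.2f}")

In this instance the comparison data favors y1 (probability about 0.59), yet y2 has nearly twice the average utility, so the toy distortion is about 1.78; the paper's worst-case analysis makes this gap scale with β.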
Bio: Paul Gölz is an Assistant Professor in the School of Operations Research and Information Engineering at Cornell University. His research explores new approaches to democracy, equitable resource allocation, and AI systems for heterogeneous users. Algorithms developed in his work are now deployed to select citizens' assemblies around the world and to allocate refugees for a major US resettlement agency.