Do Large Language Models Need Statistical Foundations?

Mon Apr 21, 2025 4:00 p.m.—5:00 p.m.



Speaker

Weijie Su, Wharton School, University of Pennsylvania

In this talk, we advocate for the development of rigorous statistical foundations for large language models (LLMs). We begin by elaborating on two key features that motivate statistical perspectives on LLMs: (1) the probabilistic, autoregressive nature of next-token prediction, and (2) the complexity and black-box nature of Transformer architectures. To illustrate how statistical insights can directly benefit LLM development and applications, we present two concrete examples. First, we demonstrate statistical inconsistencies and biases arising from the current approach to aligning LLMs with human preferences. We propose a regularization term for aligning LLMs that is both necessary and sufficient to ensure consistent alignment. Second, we introduce a novel statistical framework for analyzing the efficiency of watermarking schemes, with a focus on a watermarking scheme developed by OpenAI, for which we derive optimal detection rules that outperform existing ones. Collectively, these findings showcase how statistical insights can address pressing challenges in LLMs while simultaneously illuminating new research avenues for the broader statistical community to advance responsible generative AI research. This talk is based on arXiv:2405.16455, 2404.01245, and 2503.10990.
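The probabilistic, autoregressive nature of next-token prediction mentioned in the abstract can be sketched with a toy sampler. This is a minimal illustration only: the vocabulary and the pseudo-logits function below are invented stand-ins for a real Transformer's outputs, assumed solely for demonstration.

```python
import math
import random

# Illustrative assumption: a tiny vocabulary standing in for a real tokenizer.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_logits(context):
    # A real LLM computes next-token logits with a Transformer; here we
    # derive deterministic pseudo-logits from the context for illustration.
    local = random.Random(hash(tuple(context)))
    return [local.gauss(0.0, 1.0) for _ in VOCAB]

def sample_next(context, rng, temperature=1.0):
    # Draw one token index from softmax(logits / temperature) -- the
    # probabilistic step underlying each token an LLM emits.
    z = [x / temperature for x in toy_logits(context)]
    m = max(z)
    weights = [math.exp(v - m) for v in z]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# Autoregressive generation: each token is sampled from a conditional
# distribution given all previously generated tokens.
rng = random.Random(42)
tokens = []
for _ in range(5):
    tokens.append(sample_next(tokens, rng))
print(" ".join(VOCAB[t] for t in tokens))
```

Because every token is a random draw from a conditional distribution, the same prompt can yield different outputs; this randomness is also what watermarking schemes, such as the one analyzed in the talk, exploit for detection.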

3:30 p.m. - Pre-talk meet-and-greet tea time, 219 Prospect Street, 13th floor; light snacks and beverages will be available in the kitchen area.

This event was also available via webcast.