Learning Large Softmax Mixtures with Warm Start EM

Mon Oct 28, 2024, 4:00 p.m.–5:00 p.m.

Speaker

Florentina Bunea, Cornell University

Mixed multinomial logits are discrete mixtures, introduced several decades ago to model the probability of choosing an attribute x_j ∈ R^L from p possible candidates in heterogeneous populations. The model has recently attracted attention in the AI literature under the name softmax mixtures, where it is routinely used in the final layer of a neural network to map a large number p of vectors in R^L to a probability vector. Despite its wide applicability and empirical success, statistically optimal estimators of the mixture parameters, obtained via algorithms whose running time scales polynomially in L, are not known. This paper provides a solution to this problem for contemporary applications, such as LLMs (Large Language Models), in which the mixture has a large number p of support points and the size N of the sample observed from the mixture is also large. Our proposed estimator combines two classical estimators, obtained respectively via a method of moments (MoM) and the expectation-maximization (EM) algorithm. Although both estimator types have been studied from a theoretical perspective for Gaussian mixtures, no similar results exist for softmax mixtures for either procedure.

We develop a new MoM parameter estimator based on latent moment estimation that is tailored to our model, and provide the first theoretical analysis of a MoM-based procedure in softmax mixtures. Although consistent as N, p → ∞, MoM for softmax mixtures can exhibit poor numerical performance, an empirical observation in line with those made for other mixture models. Nevertheless, since MoM is provably in a neighborhood of the target, it can be used as a warm start for any iterative algorithm. We study the EM algorithm in detail and provide its first theoretical analysis for softmax mixtures, extending the only other class of similar results, valid for Gaussian mixtures. Our final proposal for parameter estimation is the EM algorithm with a MoM warm start. In addition to achieving the desired parametric estimation rates, this combined procedure provides computational savings relative to the standard practice of selecting one of the outputs of multiple EM runs, each initialized at random. These facts are supported by our simulation studies. Concrete examples that substantiate the broad applicability of the model will be given throughout the talk.
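To make the setup concrete, here is a minimal NumPy sketch of EM for a K-component softmax mixture, run from a supplied warm start that stands in for the MoM estimate described in the abstract. This is an illustrative sketch under simplifying assumptions (fixed item features, a single gradient step as the M-step for each component), not the paper's actual procedure; all function names are hypothetical.

```python
import numpy as np


def softmax_probs(X, beta):
    # Choice probabilities for one component: X is (p, L), beta is (L,).
    logits = X @ beta
    logits -= logits.max()  # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()


def mixture_loglik(X, choices, beta, pi):
    # Log-likelihood of observed choices under the K-component mixture.
    P = np.stack([softmax_probs(X, b) for b in beta], axis=1)  # (p, K)
    return np.log(P[choices] @ pi).sum()


def em_softmax_mixture(X, choices, beta_init, pi_init, n_iter=100, lr=0.2):
    """EM for a softmax mixture, started at (beta_init, pi_init).

    X: (p, L) item features; choices: (N,) observed item indices;
    beta_init: (K, L) warm-start parameters (e.g. a MoM estimate).
    The M-step here takes one gradient step per component (generalized EM).
    """
    beta, pi = beta_init.copy(), pi_init.copy()
    K, N = beta.shape[0], choices.shape[0]
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] of component k for observation i.
        P = np.stack([softmax_probs(X, beta[k]) for k in range(K)], axis=1)
        r = pi * P[choices]                      # (N, K), unnormalized
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed form for pi; gradient ascent step for each beta_k.
        pi = r.mean(axis=0)
        for k in range(K):
            pk = softmax_probs(X, beta[k])
            # grad of weighted log-lik: sum_i r_ik (x_{j_i} - E_{beta_k}[x])
            grad = r[:, k] @ (X[choices] - pk @ X)
            beta[k] += lr * grad / N
    return beta, pi
```

Because the warm start is already in a neighborhood of the target, a modest number of EM iterations suffices, which is the computational saving over many randomly initialized EM runs.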

3:30 p.m. - Pre-talk meet-and-greet teatime - 219 Prospect Street, 13th floor; light snacks and beverages will be available in the kitchen area.
Florentina Bunea’s website 
The Zoom link