Monday, December 7, 2015 - 4:15pm to 5:30pm

University of Illinois at Urbana-Champaign

Polynomial approximation, moment matching and optimal estimation of the unseen

Estimating the support size of a distribution is a classical problem in statistics, dating back to the early work of Fisher and of Good and Turing, and to the influential work of Efron and Thisted on "how many words did Shakespeare know." In the sublinear regime, where the alphabet cardinality far exceeds the sample size, the challenge lies in how to extrapolate the number of unseen symbols from what has been observed so far.
We first consider the estimation of the support size of an arbitrary discrete distribution whose minimum non-zero mass is at least $\frac{1}{k}$. We show that, to achieve an additive error of $\epsilon k$ with probability at least, say, 0.9, the minimal number of independent samples is within universal constant factors of $\frac{k}{\log k} \log^2\frac{1}{\epsilon}$. The apparatus of best polynomial approximation, as well as its duality to the moment matching problem, plays a key role both in the minimax lower bound and in the construction of optimal linear-time estimators based on Chebyshev polynomials. We also discuss the closely related species problem, where the goal is to estimate the number of distinct colors in a ball-urn model from repeated draws. We characterize the optimal sample complexity by means of a linear program with logarithmically many variables. Extensions to other distributional properties, such as Shannon entropy, and experimental results will be discussed.
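To see why extrapolation beyond the observed symbols is essential in the sublinear regime, the following stdlib-only sketch (a hypothetical illustration, not the estimator from the talk) compares the naive plug-in count of distinct symbols with the true support size of a uniform distribution when the sample size $n$ is a tenth of the alphabet size $k$:

```python
import random

def plug_in_support(samples):
    """Naive plug-in estimate: the number of distinct symbols observed."""
    return len(set(samples))

def expected_distinct(k, n):
    """Expected number of distinct symbols in n i.i.d. uniform draws from k symbols."""
    return k * (1 - (1 - 1 / k) ** n)

# Hypothetical parameters for illustration only.
random.seed(0)
k = 10_000            # alphabet size (uniform distribution, minimum mass 1/k)
n = k // 10           # sublinear sample size: n << k
samples = [random.randrange(k) for _ in range(n)]

seen = plug_in_support(samples)
# With n = k/10, roughly k(1 - e^{-1/10}) ~ 0.095k symbols are seen,
# so the plug-in estimate misses about 90% of the support.
print(f"observed {seen} distinct symbols; true support size {k}")
```

The gap between `seen` and `k` is exactly the "unseen" mass that the Chebyshev-polynomial estimators are designed to extrapolate.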
The unifying theme is a duality view: For various estimation problems on large alphabets, we obtain an effective minimax theorem in terms of a convex or linear program, where the primal solution corresponds to a rate-minimax estimator, the dual solution gives rise to a least favorable prior, and the value of the program yields a constant-factor approximation of the minimax risk.
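The primal/dual structure described above can be seen in miniature in a finite zero-sum game, where best responses on the two sides bracket the value of the game, just as estimator and least favorable prior bracket the minimax risk. The toy stdlib-only sketch below (an illustration of the duality principle, not the actual program from the talk) uses fictitious play on matching pennies:

```python
def fictitious_play(A, iters=20_000):
    """Approximate the value of the zero-sum matrix game A (row player maximizes)
    via fictitious play; returns (lower, upper) bounds bracketing the value."""
    m, n = len(A), len(A[0])
    row_payoff = [0.0] * m  # cumulative payoff of each pure row vs. column's history
    col_payoff = [0.0] * n  # cumulative payoff of each pure column vs. row's history
    for _ in range(iters):
        i = max(range(m), key=lambda r: row_payoff[r])  # row best-responds
        j = min(range(n), key=lambda c: col_payoff[c])  # column best-responds
        for r in range(m):
            row_payoff[r] += A[r][j]
        for c in range(n):
            col_payoff[c] += A[i][c]
    # The row player's best response to the column's empirical mix upper-bounds
    # the value; the column's best response to the row's empirical mix lower-bounds it.
    return min(col_payoff) / iters, max(row_payoff) / iters

# Matching pennies: the value is 1/2, attained by uniform mixing on both sides.
lower, upper = fictitious_play([[1.0, 0.0], [0.0, 1.0]])
```

In the analogy, the row player's mixed strategy plays the role of the estimator (primal), the column player's the least favorable prior (dual), and the shrinking interval `[lower, upper]` the constant-factor approximation of the minimax risk.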
(Joint work with Pengkun Yang)