

Poster

Learning High-dimensional Gaussians from Censored Data

Debmalya Mandal · Stefanie Jegelka · Themis Gouleakis · Yuhao Wang


Abstract: We provide efficient algorithms for the problem of distribution learning from high-dimensional Gaussian data where in each sample, some of the variable values are missing. We suppose that the variables are *missing not at random (MNAR)*. The missingness model, denoted by $S(y)$, is the function that maps any point $y \in \mathbb{R}^d$ to the subset of its coordinates that are seen. In this work, we assume that $S(y)$ is known. We study the following two settings:

- [**Self-censoring**] An observation $x$ is generated by first sampling the true value $y$ from a $d$-dimensional Gaussian $\mathcal{N}(\mu, \Sigma)$ with unknown $\mu$ and $\Sigma$. For each coordinate $i$, there exists a set $S_i \subseteq \mathbb{R}$ such that $x_i = y_i$ if and only if $y_i \in S_i$. Otherwise, $x_i$ is missing and takes a generic value (e.g., "?"). We design an algorithm that learns $\mathcal{N}(\mu, \Sigma)$ up to TV distance $\varepsilon$, using $\mathrm{poly}(d, 1/\varepsilon)$ samples, assuming only that each pair of coordinates is observed with sufficiently high probability.
- [**Linear thresholding**] An observation $x$ is generated by first sampling $y$ from a $d$-dimensional Gaussian $\mathcal{N}(\mu, \Sigma)$ with unknown $\mu$ and known $\Sigma$, and then applying the missingness model $S$ where $S(y) = \{i \in [d] : v_i^T y \leq b_i\}$ for some $v_1, \ldots, v_d \in \mathbb{R}^d$ and $b_1, \ldots, b_d \in \mathbb{R}$. We design an efficient mean estimation algorithm, assuming that none of the possible missingness patterns is very rare conditioned on the values of the observed coordinates and that any small subset of coordinates is observed with sufficiently high probability.
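To make the two data-generation models concrete, here is a minimal simulation sketch (not code from the paper; all function names, the interval-shaped choice of each $S_i$, and the specific $v_i$, $b_i$ values are illustrative assumptions):

```python
import numpy as np

def self_censoring_sample(mu, Sigma, lo, hi, rng):
    """Self-censoring model: coordinate i is observed iff y_i lies in S_i.
    Here each S_i is assumed to be an interval [lo_i, hi_i]; censored
    coordinates take the generic value NaN (standing in for "?")."""
    y = rng.multivariate_normal(mu, Sigma)
    x = y.copy()
    x[(y < lo) | (y > hi)] = np.nan
    return x

def linear_thresholding_sample(mu, Sigma, V, b, rng):
    """Linear thresholding model: S(y) = {i : v_i^T y <= b_i}, where
    row i of V is v_i. Coordinates outside S(y) are censored."""
    y = rng.multivariate_normal(mu, Sigma)
    x = y.copy()
    x[V @ y > b] = np.nan
    return x

rng = np.random.default_rng(0)
d = 3
mu, Sigma = np.zeros(d), np.eye(d)

# Self-censoring: each coordinate is hidden outside [-1, 1].
x1 = self_censoring_sample(mu, Sigma, np.full(d, -1.0), np.full(d, 1.0), rng)

# Linear thresholding: coordinate i is hidden when the sum of all
# coordinates exceeds 0 (every v_i is the all-ones vector, b_i = 0).
x2 = linear_thresholding_sample(mu, Sigma, np.ones((d, d)), np.zeros(d), rng)
print(x1, x2)
```

Note that in the self-censoring model the visibility of coordinate $i$ depends only on $y_i$ itself, whereas under linear thresholding it can depend on all coordinates of $y$ through $v_i^T y$; this is what makes the second setting require the conditional non-rarity assumptions stated above.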
