Consider a population consisting of n individuals, each of whom has one of d types (e.g. their blood type, in which case d=4).
We are allowed to query this database by specifying a subset of the population, and in response we observe a noiseless histogram
(a d-dimensional vector of counts) of types of the pooled individuals.
This measurement model arises in practical situations such as pooling of genetic data and may also be motivated by privacy considerations.
We study this model in the “random dense” setting where each query involves a random subset of individuals of size proportional to n.
We are interested in the number of queries one needs to unambiguously determine the type of each individual,
both information-theoretically and in an algorithmically efficient way.
We discuss the information-theoretic question and present a message-passing algorithm for recovering the signal from a minimal
number of measurements and characterize its exact asymptotic behavior.
The analysis reveals a gap between what is statistically possible and what is achieved by our algorithm.
This is joint work with Aaditya Ramdas, Florent Krzakala, Lenka Zdeborova, and Michael Jordan.
Preprints: arxiv:1611.09981, arxiv:1702.02279
- Speaker: Ahmed El Alaoui (UC Berkeley)
- Friday 16 June 2017, 16:00–17:00
- Venue: MR12, Centre for Mathematical Sciences, Wilberforce Road, Cambridge..
- Series: Statistics; organiser: Quentin Berthet.