Tuesday, October 24, 2017

Thesis: Sketching for Large-Scale Learning of Mixture Models by Nicolas Keriven

Congratulations Dr. Keriven ! We've featured some of his work before... 
but now we have the whole thesis and Nicolas tells me there is more good stuff in it. Woohoo !. 



Automatic learning processes are becoming ubiquitous in many domains of science. However, nowadays databases commonly comprise millions or billions of elements, which challenge traditional learning methods. Furthermore, modern database architectures involve new difficulties: data may be seen once then discarded (a situation usually referred to as data stream), often databases are not stored in one single location but distributed across several storage places and it is undesirable to gather the whole database in one place for the sake of privacy and robustness to malicious attacks. It has thus become necessary to derive learning procedures that are amenable to very large databases, and to distributed and streaming computing. A popular idea is to define an intermediary compressed representation of a database, which is fast to compute, adapted to streaming and distributed computing through update and merge mechanisms, preserve data privacy, and such that the desired learning task can be performed using only this compressed representation, with a computational complexity that is greatly reduced compared to using the full database. A popular class of such representations is called linear sketches: the whole database is compressed into a single fixed-size vector called sketch, such that the sketch of the union of two databases is the sum of their sketches. Because of this property it is obvious that linear sketches are particularly convenient for streaming, distributed and parallel computing. In [BGP13; BGP15], Bourrier et al. introduced a learning method based on a linear sketch formed by a random sampling of the empirical characteristic function of a collection of multidimensional vectors. They showed empirically that it was possible to fit a Gaussian Mixture Model (GMM) with fixed identity covariance on the original data, using only its sketch. However, the method was restricted to GMMs with identity covariance, and theoretical justi- fications were still an open question. Extending this method to other models and providing a theoretical analysis of the approach is the main purpose of this thesis work. To do so, we develop an original framework based on several different sets of mathematical tools. The expression of the sketching operator is formalized by combining kernel mean embedding, which allows to define tunable Hilbertian metrics on the set of probability distributions, with Random Feature expansions, that approximate the infinite-dimensional mapping associated with a kernel function by a finite-dimensional mapping designed randomly. Using this mathematical framework, we analyze the sketching method under the lens of Compressive Sensing, which states that any signal that is in some sense less complex than the ambient dimension can be successfully compressed and estimated. We adapt classic proofs for finite-dimensional settings to our generalized infinite-dimensional framework. We provide guarantees for many problems, including for that of estimating mixtures of multivariate elliptic α-stable distributions from a sketch, for which no estimator was known. We particularly extend the framework and relate it to more traditional learning in two cases: first when recovering centroids from a sketch for the k-means or k-medians problem, and for GMM estimation with known covariance. We introduce a flexible heuristic greedy algorithm coined Compressive Learning - Orthogonal Matching Pursuit with Replacement (CL-OMPR) that can estimate any parametric mixture model from any sketch in a very wide variety of situations. Experiments are performed on real and synthetic data for three models. First, mixtures of Diracs, for which our approach is shown to be more efficient and more stable than k-means on large databases; second, GMMs with unknown diagonal covariances, where the proposed approach is seen to be faster and lighter that classic Expectation Maximization (EM). And, finally, mixtures of multivariate elliptic α-stable distributions, where our approach is the first viable algorithm of which we are aware that can perform this task.


Résumé : Les bases de données modernes sont de très grande taille, parfois divisées et distribuées sur plusieurs lieux de stockage, ou encore sous forme de flux de données : ceci soulève de nouveaux défis majeurs pour les méthodes d’apprentissage statistique. Une des méthodes récentes capable de s’adapter à ces situations consiste à d’abord compresser les données en une structure appelée sketch linéaire, puis ensuite de réaliser la tâche d’apprentissage en utilisant uniquement ce sketch, ce qui est extrêmement rapide si celui-ci est de petite taille. Dans cette thèse, nous définissons une telle méthode pour estimer un modèle de mélange de distributions de probabilités à partir des données, en utilisant uniquement un sketch de celles-ci. Ce sketch est défini en s’inspirant de plusieurs notions venant du domaine des méthodes à noyaux : le plongement par noyau moyen et les approximations aléatoires de noyaux. Défini comme tel, le sketch correspond à des mesures linéaires de la distribution de probabilité sous-jacente aux données. Ainsi nous analysons le problème en utilisant des outils venant du domaine de l’acquisition comprimée, dans lequel un signal est mesuré aléatoirement sans perte d’information, sous certaines conditions. Nous étendons certains résultats de l’acquisition comprimée à la dimension infinie, donnons des conditions génériques garantissant le succès de notre méthode d’estimation de modèles de mélanges, et les appliquons à plusieurs problèmes, dont notamment celui d’estimer des mélanges de distributions stables multivariées, pour lequel il n’existait à ce jour aucun estimateur. Notre analyse est basée sur la construction d’opérateurs de sketch construits aléatoirement, qui satisfont une Propriété d’Isométrie Restreinte dans l’espace de Banach des mesures finies signées avec forte probabilité. Dans une second partie, nous introduisons un algorithme glouton capable heuristiquement d’estimer un modèle de mélange depuis un sketch linéaire. Cet algorithme est appliqué sur données simulées et réelles à trois problèmes : l’estimation de centres significatifs dans les données, pour lequel on constate que la méthode de sketch est significativement plus rapide qu’un algorithme de k-moyennes classique, l’estimation de mélanges de Gaussiennes, pour lequel elle est plus rapide qu’un algorithme d’Espérance-Maximisation, et enfin l’estimation de mélange de distributions stables multivariées, pour lequel il n’existait à ce jour, à notre connaissance, aucun algorithme capable de réaliser une telle tâche.

No comments: