## Friday, July 29, 2016

### Stochastic Frank-Wolfe Methods for Nonconvex Optimization

All the previous blog entries around the Frank-Wolfe can be found under this Frank-Wolfe tag.

We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization problems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For objective functions that decompose into a finite-sum, we leverage ideas from variance reduction techniques for convex optimization to obtain new variance reduced nonconvex Frank-Wolfe methods that have provably faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.
h/t Atlas Wang

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Thursday, July 28, 2016

### The non-convex Burer-Monteiro approach works on smooth semidefinite programs / On the low-rank approach for semidefinite programs arising in synchronization and community detection

Vlad just sent me the following:

Hi Igor,

You may find of interest my new paper (joint work with Nicolas Boumal and Afonso Bandiera) on theoretical guarantees for the Burer Monteiro approach for smooth SDPs. We are able to show the following: consider a fixed set of m linear constraints for an SDP with compact domain, and the corresponding factorized BM formulation with rank constrained to be at most about sqrt(m). If the rank-constrained problem has a domain which is a smooth manifold, then generically in the cost matrix C of the SDP, this BM rank-constrained problem is non-convex in only a benign way, in that 2nd order critical points are global minimizers. We also establish quantitative bounds on the quality of candidate solution as a function of approximate 2nd order optimality and provide examples of applications and a numerical study.

http://arxiv.org/pdf/1606.04970.pdf

There is also a related paper with the same authors in which we analyze regimes of planted MaxCut and community detection. In these specific settings we provide much stronger guarantees, in that the BM factorization has no spurious local optima even at rank 2.

http://arxiv.org/pdf/1602.04426v2.pdf

Best,

Semidefinite programs (SDP's) can be solved in polynomial time by interior point methods, but scalability can be an issue. To address this shortcoming, over a decade ago, Burer and Monteiro proposed to solve SDP's with few equality constraints via rank-restricted, non-convex surrogates. Remarkably, for some applications, local optimization methods seem to converge to global optima of these non-convex surrogates reliably. Although some theory supports this empirical success, a complete explanation of it remains an open question. In this paper, we consider a class of SDP's which includes applications such as max-cut, community detection in the stochastic block model, robust PCA, phase retrieval and synchronization of rotations. We show that the low-rank Burer-Monteiro formulation of SDP's in that class almost never has any spurious local optima.

To address difficult optimization problems, convex relaxations based on semidefinite programming are now common place in many fields. Although solvable in polynomial time, large semidefinite programs tend to be computationally challenging. Over a decade ago, exploiting the fact that in many applications of interest the desired solutions are low rank, Burer and Monteiro proposed a heuristic to solve such semidefinite programs by restricting the search space to low-rank matrices. The accompanying theory does not explain the extent of the empirical success. We focus on Synchronization and Community Detection problems and provide theoretical guarantees shedding light on the remarkable efficiency of this heuristic.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Wednesday, July 27, 2016

### Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION (TM) sequencing

About four years ago, I tried to predict the future for August 25, 2030. In order to do this, I first mentioned The Steamrollers i.e. technologies that were exponential in nature. Nanopore sequencing was one of them. In the second installment, I mentioned different algorithms that could help in making sense of the data generated by these steamrollers (Predicting the Future: Randomness and Parsimony). Streaming was one of them. It is really no surprise, if like in hyperspectral imaging or nanopore sequencing you are producing a lot of data, your interest switch from the modeling aspect of things to how can it be helpful now and how fast.  How does it change science ? well you just need to read the following article:

The main contribution of this article is to demonstrate that despite the higher error rate, it is possible to return clinical actionable information, including species and strain identification from as few as 500 reads. We achieved this by developing novel approaches that are less sensitive to base-calling errors and which use whatever subset of genome-wide information is observed up to a point in time, rather than a panel of pre-defined markers or genes. For example, the strain typing presence/absence approach relies only on being able to identify homology to genes and also allows for a level of incorrect gene annotation.

Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinIONTMsequencing by Minh Duc Cao, Devika Ganesamoorthy, Alysha G. Elliott, Huihui Zhang, Matthew A. Cooper and Lachlan J.M. Coin
The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This has great potential to shorten the sample-to-results time and is likely to have benefits such as rapid diagnosis of bacterial infection and identification of drug resistance. However, there are few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, strain typing and antibiotic resistance profile identification. Using four culture isolate samples, as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 min of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 h. While strain identification with multi-locus sequence typing required more than 15x coverage to generate confident assignments, our novel gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### gvnn: Neural Network Library for Geometric Computer Vision

I learned about Spatial transformers at a Deep Learning meetup in Paris from a presentation by Alban Desmaison (yes I sometimes do attend other awesome meetups). Spatial transformers are "aiming at boosting the geometric invariance of CNNs" Transformers are using transformations from computer vision and camera models to help deep neural networks learn better or faster. They do so by providing models behind generic invariances like rotation, translation etc.... It is a little bit unsettling since most defenders of deep neural architectures would rather add data than get help from vision  models :-) Anyway, let's see this as another instance of the Great Convergence where Machine Learning and Computer graphics, the old computer vision paradigm and signal processing meet. Today, we have a library written in Torch that describes several transformation for inclusion in neural networks.

Let me just add that while these transformers do help neural networks in reducing the number of network coefficients, the remaining coefficients are probably summarizing many of the geometric invariances that we collectively have not yet discovered -this is a sense I got from a presentation by Stephane Mallat a while back when he mentioned the connection between deep neural networks and his scattering networks-.

Unrelated but related to the transformers endeavor:

gvnn: Neural Network Library for Geometric Computer Vision by Ankur Handa, Michael Bloesch, Viorica Patraucean, Simon Stent, John McCormac, Andrew Davison

We introduce gvnn, a neural network library in Torch aimed towards bridging the gap between classic geometric computer vision and deep learning. Inspired by the recent success of Spatial Transformer Networks, we propose several new layers which are often used as parametric transformations on the data in geometric computer vision. These layers can be inserted within a neural network much in the spirit of the original spatial transformers and allow backpropagation to enable end-to-end learning of a network involving any domain knowledge in geometric computer vision. This opens up applications in learning invariance to 3D geometric transformation for place recognition, end-to-end visual odometry, depth estimation and unsupervised learning through warping with a parametric transformation for image reconstruction error.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Tuesday, July 26, 2016

### Dual Purpose Hashing

Recent years have seen more and more demand for a unified framework to address multiple realistic image retrieval tasks concerning both category and attributes. Considering the scale of modern datasets, hashing is favorable for its low complexity. However, most existing hashing methods are designed to preserve one single kind of similarity, thus improper for dealing with the different tasks simultaneously. To overcome this limitation, we propose a new hashing method, named Dual Purpose Hashing (DPH), which jointly preserves the category and attribute similarities by exploiting the Convolutional Neural Network (CNN) models to hierarchically capture the correlations between category and attributes. Since images with both category and attribute labels are scarce, our method is designed to take the abundant partially labelled images on the Internet as training inputs. With such a framework, the binary codes of new-coming images can be readily obtained by quantizing the network outputs of a binary-like layer, and the attributes can be recovered from the codes easily. Experiments on two large-scale datasets show that our dual purpose hash codes can achieve comparable or even better performance than those state-of-the-art methods specifically designed for each individual retrieval task, while being more compact than the compared methods.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Monday, July 25, 2016

### Onsager-corrected deep learning for sparse linear inverse problems

Thomas let me know on Twitter that the Great Convergence continues, Today we find out how we go about changing the iterative process of AMP and then learn coefficients of that process as in Deep Learning. It looks like the Learned AMP beats LISTA. Looking back at the few COLT presentations earlier (Saturday videos), one wonders how these solvers change the rule of thumbs on model depth and size. To be continued....

Deep learning has gained great popularity due to its widespread success on many inference problems. We consider the application of deep learning to the sparse linear inverse problem encountered in compressive sensing, where one seeks to recover a sparse signal from a small number of noisy linear measurements. In this paper, we propose a novel neural-network architecture that decouples prediction errors across layers in the same way that the approximate message passing (AMP) algorithm decouples them across iterations: through Onsager correction. Numerical experiments suggest that our "learned AMP" network significantly improves upon Gregor and LeCun's "learned ISTA" network in both accuracy and complexity.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Saturday, July 23, 2016

### Saturday Morning Video: On the Expressive Power of Deep Learning: A Tensor Analysis, Nadav Cohen, Or Sharir, Amnon Shashua @ COLT2016

The preprint on which it relies is:

On the Expressive Power of Deep Learning: A Tensor Analysis by Nadav Cohen, Or Sharir, Amnon Shashua

It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.
other recent work by Nadav include:

Convolutional Rectifier Networks as Generalized Tensor Decompositions by Nadav Cohen, Amnon Shashua

Convolutional rectifier networks, i.e. convolutional neural networks with rectified linear activation and max or average pooling, are the cornerstone of modern deep learning. However, despite their wide use and success, our theoretical understanding of the expressive properties that drive these networks is partial at best. On the other hand, we have a much firmer grasp of these issues in the world of arithmetic circuits. Specifically, it is known that convolutional arithmetic circuits possess the property of "complete depth efficiency", meaning that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be implemented (or even approximated) by a shallow network. In this paper we describe a construction based on generalized tensor decompositions, that transforms convolutional arithmetic circuits into convolutional rectifier networks. We then use mathematical tools available from the world of arithmetic circuits to prove new results. First, we show that convolutional rectifier networks are universal with max pooling but not with average pooling. Second, and more importantly, we show that depth efficiency is weaker with convolutional rectifier networks than it is with convolutional arithmetic circuits. This leads us to believe that developing effective methods for training convolutional arithmetic circuits, thereby fulfilling their expressive potential, may give rise to a deep learning architecture that is provably superior to convolutional rectifier networks but has so far been overlooked by practitioners.

Inductive Bias of Deep Convolutional Networks through Pooling Geometry by Nadav Cohen, Amnon Shashua
Our formal understanding of the inductive bias that drives the success of convolutional networks on computer vision tasks is limited. In particular, it is unclear what makes hypotheses spaces born from convolution and pooling operations so suitable for natural images. In this paper we study the ability of convolutional arithmetic circuits to model correlations among regions of their input. Correlations are formalized through the notion of separation rank, which for a given input partition, measures how far a function is from being separable. We show that a polynomially sized deep network supports exponentially high separation ranks for certain input partitions, while being limited to polynomial separation ranks for others. The network's pooling geometry effectively determines which input partitions are favored, thus serves as a means for controlling the inductive bias. Contiguous pooling windows as commonly employed in practice favor interleaved partitions over coarse ones, orienting the inductive bias towards the statistics of natural images. In addition to analyzing deep networks, we show that shallow ones support only linear separation ranks, and by this gain insight into the benefit of functions brought forth by depth - they are able to efficiently model strong correlation under favored partitions of the input.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### Saturday Morning Video: Benefits of depth in neural networks, Matus Telgarsky @ COLT2016

Representation Benefits of Deep Feedforward Networks by Matus Telgarsky
This note provides a family of classification problems, indexed by a positive integer k, where all shallow networks with fewer than exponentially (in k) many nodes exhibit error at least 1/6, whereas a deep network with 2 nodes in each of 2k layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated k times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### Saturday Morning Video: The Power of Depth for Feedforward Neural Networks by Ohad Shamir @ COLT2016

The attendant paper is here:

The Power of Depth for Feedforward Neural Networks by Ronen Eldan, Ohad Shamir

We show that there is a simple (approximately radial) function on $\reals^d$, expressible by a small 3-layer feedforward neural networks, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth -- even if increased by 1 -- can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### Saturday Morning Video: Randomized Algorithms in Linear Algebra, Ravi Kannan @ COLT2016

As Sebastien pointed out the COLT 2016 videos are out. Here is one: Ravi Kannan on Randomized Algorithms in Linear Algebra

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.