Wednesday, July 19, 2017

On Unlimited Sampling



Ayush Bhandari just let me know about the interesting approach of Unlimited Sampling in an email exchange:

...In practice, ADCs clip or saturate whenever the amplitude of the signal x exceeds the ADC threshold L. The typical solution is to de-clip the signal, and various methods have been proposed for this purpose.

Based on new ADC hardware that allows sampling according to the principle
y = mod(x,L)

where x is bandlimited and L is the ADC threshold, we show that sampling at about πe (~10) times the Nyquist rate guarantees recovery of x from y. For this purpose we outline a new, stable recovery procedure.

Paper and slides are here.

There is also the PhysOrg coverage. Thanks Ayush ! Here is the paper:


Shannon's sampling theorem provides a link between the continuous and the discrete realms, stating that bandlimited signals are uniquely determined by their values on a discrete set. This theorem is realized in practice using so-called analog-to-digital converters (ADCs). Unlike Shannon's sampling theorem, ADCs are limited in dynamic range. Whenever a signal exceeds some preset threshold, the ADC saturates, resulting in aliasing due to clipping. The goal of this work is to analyze an alternative approach that does not suffer from these problems. Our work is based on recent developments in ADC design, which allow for ADCs that reset rather than saturate, thus producing modulo samples. An open problem that remains is: given such modulo samples of a bandlimited function as well as the dynamic range of the ADC, how can the original signal be recovered, and what are the sufficient conditions that guarantee perfect recovery? In this work, we prove such sufficiency conditions and complement them with a stable recovery algorithm. Our results are not limited to certain amplitude ranges; in fact, the same circuit architecture allows for the recovery of arbitrarily large amplitudes as long as some estimate of the signal norm is available during recovery. Numerical experiments that corroborate our theory indeed show that it is possible to perfectly recover functions that take values orders of magnitude higher than the ADC's threshold.
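To make the folding and unfolding concrete, here is a minimal sketch in Python. This is not the paper's algorithm (which uses higher-order differences and is far more general); this toy version assumes the signal is oversampled enough that consecutive samples differ by less than the threshold L, and that the first sample lies within the ADC range:

```python
import numpy as np

def fold(x, L):
    """Centered modulo: what an ideal self-reset ADC with threshold L records."""
    return (x + L) % (2 * L) - L

def unfold(y, L):
    """Recover x from modulo samples y, assuming |x[n+1] - x[n]| < L
    (sufficient oversampling) and |x[0]| < L (unfolded first sample)."""
    dx = fold(np.diff(y), L)          # first differences of x survive folding
    return np.concatenate(([y[0]], y[0] + np.cumsum(dx)))

# a smooth signal whose amplitude (3) far exceeds the ADC threshold (1)
t = np.arange(400)
x = 3.0 * np.sin(2 * np.pi * 0.005 * t)
y = fold(x, L=1.0)                    # every recorded sample lies in [-1, 1)
x_rec = unfold(y, L=1.0)              # matches x to machine precision
```

The key observation is that the modulo operation commutes with finite differencing: since the differences of a sufficiently oversampled bandlimited signal are small, re-folding the differences of the folded samples recovers them exactly, and a cumulative sum undoes the differencing.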

h/t Laurent.

CSjob: Three postdoc positions, Paris, France

Lenka let me know of three postdoc positions:

If you know talented young researchers looking for a postdoc position, would you please forward them the following openings. One is in my group starting in 2018, and two are on more applied subjects in collaboration with my group.
Thanks in advance for your help. 

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Tuesday, July 18, 2017

Object classification through scattering media with deep learning on time resolved measurement


I love it ! Calibration Invariant Imaging.




We demonstrate an imaging technique that allows identification and classification of objects hidden behind scattering media and is invariant to changes in calibration parameters within a training range. Traditional techniques to image through scattering solve an inverse problem and are limited by the need to tune a forward model with multiple calibration parameters (like camera field of view, illumination position etc.). Instead of tuning a forward model and directly inverting the optical scattering, we use a data driven approach and leverage convolutional neural networks (CNN) to learn a model that is invariant to calibration parameters variations within the training range and nearly invariant beyond that. This effectively allows robust imaging through scattering conditions that is not sensitive to calibration. The CNN is trained with a large synthetic dataset generated with a Monte Carlo (MC) model that contains random realizations of major calibration parameters. The method is evaluated with a time-resolved camera and multiple experimental results are provided including pose estimation of a mannequin hidden behind a paper sheet with 23 correct classifications out of 30 tests in three poses (76.6% accuracy on real-world measurements). This approach paves the way towards real-time practical non line of sight (NLOS) imaging applications.
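The training data above comes from a Monte Carlo model of light transport through the scatterer. As a loose, purely illustrative toy (all names and parameters below are invented, not the paper's model), here is why time-resolved measurements carry information about the hidden scene: stronger scattering produces a later, broader arrival-time histogram, and a CNN can learn to read such signatures:

```python
import numpy as np

def toy_time_histogram(mean_scatters, n_photons=20000, n_bins=64, seed=0):
    """Toy stand-in for a Monte Carlo transport model: each photon undergoes
    a geometric number of scattering events, each adding an exponential
    delay, so more scattering shifts and broadens the arrival histogram."""
    rng = np.random.default_rng(seed)
    n_scat = rng.geometric(1.0 / mean_scatters, n_photons)   # events per photon
    times = np.array([rng.exponential(1.0, k).sum() for k in n_scat])
    hist, _ = np.histogram(times, bins=n_bins, range=(0.0, 64.0))
    return hist / hist.sum()          # normalized time-resolved measurement

thin = toy_time_histogram(mean_scatters=4.0)    # weak scattering
thick = toy_time_histogram(mean_scatters=16.0)  # strong scattering
centers = np.arange(64) + 0.5
# mean photon arrival time grows with scattering strength
assert (thin * centers).sum() < (thick * centers).sum()
```

Randomizing the generator's parameters across the training set is what (in spirit) gives the paper's CNN its invariance to calibration within the training range.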




RLAGPU: High-performance Out-of-Core Randomized Singular Value Decomposition on GPU - implementation -

When the GPU cannot handle your randomized SVD:


RLAGPU: High-performance Out-of-Core Randomized Singular Value Decomposition on GPU by Yuechao Lu, Fumihiko Ino, Yasuyuki Matsushita and Kenichi Hagihara

Randomized Singular Value Decomposition (SVD) [1] is gaining attention for finding structure in scientific data. However, processing large-scale data is not easy due to the limited capacity of GPU memory. To deal with this issue, we propose RLAGPU, an out-of-core method accelerating large-scale randomized SVD on the GPU. The contributions of our method are as follows:

  • An out-of-core implementation that overcomes the GPU memory capacity limit.
  • High performance: in-core and out-of-core routines are switched automatically according to data size and available GPU memory.

We found that our proposed method outperforms the existing cuBLAS-XT by a margin of up to 50%.

An implementation is here: https://github.com/luyuechao/
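For reference, the in-core randomized SVD that such out-of-core schemes build on (a Halko-style randomized range finder followed by an exact SVD of the small projected matrix) can be sketched in a few lines. This is a generic sketch, not the RLAGPU code:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
    """Rank-k randomized SVD: sample the range of A with a Gaussian test
    matrix, orthonormalize, then SVD the small projected matrix."""
    rng = np.random.default_rng(seed)
    Y = A @ rng.standard_normal((A.shape[1], k + oversample))
    for _ in range(n_power_iter):       # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the sampled range
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ Ub[:, :k], s[:k], Vt[:k]

# sanity check on an exactly rank-5 matrix
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
U, s, Vt = randomized_svd(A, k=5)
assert np.allclose(U @ np.diag(s) @ Vt, A)   # recovers A to machine precision
```

The out-of-core part of RLAGPU then amounts to streaming the large matrix-matrix products (A @ Omega, Q.T @ A) through GPU memory in tiles, which is exactly where the automatic in-core/out-of-core switch matters.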



Wednesday, July 12, 2017

Why is #Netneutrality Important to Data Science, Machine Learning and Artificial Intelligence ?


So your ISP decides the speed or the kind of service you can get, based on religion or whatnot. What happens to our field ? Because you have spent so much time on the following services, you are throttled or have to pay for "premium" service. As a result, you may or may not get to

  • follow Andrew Ng's Coursera or Siraj Raval classes
  • submit your Kaggle results on time
  • read ArXiv preprints
  • read the latest GAN paper on time
  • watch NIPS/ICLR/CVPR/ACL videos
  • download datasets
  • pay more to use ML/DL on the cloud
  • share reviews
  • download the latest ML/DL frameworks
  • have access to your Slack channels 
  • read Nuit Blanche
  • follow awesome DL thread on Twitter
  • get scholar google alerts
  • .... 


 

SGD, What Is It Good For ?

Image Credit: NASA/JPL-Caltech/Space Science Institute, 
N00284488.jpg, Titan, Jul. 11, 2017 10:12 AM

As noted per the activity on the subject, there is growing interest in better understanding SGD and related methods; we mentioned two such studies recently on Nuit Blanche.

Sebastian Ruder updated his blog entry on the subject in An overview of gradient descent optimization algorithms (added derivations of AdaMax and Nadam). In Reinforcement Learning or Evolutionary Strategies? Nature has a solution: Both, Arthur Juliani mentions an insight on gradient-based methods in RL (h/t Tarin for the pointer on Twitter):

It is clear that for many reactive policies, or situations with extremely sparse rewards, ES is a strong candidate, especially if you have access to the computational resources that allow for massively parallel training. On the other hand, gradient-based methods using RL or supervision are going to be useful when a rich feedback signal is available, and we need to learn quickly with less data.

But we also had people trying to speed SGD up, while others took the whole adaptive approach with a grain of salt. We also have one where SGD, helped by random features, is used to solve the linear Bellman equation, a tool central to linear control theory.
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency. 
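The two ingredients of that recipe, the linear scaling rule and gradual warmup, are simple to state in code. Here is a sketch using the paper's reference settings (base learning rate 0.1 for a 256-image minibatch); for simplicity the ramp below is per epoch, whereas the paper warms up per iteration:

```python
def scaled_lr(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear scaling rule with gradual warmup: the target learning rate is
    base_lr * batch_size / base_batch, reached by ramping linearly from
    base_lr over the first warmup_epochs epochs."""
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

# minibatch of 8192: target learning rate is 0.1 * 8192 / 256 = 3.2
assert scaled_lr(0, 8192) == 0.1       # start of warmup
assert scaled_lr(5, 8192) == 3.2       # warmup finished
```

The point of the warmup is that the scaled rate (3.2 here) is far too aggressive for the randomly initialized network; ramping up avoids the early-training optimization failures the paper reports.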


Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
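The phenomenon can be miniaturized: on an underdetermined least-squares problem, gradient descent started at zero converges to the minimum-norm interpolating solution, while AdaGrad's per-coordinate metric generally steers it to a different interpolator. This toy sketch (not the paper's construction) illustrates the mechanism:

```python
import numpy as np

# an underdetermined problem: 2 equations, 5 unknowns
A = np.array([[1.0, 0.2, 0.0, 0.5, 0.3],
              [0.0, 1.0, 0.4, 0.1, 0.2]])
b = np.array([1.0, -1.0])

def fit(adaptive, lr=0.1, steps=5000, eps=1e-8):
    w, acc = np.zeros(5), np.zeros(5)
    for _ in range(steps):
        g = A.T @ (A @ w - b)               # least-squares gradient
        if adaptive:                        # AdaGrad: per-coordinate metric
            acc += g * g
            w -= lr * g / (np.sqrt(acc) + eps)
        else:                               # plain gradient descent
            w -= lr * g
    return w

w_gd, w_ada = fit(adaptive=False), fit(adaptive=True)
w_min = np.linalg.pinv(A) @ b               # minimum-norm interpolator
# both fit the data exactly, but GD from zero finds the min-norm solution,
# while AdaGrad's reweighting generally lands on a different interpolator
assert np.allclose(A @ w_gd, b) and np.allclose(A @ w_ada, b, atol=1e-6)
assert np.allclose(w_gd, w_min, atol=1e-4)
```

Both solutions have zero training error; they differ in which of the infinitely many interpolators they pick, which is exactly the implicit-bias gap the paper ties to generalization.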

We introduce a data-efficient approach for solving the linear Bellman equation, which corresponds to a class of Markov decision processes (MDPs) and stochastic optimal control (SOC) problems. We show that this class of control problem can be cast as a stochastic composition optimization problem, which can be further reformulated as a saddle point problem and solved via dual kernel embeddings [1]. Our method is model-free and uses only one sample per state transition from stochastic dynamical systems. Different from related work such as Z-learning [2, 3] based on temporal-difference learning [4], our method is an online algorithm following the true stochastic gradient. Numerical results are provided, showing that our method outperforms the Z-learning algorithm.
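For context, the "linear Bellman equation" refers to Todorov's linearly solvable MDPs, where the desirability function z = exp(-v) satisfies a linear fixed-point equation: z_i = exp(-q_i) Σ_j P(j|i) z_j on interior states, with z fixed to exp(-q) at terminal states. The paper's contribution is solving this model-free from single-sample transitions; as a baseline, the toy exact solve for a small known model looks like this (toy chain invented for illustration):

```python
import numpy as np

# toy first-exit linearly solvable MDP: a 5-state chain, state 4 terminal
n, terminal = 5, 4
q = np.array([0.1, 0.2, 0.1, 0.3, 0.0])      # state costs
P = np.zeros((n, n))                          # passive random-walk dynamics
for i in range(4):
    P[i, max(i - 1, 0)] += 0.5
    P[i, i + 1] += 0.5

# linear Bellman equation: z_i = exp(-q_i) * sum_j P[i, j] * z_j (interior),
# z = exp(-q) at the terminal state; solve the interior block directly
I = [0, 1, 2, 3]
G = np.diag(np.exp(-q[I]))
z = np.ones(n)
z[terminal] = np.exp(-q[terminal])
z[I] = np.linalg.solve(np.eye(4) - G @ P[np.ix_(I, I)],
                       G @ P[np.ix_(I, [terminal])] @ z[[terminal]])
v = -np.log(z)                                # optimal cost-to-go
assert np.allclose(z[I], G @ (P[I] @ z))      # the linear fixed point holds
```

Because the equation is linear in z, no max over actions appears, which is what makes both this direct solve and the paper's stochastic-gradient reformulation possible.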


Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by. This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use. In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
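Of the algorithms that overview covers, Adam is the workhorse; its update is short enough to state in full. A minimal sketch (the standard update rule, applied to a one-dimensional quadratic as a smoke test):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected running moments of the gradient."""
    m = b1 * m + (1 - b1) * g                 # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g             # second moment (scale)
    m_hat = m / (1 - b1 ** t)                 # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# drive f(w) = w^2 (gradient 2w) toward its minimum from w = 5
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
# at a fixed step size Adam settles into a small neighborhood of the optimum
assert abs(w) < 0.5
```

Note the per-coordinate normalization by sqrt(v_hat): near the optimum the update behaves like a sign-gradient step of size ~lr, which is why a fixed step size only reaches a neighborhood and why learning-rate decay still matters with Adam.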



Tuesday, July 11, 2017

Slides: Deep Learning and Reinforcement Learning Summer School 2017 @ MILA Montreal, Canada

The Deep Learning and Reinforcement Learning Summer School 2017 just finished and here are some of the slides presented there (videos should be coming later) 




Monday, July 10, 2017

Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent

Continuing the examination of SGD, on the hardware side this time !


Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent by Christopher De Sa, Matthew Feldman, Christopher Ré, Kunle Olukotun

Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems. 
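Low-precision SGD hinges on quantizing gradients and weights without introducing bias, and the standard trick for that is stochastic rounding. A minimal generic sketch (not the Buckwild! code):

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to the grid of multiples of `step`, randomly up or down with
    probabilities chosen so that E[stochastic_round(x)] = x (unbiased)."""
    y = np.asarray(x, dtype=float) / step
    lo = np.floor(y)
    return step * (lo + (rng.random(y.shape) < y - lo))

rng = np.random.default_rng(0)
# quantize 0.3 to a fixed-point grid of step 1/16: it lands on 4/16 or 5/16,
# with probabilities (0.2, 0.8) so the average comes out to 0.3
samples = stochastic_round(np.full(100000, 0.3), 1 / 16, rng)
assert set(np.unique(samples)) <= {4 / 16, 5 / 16}
assert abs(samples.mean() - 0.3) < 1e-3
```

Unbiasedness is what lets SGD tolerate very coarse grids: the quantization error averages out over iterations instead of accumulating, which is the property the paper's precision analysis relies on.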




YellowFin: An automatic tuner for momentum SGD - implementation -

Here is an implementation for optimizing the momentum in Stochastic Gradient Descent.


Hyperparameter tuning is one of the big costs of deep learning. State-of-the-art optimizers, such as Adagrad, RMSProp and Adam, make things easier by adaptively tuning an individual learning rate for each variable. This level of fine adaptation is understood to yield a more powerful method. However, our experiments, as well as recent theory by Wilson et al., show that hand-tuned stochastic gradient descent (SGD) achieves better results, at the same rate or faster. The hypothesis put forth is that adaptive methods converge to different minima (Wilson et al.). Here we point out another factor: none of these methods tune their momentum parameter, known to be very important for deep learning applications (Sutskever et al.). Tuning the momentum parameter becomes even more important in asynchronous-parallel systems: recent theory (Mitliagkas et al.) shows that asynchrony introduces momentum-like dynamics, and that tuning down algorithmic momentum is important for efficient parallelization.
We revisit the simple momentum SGD algorithm and show that hand-tuning a single learning rate and momentum value makes it competitive with Adam. We then analyze its robustness to learning rate misspecification and objective curvature variation. Based on these insights, we design YellowFin, an automatic tuner for both momentum and learning rate in SGD. YellowFin optionally uses a novel momentum-sensing component along with a negative-feedback loop mechanism to compensate for the added dynamics of asynchrony on the fly. We empirically show YellowFin converges in fewer iterations than Adam on large ResNet and LSTM models, with a speedup up to 2.8x in synchronous and 2.7x in asynchronous settings.
YellowFin: An automatic tuner for momentum SGD 

by Jian Zhang, Ioannis Mitliagkas and Chris Ré.
TLDR; Hand-tuned momentum SGD is competitive with state-of-the-art adaptive methods, like Adam. We introduce YellowFin, an automatic tuner for the hyperparameters of momentum SGD. YellowFin trains models such as large ResNets and LSTMs in fewer iterations than the state of the art. It performs even better in asynchronous settings via an on-the-fly momentum adaptation scheme that uses a novel momentum measurement component along with a negative-feedback loop mechanism.

Summary of results
(Figure) Comparing YellowFin to Adam on training a ResNet on CIFAR100 (left) synchronously; (right) asynchronously, using 16 workers.
Intro

.......

To try our tuner, get your fins on here for Tensorflow and here for PyTorch.


And earlier: Asynchrony begets Momentum, with an Application to Deep Learning by Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, Christopher Ré

Asynchronous methods are widely used in deep learning, but have limited theoretical justification when applied to non-convex problems. We show that running stochastic gradient descent (SGD) in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration. Our result does not assume convexity of the objective function, so it is applicable to deep learning systems. We observe that a standard queuing model of asynchrony results in a form of momentum that is commonly used by deep learning practitioners. This forges a link between queuing theory and asynchrony in deep learning systems, which could be useful for systems builders. For convolutional neural networks, we experimentally validate that the degree of asynchrony directly correlates with the momentum, confirming our main result. An important implication is that tuning the momentum parameter is important when considering different levels of asynchrony. We assert that properly tuned momentum reduces the number of steps required for convergence. Finally, our theory suggests new ways of counteracting the adverse effects of asynchrony: a simple mechanism like using negative algorithmic momentum can improve performance under high asynchrony. Since asynchronous methods have better hardware efficiency, this result may shed light on when asynchronous execution is more efficient for deep learning systems.
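The momentum-like term in question is the classic heavy-ball update; a minimal sketch of momentum SGD with a tunable (possibly negative) momentum parameter, as the paper suggests for counteracting the implicit momentum added by high asynchrony:

```python
import numpy as np

def momentum_sgd(grad, w0, lr=0.1, mu=0.9, steps=200):
    """Heavy-ball momentum SGD: velocity accumulates past gradients.
    mu may be set negative to offset asynchrony-induced momentum."""
    w = np.asarray(w0, dtype=float).copy()
    vel = np.zeros_like(w)
    for _ in range(steps):
        vel = mu * vel - lr * grad(w)      # momentum-smoothed descent direction
        w = w + vel
    return w

# minimize f(w) = 0.5 * ||w||^2 (gradient is w) from a distant start
w = momentum_sgd(lambda w: w, w0=[5.0, -3.0], lr=0.1, mu=0.5)
assert np.linalg.norm(w) < 1e-6
```

In the paper's queuing model, running plain SGD (mu = 0) with staleness behaves like this update with an effective mu > 0, so the total (algorithmic + asynchrony-induced) momentum is what actually needs tuning.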





Thursday, July 06, 2017

Beyond Moore-Penrose Part II: The Sparse Pseudoinverse

One of the findings in recent years is that when faced with underdetermined/undersampled problems, you can choose any regularization you want. The Advanced Matrix Factorization roster of techniques is a prime example of that. Today, we go beyond the traditional Moore-Penrose inverse !




This is the second part of a two-paper series on generalized inverses that minimize matrix norms. In Part II we focus on generalized inverses that are minimizers of entrywise p norms whose main representative is the sparse pseudoinverse for p = 1. We are motivated by the idea to replace the Moore-Penrose pseudoinverse by a sparser generalized inverse which is in some sense well-behaved. Sparsity implies that it is faster to apply the resulting matrix; well-behavedness would imply that we do not lose much in stability with respect to the least-squares performance of the MPP. We first address questions of uniqueness and non-zero count of (putative) sparse pseudoinverses. We show that a sparse pseudoinverse is generically unique, and that it indeed reaches optimal sparsity for almost all matrices. We then turn to proving our main stability result: finite-size concentration bounds for the Frobenius norm of p-minimal inverses for 1 ≤ p ≤ 2. Our proof is based on tools from convex analysis and random matrix theory, in particular the recently developed convex Gaussian min-max theorem. Along the way we prove several results about sparse representations and convex programming that were known folklore, but of which we could find no proof.
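To see what an entrywise ℓ1-minimal generalized inverse looks like, one can compute it column by column as a linear program: minimize ||g||_1 subject to A g = e_i, using the standard split g = p - q with p, q ≥ 0. A sketch with scipy (generic illustration, not the authors' code):

```python
import numpy as np
from scipy.optimize import linprog

def sparse_pinv(A):
    """Columnwise l1-minimal right inverse of a full-row-rank A:
    for each i, solve  min ||g||_1  s.t.  A g = e_i  as an LP."""
    m, n = A.shape
    c = np.ones(2 * n)                       # objective: sum(p) + sum(q)
    A_eq = np.hstack([A, -A])                # constraint: A (p - q) = e_i
    cols = []
    for i in range(m):
        res = linprog(c, A_eq=A_eq, b_eq=np.eye(m)[i])
        cols.append(res.x[:n] - res.x[n:])   # recombine g = p - q
    return np.column_stack(cols)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
X = sparse_pinv(A)
assert np.allclose(A @ X, np.eye(3), atol=1e-7)      # a genuine right inverse
# generically, each column has only m nonzeros, so X is sparser than the MPP
assert np.count_nonzero(np.abs(X) > 1e-9) <= np.count_nonzero(np.linalg.pinv(A))
```

The generic uniqueness and the "m nonzeros per column" optimal sparsity observed here are exactly the properties the paper establishes rigorously.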



Wednesday, July 05, 2017

Slides: BioComp Summer School 2017 on bio-inspired computing hardware


The good folks at the GdR BioComp (@GDRBioComp) organized what looked like an awesome BioComp Summer School 2017, intended to provide an introduction to subject areas around bio-inspired computing hardware. Here are the slides shown at the school. Photos taken during the school are here.


Invited speakers and slides

  • Chloé-Agathe Azencott, Mines ParisTech, France, Machine Learning and applications to genomics  
  • Geoffrey W. Burr, IBM Almaden Research Center, USA, Memristive neuromorphic hardware 
  • Elisabetta Chicca, Cognitive Interaction Technology Center of Excellence, Bielefeld University, Germany, Neuromorphic electronic circuits for building autonomous cognitive systems 
  • Steve Furber, School of Computer Science, University of Manchester, UK, SpiNNaker: ARM system-on-chip architecture  
  • Vincent Gripon, IMT Atlantique, Lab Sticc, France, Indexing, storing and retrieving data in neural networks  
  • Konstantin K. Likharev, Department of Physics and Astronomy, Stony Brook University, USA, Nanoelectronic/Superconductive Neuromorphic Networks  
  • Jean-Pascal Pfister, University of Zurich and ETH, Switzerland, Learning and inference at the level of single synapses and spiking neurons  
  • Panayiota Poirazi, Computational Biology Laboratory, Heraklion, Greece, Computational neuroscience: the role of dendrites in learning and memory  
  • Damien Querlioz, C2N CNRS, France, Computing with unreliable nanodevices  
  • Gregor Schöner, Institut für Neuroinformatik, Germany, Cognition in embodied and situated nervous systems  
  • Rufin VanRullen, CerCo, France, Brain oscillations and perception  





