Monday, January 29, 2018

Intriguing Properties of Randomly Weighted Networks: Generalizing while Learning Next to Nothing

Intriguing indeed!

Training deep neural networks results in strong learned representations that show good generalization capabilities. In most cases, training involves iterative modification of all weights inside the network via back-propagation. In this paper, we propose to take an extreme approach and fix almost all weights of a deep convolutional neural network in their randomly initialized values, allowing only a small portion to be learned. As our experiments show, this often results in performance which is on par with the performance of learning all weights. The implications of this intriguing property of deep neural networks are discussed and we suggest ways to harness it to create more robust representations.
h/t Iacopo and Miles
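
To make the setup concrete, here is a minimal PyTorch sketch of the general idea: initialize a small convolutional network randomly, freeze everything, and let only a small portion of the weights be learned. The tiny architecture and the choice of which parameters remain trainable (here, the final classifier) are assumptions for illustration, not the authors' exact protocol.

import torch
import torch.nn as nn

# A small randomly initialized convolutional network (architecture chosen
# for illustration only).
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)

# Fix almost all weights at their random initialization...
for p in model.parameters():
    p.requires_grad = False
# ...and allow only a small portion (here: the final classifier) to be learned.
for p in model[-1].parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

# One illustrative training step on random data.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

In this sketch the optimizer only ever sees the classifier's parameters, so the features feeding it are whatever the random initialization happens to provide.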


2 comments:

SeanVN said...

I think the idea with generalization is to make the decision regions as large as possible, consistent with the training data. Tishby noted a diffusion process occurring in neural networks trained by backpropagation, leading to compression:
https://youtu.be/bLqJHjXihK8
I presume that diffusion process increases the size of the decision regions maximally.

I suppose just having a lot of random weights around automatically causes some broadening of the decision regions. Anyway, it is interesting information for me. It provides some justification for evolving subsets of weights in a neural network rather than trying to evolve them all at once.
There is an interesting question of whether, or under what circumstances, evolution can provide a diffusion process similar to the one backpropagation produces.
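
A minimal sketch of the subset-evolution idea raised above (an editorial illustration, not from the paper or the comment): a (1+1) evolution strategy that perturbs only a few randomly chosen weights per generation and keeps the candidate if the loss improves. The toy least-squares objective is an arbitrary stand-in.

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 20)), rng.normal(size=100)

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = rng.normal(size=20)                          # all weights start random
best = loss(w)
for _ in range(2000):
    idx = rng.choice(20, size=2, replace=False)  # mutate only a small subset
    candidate = w.copy()
    candidate[idx] += 0.1 * rng.normal(size=2)
    if loss(candidate) < best:
        w, best = candidate, loss(candidate)
print("final mean squared error:", best)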

SeanVN said...

One other thing I should mention is information loss caused by non-linear functions. A linear transformation can always be organized to leave the information content almost completely intact. With standard (IEEE) floating point operations, any non-linear (activation) function will always cause some information loss (i.e. the first derivative will be very coarse):

https://www.spsc.tugraz.at/system/files/Geiger_InfoLossStatic.pdf

Layer after layer, the non-linearity in a deep neural network compounds, and at each layer a significant amount of information about the input is lost. After a number of layers the network ends up on a set trajectory; the input information is no longer available (it has been washed out) to cause further switching between decision regions.
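
A rough numerical illustration of this point (an editorial sketch, not taken from the linked paper): in float32, scaling by a power of two is exact and collision-free, while a saturating nonlinearity such as tanh maps many distinct inputs to the same representable output, so some information about the input is irrecoverably lost.

import numpy as np

x = np.linspace(-10, 10, 1_000_000, dtype=np.float32)   # distinct float32 inputs

linear = np.float32(0.5) * x      # scaling by a power of two is exact in IEEE arithmetic
nonlinear = np.tanh(x)            # saturates towards +/- 1

print(np.unique(x).size)          # 1,000,000 distinct inputs
print(np.unique(linear).size)     # same count: the linear map loses nothing
print(np.unique(nonlinear).size)  # noticeably fewer: collisions, i.e. lost information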

In the system described, maybe the random pathways through the network allow information about the input to pass through less hindered than if all the weights had been organized.
