Backpropagation-free learning with an information surrogate
This thesis explores modern deep neural networks from an information-theoretic point of view. Its main contribution is a new and potentially more efficient training framework that serves as an alternative to conventional end-to-end backpropagation. The thesis develops and analyzes several architectures using this framework, including a new and more biologically plausible learning architecture.
To address the computational difficulties of dealing with information-theoretic quantities, this thesis turns to an existing statistical technique, the Hilbert-Schmidt independence criterion (HSIC). HSIC is a non-parametric kernel method for characterizing the (in)dependence of random variables. In this thesis, HSIC is used to computationally formulate and explore the information bottleneck principle. The information bottleneck can be seen as a trade-off in the hidden representation between the information needed for predicting the task-specific target and the information retained about the input. The thesis explores the information bottleneck and its HSIC formulation through the following ideas:

Blind facial basis discovery. The use of HSIC as an approximate measure of independence is explored through a small problem resembling Independent Component Analysis (ICA) or blind basis separation. Three-dimensional computer avatars are often implemented as a parametric model in which the parameters that control the facial expression are defined as part of a so-called blendshape system. However, high-quality avatar models are constructed by laborious manual digital sculpting. The proposed method uses HSIC as an ICA-like criterion to discover a distinct facial basis from a given facial animation. The results show that an ICA criterion can be implemented simply and effectively using HSIC regularization, and in the visual results the proposed method successfully generates a distinct facial basis. The use of HSIC in this chapter is then adopted in the remaining chapters as the mutual information surrogate.
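Since this estimator recurs throughout the thesis, a minimal sketch may help fix ideas. The following implements the standard biased empirical HSIC estimator, HSIC(X, Y) ≈ tr(KHLH)/(n−1)², with Gaussian kernels; the kernel width and other details here are illustrative assumptions rather than the thesis's exact configuration.

    import torch

    def gram_rbf(x, sigma=1.0):
        # Gaussian (RBF) Gram matrix from pairwise squared distances.
        d2 = torch.cdist(x, x).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))

    def hsic(x, y, sigma=1.0):
        # Biased empirical HSIC between minibatches x and y.
        # Inputs are 2-D (batch x features); encode labels as one-hot first.
        n = x.shape[0]
        K = gram_rbf(x.flatten(1).float(), sigma)
        L = gram_rbf(y.flatten(1).float(), sigma)
        H = torch.eye(n, device=x.device) - torch.ones(n, n, device=x.device) / n
        return torch.trace(K @ H @ L @ H) / (n - 1) ** 2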
HSIC-bottleneck. The HSIC-bottleneck is an alternative to conventional cross-entropy loss and backpropagation that has a number of distinct advantages. The network is trained with fully localized bottleneck objectives that remove the need for end-to-end training. Additionally, the network's output representation can be used directly for classification regardless of its number of dimensions. This thesis shows that a very deep neural network without skip-connection techniques is learnable with fully localized objectives in the HSIC-bottleneck framework, avoiding the learning difficulties often seen when backpropagation is applied to very deep networks without skip connections. The HSIC-bottleneck is the backbone concept of this thesis; it is extended and applied to several research problems in the following chapters.
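In spirit, each layer trains against its own objective of the form HSIC(Z, X) − β·HSIC(Z, Y), with no gradient crossing layer boundaries. The sketch below shows one way this locality can be realized with the hsic estimator above; the toy layer stack, beta value, and data are illustrative assumptions, not the thesis's exact training recipe.

    import torch
    import torch.nn.functional as F

    # Toy stand-ins (illustrative): a small stack of dense layers, random data.
    num_classes, beta = 10, 100.0
    layers = [torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.ReLU())
              for d in (784, 64, 64)]
    opts = [torch.optim.SGD(l.parameters(), lr=0.01) for l in layers]
    x = torch.randn(128, 784)
    y = F.one_hot(torch.randint(0, num_classes, (128,)), num_classes).float()

    z = x
    for layer, opt in zip(layers, opts):
        z = layer(z.detach())                  # detach: no signal crosses layers
        loss = hsic(z, x) - beta * hsic(z, y)  # local bottleneck objective
        opt.zero_grad()
        loss.backward()
        opt.step()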
HSIC-subsampling. HSIC-subsampling provides an efficient sampling methodology to accelerate HSIC computation. By taking advantage of stochastic minibatch gradient descent learning, HSIC-subsampling is capable of approximating the entire HSIC computation after a few training iterations. It can be directly applied to various objectives that involve HSIC computation, such as the HSIC-bottleneck.
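This summary does not spell out the estimator, but one plausible reading is a running average of per-minibatch HSIC values: because each minibatch is a random subsample, a few iterations of averaging approximate the full-dataset quantity at O(m²) per-batch cost rather than a single O(n²) computation. The toy data stream and momentum value below are hypothetical choices.

    import torch

    # Toy data stream (illustrative) standing in for a minibatch loader.
    data = [(torch.randn(128, 784), torch.randn(128, 10)) for _ in range(20)]

    ema, momentum = 0.0, 0.9          # momentum is a hypothetical choice
    for xb, yb in data:
        h = hsic(xb, yb).item()       # O(m^2) per batch, never O(n^2) at once
        ema = momentum * ema + (1 - momentum) * h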
Predictive bottleneck. The predictive bottleneck is an extension of the HSIC-bottleneck idea. Rather than using an accurate (and potentially somewhat expensive) dependency measure such as HSIC in the objective, the predictive bottleneck uses lightweight auxiliary networks to approximate the localized information bottleneck objective. Furthermore, the predictive bottleneck demonstrates the ability to discover relevant information in the training input that cannot be easily found by traditional learning. Two prominent results are demonstrated empirically in these experiments. First, a gray-scale dataset is manipulated by embedding the most significant bits into the least significant bits and replacing the most significant bits with Gaussian noise; the predictive bottleneck recognizes the embedded information without memorizing the noise. Second, the predictive bottleneck reduces the network's sensitivity to noise: the experiments show that, under the predictive bottleneck, noise has less impact on network performance.
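As a rough illustration of the auxiliary-network idea (the exact architecture is not given in this summary, so everything below is an assumption), a lightweight probe can estimate the relevance term locally; gradients flow into the hidden code z and the probe itself, but never end-to-end.

    import torch
    import torch.nn as nn

    class AuxHead(nn.Module):
        # Hypothetical lightweight probe standing in for the HSIC(z, y) term.
        def __init__(self, hidden_dim, num_classes):
            super().__init__()
            self.probe = nn.Linear(hidden_dim, num_classes)

        def local_loss(self, z, y):
            # Cross-entropy of a cheap classifier approximates how much
            # task-relevant information the code z retains; the compression
            # term (dependence on the input) would use an analogous
            # auxiliary predictor on the input side.
            return nn.functional.cross_entropy(self.probe(z), y)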
Biological information bottleneck (BioiB). BioiB re-thinks supervision in deep neural network learning. It demonstrates that it is sufficient to expose relevant task information at an intermediate layer rather than at the output layer, as in traditional supervised learning. Given suitable input features, local InfoMax and biological Barlow-like principles are sufficient for the unsupervised emergence of disentangled concepts suitable for classification, without output supervision. BioiB shows improved learning speed and accuracy when compared to existing biologically motivated methods on the benchmarked datasets. Practically, the experiments show that existing neural models such as VGG16, VGG19, and ResNet32 can be adapted to the BioiB framework, with performance comparable to or exceeding that of backpropagation and existing biologically plausible alternatives.
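To make the "Barlow-like" principle concrete, the sketch below shows one common redundancy-reduction penalty of this family, pushing the batch correlation matrix of a layer's units toward the identity. It is an illustrative stand-in rather than BioiB's exact objective, and lam is a hypothetical weight.

    import torch

    def barlow_like_loss(z, lam=0.005):
        # Redundancy reduction in the spirit of Barlow's principle: normalize
        # each unit over the batch, then drive the unit-unit correlation
        # matrix toward the identity (unit variance, decorrelated units).
        z = (z - z.mean(0)) / (z.std(0) + 1e-6)
        c = (z.T @ z) / z.shape[0]
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()
        off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
        return on_diag + lam * off_diag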
In summary, this thesis not only introduces a computationally attractive class of approaches to information-theoretic learning in deep networks, but also demonstrates the performance of these methods on well-known deep architectures and public datasets. As such, the work presented here should be of interest both to the research community and to industrial production applications.