Improving Variational Methods for Representation Learning and Generative Modelling of Speech
Hearing aids are clinically proven to improve the quality of life of hearing-impaired individuals. Despite this, many people who could benefit from a hearing aid do not use one. The most common reason is that hearing aids perform poorly in noisy environments. Machine learning offers a collection of powerful methods for attenuating noise in speech signals, and state-of-the-art hearing aids are being designed with more powerful processors that can run machine learning algorithms.
Several key factors make deploying machine learning on hearing aids challenging. Firstly, many state-of-the-art machine learning methods have high resource requirements (e.g., GPUs or large amounts of memory), so they cannot run in real time on low-complexity systems. Secondly, methods need to be causal (i.e., they must not depend on access to future information). More generally, the objectives that speech enhancement and separation models typically optimise (e.g., L1 or L2 losses) correlate poorly with perceptual quality.
The goal of this thesis is to develop improved, low-complexity models for attenuating noise in speech signals. To that end, we aim to develop more principled models and to better understand existing frameworks.
We begin by developing an improved autoencoder-based generative model, which we name the Bounded Information Rate Variational Autoencoder. This model resolves the posterior collapse issue with Variational Autoencoders by pre-specifying the information rate between the data and latent variables. In a further extension of this model, we relax the constraints on the distribution of latent variables. Hence, the variance of this distribution is constrained, but not the shape. We name this model the Variance Constrained Autoencoder (VCAE). The VCAE outperforms other state-of-the-art generative models in both reconstruction and generation quality.
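To make the idea of constraining the variance of the latent distribution, but not its shape, more concrete, the following is a minimal sketch of one way such a constraint could be written as a penalty. It is an illustration only, not the objective used in the thesis: the function name `latent_variance_penalty`, the use of the total (summed over dimensions) variance, and the absolute-difference penalty are all assumptions.

```python
import numpy as np

def latent_variance_penalty(latents, target_variance):
    """Hypothetical penalty on the total variance of a batch of latent codes.

    latents: array of shape (batch, dim). Only the overall variance is
    constrained; nothing is assumed about the shape of the distribution.
    """
    # Centre the batch so that the variance is measured around the batch mean.
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Total variance: mean squared distance from the batch mean.
    total_variance = np.mean(np.sum(centered ** 2, axis=1))
    # Assumed penalty form: absolute deviation from the target variance.
    return abs(total_variance - target_variance)

rng = np.random.default_rng(0)
unit_latents = rng.standard_normal((20000, 4))   # total variance close to 4
penalty_matched = latent_variance_penalty(unit_latents, target_variance=4.0)
penalty_scaled = latent_variance_penalty(2.0 * unit_latents, target_variance=4.0)
```

In this sketch the penalty is near zero when the batch already has the target total variance and grows when the latents are rescaled, regardless of how the latents are distributed.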
In subsequent work, we apply the Variance Constrained Autoencoder to speech enhancement. The model is trained to reconstruct a clean target signal from a noisy input. Importantly, rather than enhancing large segments, we enhance short blocks ($\sim 4$ ms). This allows our system to achieve lower computational complexity and higher subjective performance than other state-of-the-art approaches.
Our final contribution is an empirical study of a speech separation system called Conv-TasNet. This model separates speech signals by applying multiplicative masks to the mixture signal in a learned over-parameterised transform domain. The phenomenon of interest is the structure of the learned transform. Specifically, an over-parameterised frequency domain transform is learned, where more emphasis is placed on the regions with more signal energy. Hence, the transformation resembles something that is perceptually motivated.
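The masking operation described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not Conv-TasNet itself: the encoder is a random over-parameterised basis rather than a learned one, the decoder is its pseudo-inverse, and the masks are random values normalised over sources, standing in for the output of the separation network. All names are hypothetical.

```python
import numpy as np

def separate_with_masks(mixture_frames, encoder, decoder, masks):
    """Apply per-source multiplicative masks in a transform domain.

    mixture_frames: (frames, frame_len)
    encoder:        (frame_len, num_basis), with num_basis > frame_len
    decoder:        (num_basis, frame_len)
    masks:          (sources, frames, num_basis)
    """
    coeffs = mixture_frames @ encoder        # (frames, num_basis)
    masked = masks * coeffs[None, :, :]      # (sources, frames, num_basis)
    return masked @ decoder                  # (sources, frames, frame_len)

rng = np.random.default_rng(0)
frame_len, num_basis, num_frames, num_sources = 16, 64, 10, 2

# Assumption: random basis in place of the learned transform.
encoder = rng.standard_normal((frame_len, num_basis))
# Assumption: the pseudo-inverse as decoder, so encoder @ decoder is the identity.
decoder = np.linalg.pinv(encoder)

mixture_frames = rng.standard_normal((num_frames, frame_len))

# Assumption: random masks that sum to one over sources (softmax over sources).
logits = rng.standard_normal((num_sources, num_frames, num_basis))
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

estimates = separate_with_masks(mixture_frames, encoder, decoder, masks)
```

Because the masks in this sketch sum to one over sources and the decoder inverts the encoder, the source estimates add back up to the mixture frames.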
In the empirical study of Conv-TasNet, we first provide empirical evidence that the transform learned by Conv-TasNet prefers uncorrelated coefficients. This explains why a frequency domain transform is learned. Moreover, we show that performance can be improved over the base system by reducing the correlation of the transform space coefficients. In the second part of this contribution, we demonstrate a link between the flatness of the filterbank and the overall separation performance. Furthermore, we show that the density of the vectors affects the flatness of the frame. In other words, we provide evidence that the perceptual structure of the learned filterbank results from an implicit bias in stochastic gradient descent training.
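As an illustration of what measuring the correlation of transform space coefficients might look like, the sketch below computes the mean absolute off-diagonal entry of the coefficient correlation matrix. The choice of statistic and the function name `mean_abs_offdiag_correlation` are assumptions made for this example; the thesis may use a different measure.

```python
import numpy as np

def mean_abs_offdiag_correlation(coeffs):
    """Average absolute correlation between distinct transform coefficients.

    coeffs: array of shape (frames, num_basis), one row per frame.
    """
    # Correlation between columns, i.e. between transform coefficients.
    corr = np.corrcoef(coeffs, rowvar=False)
    # Keep only the off-diagonal entries (pairs of distinct coefficients).
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))

rng = np.random.default_rng(0)

# Independent coefficients: the statistic should be close to zero.
independent = rng.standard_normal((5000, 8))
score_independent = mean_abs_offdiag_correlation(independent)

# Coefficients that are scaled copies of one signal: the statistic is one.
base = rng.standard_normal((1000, 1))
redundant = np.hstack([base, 2.0 * base, -base])
score_redundant = mean_abs_offdiag_correlation(redundant)
```

A lower value indicates coefficients that are closer to uncorrelated, which is the property the learned transform is reported to prefer.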