Improving Variational Methods for Representation Learning and Generative Modelling of Speech

Braithwaite, Daniel

doi:10.26686/wgtn.23462480

thesis

posted on 2023-06-09, 15:19 authored by Daniel Braithwaite

Hearing aids are clinically proven to improve the quality of life of hearing-impaired individuals. Despite this, many people who could benefit from a hearing aid do not use one, most commonly because hearing aids perform poorly in noisy environments. Machine learning offers a collection of powerful methods for attenuating noise in speech signals. Moreover, state-of-the-art hearing aids are being designed with more powerful processors that can run machine learning algorithms.

Several key factors make deploying machine learning on hearing aids challenging. First, many state-of-the-art machine learning methods have high resource requirements (e.g., GPUs or large amounts of memory), so these models cannot run in real time on low-complexity systems. Second, methods need to be causal (i.e., they must not depend on access to future information). More generally, the objectives that speech enhancement and separation models typically optimise (e.g., L1 or L2 losses) correlate poorly with perceptual quality.

The goal of this thesis is to develop improved, low-complexity models for attenuating noise in speech signals. Specifically, we aim to develop more principled models and to better understand existing frameworks.

We begin by developing an improved autoencoder-based generative model, which we name the Bounded Information Rate Variational Autoencoder. This model resolves the posterior collapse issue of Variational Autoencoders by pre-specifying the information rate between the data and the latent variables. In a further extension of this model, we relax the constraints on the distribution of the latent variables: the variance of this distribution is constrained, but not its shape. We name this model the Variance Constrained Autoencoder (VCAE). The VCAE outperforms other state-of-the-art generative models in both reconstruction and generation quality.
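To make the idea of constraining the variance but not the shape concrete, the following is a minimal sketch and not the objective used in the thesis. It assumes a batch of latent vectors and a hypothetical squared penalty on the gap between their total variance and a target value; the names `total_latent_variance`, `variance_penalty`, and `target_variance` are illustrative only.

```python
def total_latent_variance(latents):
    """Sum, over latent dimensions, of the population variance across the batch."""
    n = len(latents)
    dims = len(latents[0])
    total = 0.0
    for d in range(dims):
        column = [z[d] for z in latents]
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total


def variance_penalty(latents, target_variance=1.0):
    """Hypothetical penalty (an assumption, not the thesis objective).

    It is zero whenever the total latent variance matches the target,
    regardless of the shape of the latent distribution.
    """
    return (total_latent_variance(latents) - target_variance) ** 2


# Two differently shaped one-dimensional batches with the same variance
# incur the same (near-zero) penalty.
two_point = [[-1.0], [1.0]]
three_point = [[-(1.5 ** 0.5)], [0.0], [1.5 ** 0.5]]
print(variance_penalty(two_point), variance_penalty(three_point))
```

A penalty of this kind only sees the second moment of the latents, which is why it leaves the shape of the distribution free.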

In subsequent work, we applied the Variance Constrained Autoencoder to speech enhancement. The model is trained to reconstruct a clean target signal from the noisy input. Importantly, rather than enhancing large segments, we enhanced short blocks ($\sim 4$ ms). This allowed our system to have a lower computational complexity and higher subjective performance than other state-of-the-art approaches.
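As a rough illustration of what block-wise enhancement involves, the sketch below splits a signal into short blocks and processes each block independently. The 16 kHz sample rate is an assumption (the abstract does not state one), and `enhance_block` is a hypothetical stand-in for the trained model; here it is just a fixed attenuation.

```python
def block_length(sample_rate_hz, block_ms):
    """Number of samples in one block of the given duration."""
    return int(round(sample_rate_hz * block_ms / 1000.0))


def enhance_blockwise(signal, enhance_block, sample_rate_hz=16000, block_ms=4.0):
    """Apply enhance_block to consecutive blocks and concatenate the results.

    Each output block depends only on its own input block, so no samples
    beyond the current block are needed. The final block may be shorter.
    """
    n = block_length(sample_rate_hz, block_ms)
    output = []
    for start in range(0, len(signal), n):
        block = signal[start:start + n]
        output.extend(enhance_block(block))
    return output


# Placeholder "model": halve every sample (an assumption for illustration).
noisy = [float(i) for i in range(160)]
enhanced = enhance_blockwise(noisy, lambda block: [0.5 * x for x in block])
print(block_length(16000, 4.0), len(enhanced))
```

Under the assumed 16 kHz rate, a 4 ms block contains 64 samples, so the algorithmic delay of such a scheme is on the order of one block.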

Our final contribution is an empirical study of a speech separation system called Conv-TasNet. This model separates speech signals by applying multiplicative masks to the mixture signal in a learned over-parameterised transform domain. The phenomenon of interest is the structure of the learned transform. Specifically, an over-parameterised frequency domain transform is learned, where more emphasis is placed on the regions with more signal energy. Hence, the transformation resembles something that is perceptually motivated.
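The masking operation described above can be sketched on a toy scale. This is not Conv-TasNet itself: the learned encoder and decoder are replaced by a fixed, hand-picked over-parameterised basis (three vectors spanning a two-sample frame, chosen to form a tight frame so that decoding with the same vectors reconstructs the input), and the masks are hypothetical constants rather than network outputs.

```python
import math


def make_tight_frame():
    """Three equally spaced unit directions in 2-D, scaled so that the
    sum of their outer products is the identity matrix."""
    scale = math.sqrt(2.0 / 3.0)
    angles = [math.pi / 2 + k * 2 * math.pi / 3 for k in range(3)]
    return [[scale * math.cos(a), scale * math.sin(a)] for a in angles]


def encode(frame, basis):
    """Transform-domain coefficients: one inner product per basis vector."""
    return [sum(b_k * x_k for b_k, x_k in zip(b, frame)) for b in basis]


def decode(coeffs, basis):
    """Map coefficients back to the signal domain using the same basis."""
    length = len(basis[0])
    return [sum(c * b[k] for c, b in zip(coeffs, basis)) for k in range(length)]


def separate(frame, basis, masks):
    """Apply one multiplicative mask per source in the transform domain."""
    coeffs = encode(frame, basis)
    return [decode([m * c for m, c in zip(mask, coeffs)], basis) for mask in masks]


basis = make_tight_frame()
mixture = [0.5, -1.0]
# Hypothetical masks for two sources; they sum to one per coefficient.
masks = [[0.2, 0.7, 0.9], [0.8, 0.3, 0.1]]
estimates = separate(mixture, basis, masks)
print(estimates)
```

Because the masks sum to one for every coefficient and the basis is a tight frame, the two source estimates add up to the mixture. In the real system both the transform and the masks are learned, so no such guarantee is implied.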

In the empirical study of Conv-TasNet, we first provide empirical evidence that shows the transform learned by Conv-TasNet prefers uncorrelated coefficients. This explains why a frequency domain transform is learned. Moreover, we show that performance can be improved over the base system by reducing the correlation of the transform space coefficients. In the second part of this contribution, we demonstrate a link between the flatness of the filterbank and the overall separation performance. Furthermore, we show that the density of the vectors affects the flatness of the frame. In other words, we provide evidence that the perceptual structure of the learned filterbank results from an implicit bias in stochastic gradient descent training.
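One simple way to quantify how correlated a set of transform coefficients is, offered here only as an illustrative assumption and not as the measure used in the thesis, is the mean absolute Pearson correlation over all pairs of coefficient tracks. In the sketch below, `tracks[i]` is a hypothetical list holding the values of coefficient `i` across successive frames.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def mean_abs_cross_correlation(tracks):
    """Average |correlation| over all distinct pairs of coefficient tracks."""
    pairs = [(i, j) for i in range(len(tracks)) for j in range(i + 1, len(tracks))]
    return sum(abs(pearson(tracks[i], tracks[j])) for i, j in pairs) / len(pairs)


# Fully redundant coefficients versus uncorrelated ones.
redundant = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
uncorrelated = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]]
print(mean_abs_cross_correlation(redundant), mean_abs_cross_correlation(uncorrelated))
```

A lower value indicates that the coefficients carry less redundant information about one another, which is the property the learned transform is reported to prefer.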

History

Copyright Date

2023-06-10

Date of Award

2023-06-10

Publisher

Te Herenga Waka—Victoria University of Wellington

Rights License

Author Retains Copyright

Degree Discipline

Engineering

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level

Doctoral

Degree Name

Doctor of Philosophy

Victoria University of Wellington Item Type

Awarded Doctoral Thesis

Language

en_NZ

Victoria University of Wellington School

School of Engineering and Computer Science

Advisors

Kleijn, Bastiaan; Frean, Marcus

Keywords

Machine Learning; Representation Learning; Generative Modeling; Speech Processing; 170203 Knowledge Representation and Machine Learning
