Machine Learning for Non-Intrusive Speech Quality Assessment

Hakami, Mouna

doi:10.26686/wgtn.16985584.v1

thesis_access.pdf (2.11 MB)

Machine Learning for Non-Intrusive Speech Quality Assessment

thesis

posted on 2021-11-11, 01:53 authored by Mouna Hakami

This thesis presents two studies on non-intrusive speech quality assessment methods. The first applies supervised learning methods to speech quality assessment, which is a common approach in machine learning based quality assessment. To outperform existing methods, we concentrate on enhancing the feature set. In the second study, we analyse quality assessment from a different point of view inspired by the biological brain and present the first unsupervised learning based non-intrusive quality assessment that removes the need for labelled training data.

Supervised learning based, non-intrusive quality predictors generally involve the development of a regressor that maps signal features to a representation of perceived quality. The performance of the predictor largely depends on 1) how sensitive the features are to the different types of distortion, and 2) how well the model learns the relation between the features and the quality score. We improve the performance of the quality estimation by enhancing the feature set and using a contemporary machine learning model that fits this objective. We propose an augmented feature set that includes raw features that are presumably redundant. The speech quality assessment system benefits from this redundancy as it results in reducing the impact of unwanted noise in the input. Feature set augmentation generally leads to the inclusion of features that have non-smooth distributions. We introduce a new pre-processing method and re-distribute the features to facilitate the training. The evaluation of the system on the ITU-T Supplement23 database illustrates that the proposed system outperforms the popular standards and contemporary methods in the literature.

The unsupervised learning quality assessment approach presented in this thesis is based on a model that is learnt from clean speech signals. Consequently, it does not need to learn the statistics of any corruption that exists in the degraded speech signals and is trained only with unlabelled clean speech samples. The quality has a new definition, which is based on the divergence between 1) the distribution of the spectrograms of test signals, and 2) the pre-existing model that represents the distribution of the spectrograms of good quality speech. The distribution of the spectrogram of the speech is complex, and hence comparing them is not trivial. To tackle this problem, we propose to map the spectrograms of speech signals to a simple latent space.

Generative models that map simple latent distributions into complex distributions are excellent platforms for our work. Generative models that are trained on the spectrograms of clean speech signals learned to map the latent variable $Z$ from a simple distribution $P_Z$ into a spectrogram $X$ from the distribution of good quality speech.

Consequently, an inference model is developed by inverting the pre-trained generator, which maps spectrograms of the signal under the test, $X_t$, into its relevant latent variable, $Z_t$, in the latent space. We postulate the divergence between the distribution of the latent variable and the prior distribution $P_Z$ is a good measure of the quality of speech.

Generative adversarial nets (GAN) are an effective training method and work well in this application. The proposed system is a novel application for a GAN. The experimental results with the TIMIT and NOIZEUS databases show that the proposed measure correlates positively with the objective quality scores.

History

Copyright Date

2021-11-11

Date of Award

2021-11-11

Publisher

Te Herenga Waka—Victoria University of Wellington

Rights License

Author Retains Copyright

Degree Discipline

Engineering

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level

Doctoral

Degree Name

Doctor of Philosophy

Victoria University of Wellington Unit

Engineering at Victoria

ANZSRC Type Of Activity code

890199 Communication Networks and Services not elsewhere classified

Victoria University of Wellington Item Type

Awarded Doctoral Thesis

Language

en_NZ

Victoria University of Wellington School

School of Engineering and Computer Science

Advisors

Kleijn, Bastiaan; Carnegie, Dale

Machine Learning for Non-Intrusive Speech Quality Assessment

History

Copyright Date

Date of Award

Publisher

Rights License

Degree Discipline

Degree Grantor

Degree Level

Degree Name

Victoria University of Wellington Unit

ANZSRC Type Of Activity code

Victoria University of Wellington Item Type

Language

Victoria University of Wellington School

Advisors

Usage metrics

Categories

Keywords

Licence

Exports