Open Access Te Herenga Waka-Victoria University of Wellington

Improving Medical Document Classification via Feature Engineering

posted on 2024-04-10, 22:40 authored by Mahdi Abdollahi

Document classification (DC) is the task of assigning predefined labels to unseen documents using a model trained on the available labeled documents. DC has recently attracted considerable attention in the medical field because many problems there can be formulated as classification problems.

For example, clinical risk factor categorization, automatic disease classification, and electronic health record classification are all applications of text classification. DC is critical for medical document management and analysis. Medical DC can assist doctors in decision making, and correct decisions can reduce medical expenses. Medical documents have special attributes that distinguish them from other texts and make them difficult to analyze. For example, their many acronyms, abbreviations, and short expressions make it harder to extract knowledge. The current classification performance on medical documents is not satisfactory.

Furthermore, the available data are limited because of patient privacy. This thesis aims to enhance the input feature sets of medical DC methods to improve their classification performance. Additionally, it develops new data augmentation methods to deal with the shortage of data. To reach these goals, this work develops new feature manipulation methods (such as feature extraction, feature selection, and feature construction) in supervised learning systems to extract new meaningful feature sets. Moreover, it develops ontology-based and dictionary-based data augmentation approaches to create new synthetic documents. This thesis utilizes Evolutionary Computation (EC) techniques such as Particle Swarm Optimisation (PSO), as well as deep learning methods such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Hierarchical Attention Network (HAN), to achieve its objectives.

The main goal of this thesis is to develop new feature engineering approaches to medical document classification that use domain-specific knowledge of the problem to automatically extract prominent features, construct new high-level features, select informative features, and generate new synthetic documents from the original documents. These methods can improve medical document classification performance by enriching the quality of the input data.

This thesis develops three feature engineering approaches, namely domain-specific feature extraction and two-stage and three-stage PSO-based methods, to automatically extract, construct, and select new high-level features for classification. The results demonstrate that the two-stage and three-stage approaches outperformed the related works they were compared against. This thesis also proposes two novel ontology-based data augmentation approaches that generate new synthetic documents from the original training data sets for medical document classification. These approaches employ a domain-specific ontology and a general dictionary to double or triple the size of the training data set. The results show that these approaches successfully improved medical document classification performance.
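As a rough illustration of how PSO can drive feature selection, the following is a minimal sketch of a generic binary PSO, not the two-stage or three-stage method developed in the thesis. The function `binary_pso_select`, the constants `w`, `c1`, and `c2`, and the toy fitness function are all assumptions made for this example; in practice the fitness would typically be the accuracy of a classifier trained on the selected features.

```python
import math
import random

def binary_pso_select(fitness, n_features, n_particles=10, n_iters=30, seed=0):
    """Return (best_mask, best_score) found by a simple binary PSO."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration constants (assumed values)
    # Each particle is a 0/1 mask: 1 means the feature is selected.
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # The sigmoid turns velocity into the probability that the bit is 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score

# Hypothetical toy problem: features 0 and 2 are informative, the rest are noise.
INFORMATIVE = {0, 2}

def toy_fitness(mask):
    # Reward informative features and lightly penalize the subset size.
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

mask, score = binary_pso_select(toy_fitness, n_features=6)
print(mask, score)
```

The size penalty in `toy_fitness` stands in for the usual trade-off between classification performance and the number of selected features.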

This thesis develops two dictionary-based data oversampling approaches that generate new synthetic documents from the original training data sets for medical document classification problems. The proposed approaches can generate synthetic documents with greater variety than similar methods. They also balance an imbalanced data set and improve classification performance. The results show improved classification performance.
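The general idea of dictionary-based oversampling can be sketched as follows. This is only an illustrative example under simple assumptions, not the thesis's algorithm: the `SYNONYMS` table is a hypothetical toy dictionary, and `synthesize` and `oversample` are names introduced here. Minority classes are topped up with synonym-substituted copies of their own documents until every class matches the size of the largest one.

```python
import random
from collections import Counter

# Hypothetical toy dictionary; a real system would draw on a much larger resource.
SYNONYMS = {
    "heart": ["cardiac"],
    "attack": ["infarction"],
    "high": ["elevated"],
}

def synthesize(doc, rng):
    """Create a variant of doc by swapping each dictionary word for a random synonym."""
    return " ".join(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in doc.split())

def oversample(docs, labels, seed=0):
    """Add synthetic documents until every class is as large as the biggest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_docs, out_labels = list(docs), list(labels)
    for label, count in counts.items():
        pool = [d for d, l in zip(docs, labels) if l == label]
        for _ in range(target - count):
            out_docs.append(synthesize(rng.choice(pool), rng))
            out_labels.append(label)
    return out_docs, out_labels

docs = ["heart attack", "high pressure", "broken arm"]
labels = ["cardio", "cardio", "ortho"]
new_docs, new_labels = oversample(docs, labels)
print(Counter(new_labels))
```

In this sketch a document containing no dictionary words is copied unchanged, so the variety of the synthetic documents depends entirely on dictionary coverage.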

History

Copyright Date

2021-12-14

Date of Award

2021-12-14

Publisher

Te Herenga Waka—Victoria University of Wellington

Rights License

Author Retains All Rights

Degree Discipline

Computer Science

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level

Doctoral

Degree Name

Doctor of Philosophy

Victoria University of Wellington Item Type

Awarded Doctoral Thesis

Language

en_NZ

Victoria University of Wellington School

School of Engineering and Computer Science

Advisors

Gao, Xiaoying; Mei, Yi
