Open Access Te Herenga Waka-Victoria University of Wellington
thesis_access.pdf (5.16 MB)

De novo Molecular Design using Deep Learning

posted on 2024-01-23, 01:35 authored by Hoang Nguyen

The growth of data science, computer science, and artificial intelligence has transformed traditional drug discovery. The information era has opened new opportunities across many research fields. The introduction of computer-aided stages (e.g., molecule generation, property prediction, and virtual screening) into the drug discovery pipeline has greatly improved the success rate of finding promising molecules. Despite these initial accomplishments, computer-aided drug discovery still needs significant improvement. Among its well-known topics, 'de novo molecular design' attracts a large number of researchers. De novo molecular design aims to uncover novel molecules in the vast chemical space, which has not been fully explored. Although various deep learning architectures have been proposed for molecule generation, each approach has limitations that need to be addressed. Additionally, since molecule generation is a random and non-directional process, finding drug candidates with desired properties among billions of molecules is almost infeasible. To tackle this problem, several optimization techniques have been used to direct the generative model to produce 'molecules of interest'. However, property optimization restricts the 'creativity' of the generative model. Furthermore, not every desired property can be optimized because of insufficient data, and optimization-driven generation is computationally expensive. In such cases, using Quantitative Structure-Activity Relationship (QSAR) models is an alternative solution for identifying molecules with desired properties.

The overall goal of this thesis is to develop a generative model and a series of QSAR models for drug discovery. The generative model is used to produce novel molecules, while the QSAR models are used to virtually screen for molecules with desired properties. To achieve this goal, a range of computational techniques and interdisciplinary knowledge is employed. First, we conducted a critical review of existing molecular representations, generative models, and property prediction models. The review gives readers a fundamental understanding of de novo molecular design: it analyzes the pros and cons of each molecular representation and summarizes the current state and challenges of molecular generation and property prediction. Second, we investigated a novel deep learning architecture for de novo molecular design. The architecture is designed to process graph-structured data. The generative model developed using the proposed architecture can produce hypothetical molecules with high novelty and diversity. Experimental results indicated that our generative model can create drug-like molecules varying in size, scaffold, and properties.

Third, we proposed two novel deep learning architectures for molecular property prediction. These two architectures, the Residual Graph Attention (ResGAT) Network and the Graph Convolution-Attention Network (GCoAtNet), are designed to process graph-structured data. Our models were benchmarked against state-of-the-art architectures on nine molecular datasets. The findings demonstrated that ResGAT achieved competitive performance, while GCoAtNet outperformed the state-of-the-art architectures. Finally, we used these proposed architectures to construct a generative model and two QSAR models. The generative model was driven to produce a large number of hypothetical molecules. Subsequently, these molecules were virtually screened to eliminate those with drug-induced liver injury (property 1) and Cytochrome-P450-inhibitory (property 2) activities. For each property, we developed two QSAR models that can independently identify molecules with desired properties. The intersection set of molecules suggested by these two models was considered a short list of potential drug candidates. These shortlisted molecules can be sent to the chemistry lab for further investigation, i.e., structural optimization and modification, synthesis, and evaluation. The results demonstrated that these computer-designed molecules are synthesizable and suitable for further research.
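
The shortlisting step described above, keeping only molecules accepted by two independent models, can be illustrated with a minimal sketch. This is not the thesis implementation: the `Predictor` type, the `shortlist` function, and the toy string-based rules are hypothetical stand-ins for trained QSAR models, and molecules are assumed to be given as SMILES strings.

```python
from typing import Callable, Iterable, Set

# Hypothetical predictor type (an assumption, not from the thesis): maps a
# SMILES string to True when the model predicts the molecule is free of the
# unwanted activity.
Predictor = Callable[[str], bool]


def shortlist(molecules: Iterable[str],
              model_a: Predictor,
              model_b: Predictor) -> Set[str]:
    """Return the molecules accepted by both models (the intersection set)."""
    mols = list(molecules)  # materialize once so both models see the same input
    accepted_a = {m for m in mols if model_a(m)}
    accepted_b = {m for m in mols if model_b(m)}
    return accepted_a & accepted_b


if __name__ == "__main__":
    generated = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]

    # Toy rules standing in for trained QSAR models; they carry no chemical meaning.
    def toy_model_a(smiles: str) -> bool:
        return "c" not in smiles

    def toy_model_b(smiles: str) -> bool:
        return len(smiles) <= 3

    print(sorted(shortlist(generated, toy_model_a, toy_model_b)))
```

Taking the intersection rather than the union is the conservative choice: a molecule reaches the short list only when both models agree, which trades recall for fewer false positives.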


Copyright Date


Date of Award



Te Herenga Waka—Victoria University of Wellington

Rights License


Degree Discipline

Computer Science

Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Degree Level


Degree Name

Doctor of Philosophy

Victoria University of Wellington Unit

Engineering at Victoria

ANZSRC Socio-Economic Outcome code

209999 Other health not elsewhere classified; 220403 Artificial intelligence; 240899 Human pharmaceutical products not elsewhere classified

ANZSRC Type Of Activity code

3 Applied research

Victoria University of Wellington Item Type

Awarded Doctoral Thesis



Victoria University of Wellington School

School of Mathematics and Statistics


Nguyen, Binh; Teesdale-Spittle, Paul