A Workflow for Genome Assembly and Annotation of Hāpuku (Polyprion oxygeneios)
Recent advances in long-read DNA sequencing have made whole genome sequencing (WGS) and assembly significantly easier for researchers with limited resources. In particular, the Oxford Nanopore Technologies (ONT) long-read sequencing platform is unique in that both the sequencing runs and the instrument itself are relatively inexpensive. Low-cost sequencing means that non-specialists are likely to incorporate genome assembly and annotation routinely into a broader range of projects and applications, which calls for better computational workflows. In this thesis, a workflow for genome assembly and annotation using nanopore long reads is presented and applied to Hāpuku, a species being developed for aquaculture at NIWA's Northland Aquaculture Centre in New Zealand. The workflow takes advantage of the high accuracy of the latest nanopore base calling models and incorporates the recent duplex and modified base calling models. A tutorial covering installation and the commands used is included, since these details are often omitted from published papers. To demonstrate the workflow, nanopore long-read sequencing was carried out on a single Hāpuku individual. A total of 158 Gb of reads was produced across two runs, using a GridION sequencer and a PromethION2 sequencer. The final polished assembly, produced with the Flye assembler and the Medaka polisher, had a length of 744 Mb, corresponding to a coverage of ~212×. Genome quality metrics were high, with an N50 of 20.7 Mb, an L50 of 15 and a BUSCO completeness of 98.9% using the most recent Actinopterygii gene database. Extremely low heterozygosity of 0.089 was observed in the genome, although the reliability of the long-read SNP calling used needs further investigation. Annotation was carried out on the assembly to characterise repetitive regions, tRNAs, rRNAs and protein-coding genes in Hāpuku. 
A high repeat content of 47.8% was found, with the most common repeat type, DNA transposons, making up 17.5% of the genome. The annotation also identified 139 rRNAs, 2,223 standard tRNAs and 31,872 putative genes (before filtering). A 16,509 bp mitochondrial genome was also obtained and was an almost perfect match to a previous mitochondrial genome assembly for Hāpuku. Frequency distributions of the modified bases 5-methylcytosine and 5-hydroxymethylcytosine in repetitive and unique regions are presented, revealing increased methylation in repetitive regions but similar levels of hydroxymethylation in both. Synteny analysis of the genome against another Perciformes species suggests that all chromosomes have been recovered in the assembly, with many of them dominated by a single scaffold. The workflow generated a high-quality reference genome assembly for Hāpuku and permitted a range of annotations, demonstrating that functionally useful whole-genome resources can be generated from DNA reads from a single low-cost sequencing technology. Assembly contiguity and completeness were high even without additional contiguity information, indicating that long-read sequencing is a promising approach for de novo assembly projects. Finally, the genomic resources produced for Hāpuku by this workflow provide important first steps towards a whole-genome informed selective breeding program.
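The contiguity and coverage figures quoted above follow standard definitions, which can be illustrated with a minimal Python sketch. This is not the code used in the thesis: the function names n50_l50 and coverage are hypothetical, and the contig lengths are toy values chosen only to show the calculation. The coverage is taken here as total read bases divided by assembly length, which reproduces the reported value of ~212× from 158 Gb of reads and a 744 Mb assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the contig at which the running total, taken
    from longest to shortest, first reaches half of the assembly length.
    L50 is the number of contigs needed to reach that point.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


def coverage(total_read_bases, assembly_length):
    """Approximate sequencing depth: read bases per assembled base."""
    return total_read_bases / assembly_length


# Toy contig lengths (hypothetical, not from the Hapuku assembly).
toy_contigs = [50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))

# Figures from the abstract: 158 Gb of reads, 744 Mb assembly.
print(round(coverage(158e9, 744e6)))
```

In practice these statistics would normally be taken from the output of an assembly evaluation tool; the sketch only makes the definitions explicit.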