A Corpus of Modern Spoken Māori
The Māori Broadcast Corpus (MBC) is a representative corpus of contemporary spoken Māori. The corpus was designed and compiled, then used to identify and describe various aspects of the lexicon of modern spoken Māori. The corpus contains approximately one million words of running text across several text categories, selected and transcribed from Māori-medium broadcasts in 1995 and 1996. The broadcast sources were Te Reo Irirangi o te Upoko-o-te-ika, Radio New Zealand, and Television New Zealand. The corpus files, together with accompanying explanatory and descriptive information and word lists, are available on the compact disc that accompanies this document.
Initial analysis of the corpus identified 10,289 different word types in the 1,005,364 tokens, or running words of text. The analysis focused on high frequency vocabulary and on patterns of distribution. A small number of high frequency words provide most of the coverage of the texts: 165 word types make up approximately 80% of all the words in the texts in the corpus, 200 word types give 82.4% coverage, and 2,000 give 97.62% coverage. This has implications for learners of Māori. Knowing the most frequent word types, their meanings and their uses is crucial to the comprehension not only of Māori broadcast texts but also of other texts.
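The coverage figures above are cumulative proportions of tokens accounted for by the most frequent word types. A minimal sketch of that calculation follows. It assumes lowercase, whitespace-separated tokens, which may differ from the tokenization conventions used for the MBC. The function name `coverage_by_rank` and the toy sentence are illustrative only and are not MBC data.

```python
from collections import Counter

def coverage_by_rank(tokens, ranks):
    """Return {n: share of all tokens covered by the n most frequent types}."""
    counts = Counter(tokens)                       # frequency of each word type
    total = sum(counts.values())                   # number of running words (tokens)
    freqs = sorted(counts.values(), reverse=True)  # type frequencies, highest first
    result = {}
    for n in ranks:
        result[n] = sum(freqs[:n]) / total if total else 0.0
    return result

# Toy token list (hypothetical example, not taken from the corpus)
tokens = "ka haere te tangata ki te whare ka kai te tangata i te kai".split()
n_types = len(set(tokens))                 # number of distinct word types
cov = coverage_by_rank(tokens, [1, 2, 3])  # coverage of the top 1, 2 and 3 types
```

Applied to a full frequency list, the same calculation would give the proportion of running text covered by the top 165, 200 or 2,000 word types.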
The analysis extended beyond the identification of word types to explore word sense and word sense distribution in selected high frequency word types, using concordance data from the MBC. This analysis revealed that, where word types could be used as both function words and content words, the function word uses were far more frequent. There was a degree of polysemy in the word types examined, with some meanings far more frequent than others. Word senses were identified that have yet to be recorded in dictionaries. The analysis showed the potential of the MBC for adding to what is already known about the lexicon of Māori by providing frequency and other distributional information, together with new word senses currently absent from the available dictionaries and grammars of Māori. Implications of the MBC for the learning and teaching of Māori were discussed, and some applications to language learning and teaching were outlined. Future corpus-based research was suggested.
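Concordance data of the kind used in the word sense analysis can be pictured as keyword-in-context lines: each occurrence of a word type shown with a few words on either side, so that its senses can be sorted by hand. The sketch below is an assumption about the general technique, not a description of the software used with the MBC. The function name `kwic`, the `width` parameter and the sample sentence are hypothetical.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])     # up to `width` words before
            right = " ".join(tokens[i + 1:i + 1 + width])    # up to `width` words after
            lines.append((left, tok, right))
    return lines

# Toy sentence (hypothetical example, not taken from the corpus)
sample = "ka haere te tangata ki te whare".split()
hits = kwic(sample, "te", width=2)
for left, key, right in hits:
    print(f"{left:>15}  [{key}]  {right}")
```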