Māori Vocabulary: A Study of Some High Frequency Homonyms
The problem addressed in this thesis concerns the accuracy of Māori language vocabulary counts, e.g. Boyce (2006), where Māori was found to use a very small vocabulary in comparison with, for example, English. As Boyce (2006, ii) acknowledges, this is partly explained by the degree of homonymy in Māori, which undermines the accuracy of the count. Homonymy is the phenomenon of the same string of letters (word-form) having two or more unrelated meanings (e.g. kī ‘say’, ‘be full’). Automated word-form counts of Māori language texts count every occurrence of the form kī as the same word, regardless of its meaning. Unless different meanings of the same word-form are counted as different words, such counts will underestimate the vocabulary of the Māori language. (Homonymy is not the only explanation for the low count; further explanations have been suggested by Bauer (2009) and Nation (2011).) The thesis explores whether there are consistent clues in the linguistic environment that signal the correct interpretation of homonyms in texts, and if so, how such clues could be used for tagging corpora so that counting would be more accurate. The Boyce corpus of modern broadcast Māori (Boyce, 2006, ii) provided the data. Case studies were made of three high-frequency homonyms in this corpus: kī ‘say’, ‘full’; mea ‘say’, ‘thing’; and tau ‘settle’, ‘year’. Lyons' (1968) criterion of distinction was applied to establish the lexemes realised by each of these word-forms on the basis of dictionary and etymological information. The tokens of each word-form were then extracted from Boyce’s (2006) corpus using the concordance program ‘WordSmith Tools’. WordSmith Tools is a computer program for examining how words behave in a text; Concord, which is part of WordSmith Tools, enables the user to see any word or phrase in context.
Phrase peripheries (the words before and after each word-form in the same phrase) were analysed, and the wider syntactic environment was also examined, in order to find clues that signalled the appropriate lexeme for each token. The results showed that the lexemes from all three case studies could be identified in the corpus on the basis of consistent clues that occur in their linguistic environments. If the phrasal periphery of the word-form is examined, and the grammatical information supplied by the wider linguistic environment is taken into account, it is possible to determine the appropriate lexemic tag for a word-form in a Māori corpus.
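The extraction step described above can be illustrated with a minimal keyword-in-context sketch in Python. This is not WordSmith Tools or its Concord component, and it is not the procedure used in the thesis; it only shows the general idea of collecting the words immediately before and after each token of a word-form. The function names `concordance` and `left_neighbours`, the `window` parameter, and the two sample sentences are assumptions introduced for illustration; the sentences are invented and do not come from the Boyce corpus. The sketch also works on a fixed window of words rather than on true phrase boundaries, which would require grammatical analysis.

```python
import re
import unicodedata
from collections import Counter


def concordance(sentences, word_form, window=2):
    """Return (left context, token, right context) for each hit.

    Hypothetical helper: a fixed-width word window stands in for the
    phrase periphery, which a real analysis would delimit grammatically.
    """
    # Normalise to NFC so that macronised vowels such as "ī" are single
    # code points and are matched as word characters.
    target = unicodedata.normalize("NFC", word_form).lower()
    hits = []
    for sentence in sentences:
        text = unicodedata.normalize("NFC", sentence).lower()
        tokens = re.findall(r"\w+", text)
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                hits.append((left, token, right))
    return hits


def left_neighbours(hits):
    """Count the word immediately preceding each hit, if there is one."""
    return Counter(left[-1] for left, _, _ in hits if left)


# Invented sample sentences (assumption, not corpus data).
sample = ["Ka kī te tangata.", "Kua kī te kete."]
for left, token, right in concordance(sample, "kī", window=1):
    print(" ".join(left), f"[{token}]", " ".join(right))
```

Tallies such as those returned by `left_neighbours` could then be inspected by hand to see whether particular neighbouring words co-occur consistently with one lexeme; any tagging rule derived from them would still need to be checked against the wider syntactic environment.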