Browse
- No file added yet -

# Finite Mixture Models - A Divide and Conquer Approach

thesis
posted on 2022-10-09, 04:13 authored by Owen, Flynn

Within the field of data clustering, methods are commonly referred to as either 'distance-based' or 'mixture-based', with each having certain limitations. Distance-based methods are fast, but lack an underlying parametric model. Mixture-based methods are interpretable and generalise easily to distributions other than Gaussian, but are slow to estimate, especially when the number of clusters is large. Although generalisable to any distribution within the exponential family, we constrain our focus to only binary data, and thus consider row clustering of data from the Bernoulli distribution. Taking inspiration from divisive hierarchical clustering and develop four 'divide and conquer' mixture-based algorithms to rapidly estimate mixture models with large numbers of components. Our algorithms are: 1) The Hierarchical No Turning Back (HNTB) algorithm, 2) The Hierarchical Single Reallocation (HSR) algorithm, 3) The Hierarchical Reallocation Loop (HRL) algorithm 4) The Hierarchical algorithm that Reallocates During execution (HRD). Our algorithms work by recursively fitting a data matrix with R = 1 and R = 2 mixture components, and then partitioning this data matrix dependent on cluster membership, halting when no sub-components can be partitioned further. We then adapt the 'E-step' from the Expectation-Maximisation (EM) algorithm to reallocate data points across each estimated sub-component of the data matrix. We perform a series of Monte Carlo simulation studies on a series of generated datasets, and compare our results against an adapted version of well known 'Wolfe's method’ (Wolfe, 1971), which maximises the likelihood directly. We establish that 'divide and conquer' mixture modelling approaches are able to identify mixtures with large numbers of components at a much faster rate than our adapted version of Wolfe's method, and approximate both the true coefficients used to simulate the data, as well as the results from our adapted method, to a good degree. Algorithms introduced in this thesis provide suitable methods to address scalability with the ever increasing magnitude of data being stored and processed, while ensuring interpretability with the increasing demand for data-driven insights.

2022-10-09

2022-10-09

## Publisher

Te Herenga Waka—Victoria University of Wellington

## Degree Discipline

Statistics and Operations Research

## Degree Grantor

Te Herenga Waka—Victoria University of Wellington

Masters

## Degree Name

Master of Science

## ANZSRC Type Of Activity code

2 Strategic basic research

## Victoria University of Wellington Item Type

Awarded Research Masters Thesis

en_NZ

## Victoria University of Wellington School

School of Mathematics and Statistics

Arnold, Richard; McMillan, Louise

## Exports

RefWorks
BibTeX
Ref. manager
Endnote
DataCite
NLM
DC
figshare. credit for all your research.