Resource and Performance Modelling of Hadoop Clusters Using Machine Learning
There is a huge and rapidly increasing amount of data being generated by social media, mobile applications and sensing devices. Big data is the term usually used to describe such data and is described in terms of the 3Vs - volume, variety and velocity. In order to process and mine such a massive amount of data, several approaches and platforms have been developed such as Hadoop. Hadoop is a popular open source distributed and parallel computing framework. It has a large number of configurable parameters which can be set before the execution of jobs to optimize the resource utilization and execution time of the clusters. These parameters have a significant impact on system resources and execution time. Optimizing the performance of a Hadoop cluster by tuning such a large number of parameters is a tedious task. Most current big data modeling approaches do not include the complex interaction between configuration parameters and the cluster environment changes such as use of different datasets or types of query. This makes it difficult to predict for example the execution time of a job or resource utilization of a cluster. Other attributes include configuration parameters, the structure of query, the dataset, number of nodes and the infrastructure used. Our first main objective was to design reliable experiments to understand the relationship between attributes. Before designing and implementing the actual experiment we applied Hazard and Operability (HAZOP) analysis to identify operational hazards. These hazards can affect normal working of cluster and execution of Hadoop jobs. This brainstorming activity improved the design and implementation of our experiments by improving the internal validity of the experiments. It also helped us to identify the considerations that must be taken into account for reliable results. After implementing our design, we characterized the relationship between different Hadoop configuration parameters, network and system performance measures. Our second main objective was to investigate the use of machine learning to model and predict the resource utilization and execution time of Hadoop jobs. Resource utilization and execution time of Hadoop jobs are affected by different attributes such as configuration parameters and structure of query. In order to estimate or predict either qualitatively or quantitatively the level of resource utilization and execution time, it is important to understand the impact of different combinations of these Hadoop job attributes. You could conduct experiments with many different combinations of parameters to uncover this but it is very difficult to run such a large number of jobs with different combinations of Hadoop job attributes and then interpret the data manually. It is very difficult to extract patterns from the data and give a model that can generalize for an unseen scenario. In order to automate the process of data extraction and modeling the complex behavior of different attributes of Hadoop job machine learning was used. Our decision tree based approach enabled us to systematically discover significant patterns in data. Our results showed that the decision tree models constructed for different resources and execution time were informative and robust. They were able to generalize over a wide range of minor and major environmental changes such as change in dataset, cluster size and infrastructure such as Amazon EC2. Moreover, the use of different correlation and regression techniques, such as M5P, Pearson's correlation and k-means clustering, confirmed our findings and provided further insight into the relationship of different attributes and with each other. M5P is a classification and regression technique that predicted the functional relationships among different job attributes. The use of k-means clustering allowed us to see the experimental runs that shows similar resource utilization and execution time. Statistical significance tests, were used to validate the significance of changes in results of different experimental runs, also showed the effectiveness of our resource and performance modelling and prediction method.