Scheduling and Resource Provisioning Algorithms for Scientific Workflows on Commercial Clouds
Basic science is becoming ever more computationally intensive, increasing the need for large-scale compute and storage resources, whether within a high-performance computing (HPC) cluster or, more recently, within the cloud. Commercial clouds have increasingly become a viable platform for hosting scientific analyses and computation due to their elasticity, the recent introduction of specialist hardware, and their pay-as-you-go cost model. This computing paradigm therefore presents a low-capital, low-barrier alternative to operating dedicated eScience infrastructure. Indeed, commercial clouds now enable universal access to capabilities previously available only to large, well-funded research groups. While the potential benefits of cloud computing are clear, there are still significant technical hurdles associated with obtaining the best execution efficiency whilst trading off cost. In most cases, large-scale scientific computation is represented as a workflow for scheduling and runtime provisioning. Such scheduling becomes an even more challenging problem on cloud systems due to the dynamic nature of the cloud, in particular its elasticity, its pricing models (both static and dynamic), its non-homogeneous resource types, and its vast array of services. This mapping of workflow tasks onto a set of provisioned instances is an example of the general scheduling problem and is NP-complete. In addition, certain runtime constraints must be met, the most typical being the cost of the computation and the time that the computation requires to complete. This thesis addresses the 'scientific workflow scheduling problem in the cloud', which is to schedule workflow tasks on cloud resources in such a way that users meet their defined constraints, such as budget and deadline, and providers maximize profit and resource utilization.
Moreover, it explores different mechanisms and strategies for distributing the defined constraints over a workflow and investigates their impact on the overall cost of the resulting schedule.
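To make the idea of distributing a constraint over a workflow concrete, the following is a minimal sketch of one simple strategy, and not necessarily one used in this thesis: a workflow-level deadline is split into per-task sub-deadlines in proportion to each task's earliest finish time on the critical path. The function name `distribute_deadline`, the task names, and the runtime estimates are all hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter


def distribute_deadline(runtimes, parents, deadline):
    """Split a workflow deadline into per-task sub-deadlines.

    runtimes: estimated runtime of each task (hypothetical values).
    parents:  for each task, the tasks that must finish before it starts.
    deadline: the user-defined deadline for the whole workflow.
    """
    # Visit tasks so that every parent is processed before its children.
    graph = {task: parents.get(task, ()) for task in runtimes}
    order = TopologicalSorter(graph).static_order()

    # Earliest finish time (EFT): a task starts once all parents are done.
    eft = {}
    for task in order:
        start = max((eft[p] for p in graph[task]), default=0.0)
        eft[task] = start + runtimes[task]

    # The critical path length is the largest EFT in the workflow.
    critical_path = max(eft.values())

    # Stretch (or shrink) every EFT by the same factor so that the
    # last task on the critical path finishes exactly at the deadline.
    scale = deadline / critical_path
    return {task: finish * scale for task, finish in eft.items()}


# A four-task diamond-shaped workflow: a -> (b, c) -> d.
runtimes = {"a": 2, "b": 3, "c": 1, "d": 4}
parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
sub_deadlines = distribute_deadline(runtimes, parents, deadline=18)
```

Under these assumed inputs the critical path a, b, d takes 9 time units, so a deadline of 18 doubles every earliest finish time. A scheduler could then pick, for each task, the cheapest instance type that still meets that task's sub-deadline; other distribution strategies would change these sub-deadlines and therefore the cost of the resulting schedule.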