Cluster Resource Scheduling: A Tale on Fairness and Efficiency

PhD Thesis Proposal Defence


Title: "Cluster Resource Scheduling: A Tale on Fairness and Efficiency"

by

Mr. Chen CHEN


Abstract:

With the explosive growth of large-scale data analytics, it has become 
prevalent to run data-parallel jobs in clusters of distributed servers. In 
shared production clusters that host a variety of workloads, the resource 
scheduler is of paramount importance to cluster performance. A scheduler 
has two basic objectives, efficiency and fairness: an ideal scheduler 
should minimize the mean job response time while guaranteeing predictable 
job performance by avoiding starvation.

In practice, however, performance optimality and predictability conflict 
with each other under today's cluster schedulers, leading to a dilemma: 
either obtain predictable performance at the expense of long response 
times, or run the risk of starving some jobs in pursuit of minimal mean 
response time. It is therefore critical to develop scheduling strategies 
that do well in both worlds, which is the very focus of this thesis. 
Specifically, we make the following three key contributions.

First, we present Cluster Fair Queuing (CFQ), a cluster resource 
scheduling mechanism that minimizes the mean job response time while 
achieving predictable performance. Observing the inefficiency of 
instantaneous fair sharing, CFQ preferentially offers resources to the 
jobs that would finish earliest under an idealized fair sharing policy.

Second, we reveal that service isolation for dependent data-parallel jobs 
is crucial for both fairness and efficiency, yet it is not guaranteed in 
scenarios with fine-grained resource sharing, even when the jobs are 
assigned high priorities. We identify the underlying reasons and propose 
Speculative Slot Reservation to achieve service isolation, by judiciously 
reserving slots for the downstream computations of jobs with internal 
dependencies.

Third, we observe an interesting property of data-parallel jobs, Demand 
Elasticity: some jobs can run with far fewer resources than they ideally 
need without noticeable performance degradation. We then propose 
Performance-Aware Fair (PAF) scheduling to exploit this property, speeding 
up overall job performance while ensuring near-optimal fairness. PAF works 
by iteratively transferring resources from certain jobs to others that can 
utilize those resources more efficiently, as long as the donor jobs' 
performance is not degraded beyond a small threshold.


Date:			Tuesday, 24 April 2018

Time:			3:00pm - 5:00pm

Venue:			Room 4475
			(lifts 25/26)

Committee Members:	Prof. Bo Li (Supervisor)
			Dr. Wei Wang (Supervisor)
			Prof. Lei Chen (Chairperson)
			Dr. Ke Yi


**** ALL are Welcome ****