POWER EFFICIENT SCHEDULING FOR

(1)

P

OWER

E

FFICIENT

SCHEDULING FOR

A

C

LOUD

S

YSTEM

Joachim Sjöblom

Master of Science Thesis

Supervisor: Prof. Johan Lilius Advisor: Dr. Sébastien Lafond Embedded Systems Laboratory Department of Information Technologies Åbo Akademi University September 2011

(2)

ABSTRACT

The aim of this thesis is to investigate the Linux Kernel and evaluate the available power-saving functionality. We propose changes to the current scheduler to improve said power-saving and explaining the role said improved scheduler would have in a Cloud System managed by a PID controller. We simulate scheduling for a number of different topologies and workloads using LinSched. This thesis presents results gathered for both the proposed scheduler and the current Linux scheduler and doing a comparison. The results indicate that the proposed scheduler has potential, but needs further work.

(3)

ABBREVIATIONS

ALB Active Load Balance

API Application Programmming Interface.

CFS Completely Fair Scheduler.

CPU Central processing unit.

FIFO First In, First Out.

HOS Host Operating System.

ILB Idle Load Balance/Balancer

LB Load Balancing or Load Balancer.

OGS Overlord Guided Scheduler.

PID Proportional-Integral-Derivative. PM Power Manager RBT Red-Black Tree. RQ Runqueue. RR Round Robin. RT Real-Time. RTS Real-Time Scheduling. SD Scheduling Domain. SG Scheduling Group.

(4)

SMP Symmetric Multiprocessing.

(5)

GLOSSARY

ALB A more aggressive type of LB. Requires at least one task per physical CPU and will push running tasks off the busiest RQ to accomplish this.

Busy Time Period of time during which the CPU is executing.

Cloud Platform for computational and data access services where details of physical location of the hardware is not necessarily of concern for the end user

CPU Will herein be used colloquially for both logical and physical CPUs, as well as cores. If a distinction is to be made then the whole term of the resource in question will be used.

Data center Facility that houses computer systems.

DVFS Dynamic Voltage and Frequency Scaling is used to adjust the input voltage and clock frequency according to the momentary need in order to avoid unnecessary energy consumption

Granularity Granularity describes the extent a system is broken down into smaller parts.

HZ Constant defining the timer interrupt frequency in the Linux Kernel.

ILB Used in tickless scheduling. Idle CPUs have their ticks turned off and will thus, if they are needed, have to be reactivated by the non-idle CPU that has been designated as the ILB.

Idle Time Period of time during which the CPU is completely idle.

Jiffies Variable which stores the amount of passed time quantum - normally one quantum is 10ms - since system bootup.

(6)

Scheduling Domain A scheduling domain is a set of CPUs which share properties and scheduling policies. Scheduling domains are hierarchical; a multi-level system will have multiple levels of domains.

Scheduling Group Each domain is split into scheduling groups. E.g. in a uniprocessor or SMP system, each physical CPU is a group

Server farm A server farm is a collection of servers. They are used when a single server is not capable of providing the required service

SMP In symmetric multiprocessing two or more processors are connected to the same main memory.

Tickless Scheduling Only use periodic ticks when a CPU is not idle.

vruntime The variable storing the amount of time a CFS task has spent on a CPU

(7)

LIST OF FIGURES

2.1 Example of a red-black tree [14] . . . 6

2.2 Structure hierarchy for tasks and the red-black tree [14] . . . 7

2.3 Illustrating the CFS RQ selection process. . . 8

2.4 SCHED_MC not in use [13] . . . 13

2.5 SCHED_MC in use [13] . . . 13

2.6 SCHED_SMT not in use [13] . . . 14

2.7 SCHED_SMT in use [13] . . . 15

2.8 Example of the Load Averages over a 24 hour period. [9] . . . . 19

3.1 A very simple example of a PID controller. . . 21

4.1 Illustration of how the weighted CPU load is assumed to affect perceived business . . . 34

5.1 LinSched Scheduler Simulator Architecture [15] . . . 45

6.1 Execution and delay times for a Dual Socket Quad CPU . . . 51

6.2 CFS and OGS Execution times ratioed for a Dual Socket Quad CPU . . . 52

(8)

6.3 Order of CPU allocation for a Dual Socket Quad CPU . . . 52

6.4 Tasks per CPU for a Dual Socket Quad CPU . . . 53

6.5 Execution and Delay Times per task for a Multi-core Quad CPU 54 6.6 Execution Times ratioed per task for a Multi-core Quad CPU . . 55

6.7 Order of CPU Allocation for a Multi-core Quad CPU . . . 55

6.8 Tasks per CPU for a Multi-core Quad CPU . . . 56

6.9 Execution and Delay Times per task for a Hex CPU with SMT . 57 6.10 Execution Time Ratios per task for a Hex CPU with SMT . . . . 58

6.11 Order of Allocation for a Hex CPU with SMT . . . 59

6.12 CPU Usage for a Hex CPU with SMT . . . 59

6.13 Order of Allocation for a Quad Socket Quad CPU . . . 60

6.14 Execution Time Ratios for a Quad Socket Quad CPU . . . 61

(9)

4 Proposed Scheduler Changes 24 4.1 Runqueue Selection . . . 25 4.1.1 find_idlest_group() vs find_busy_group() . . . 25 4.1.2 find_idlest_cpu() vs find_busy_sched_cpu() . . . 29 4.1.3 what_is_idle_time() . . . 31 4.2 Load Balancing . . . 34 4.2.1 find_busiest_group() . . . 35 4.2.2 find_busiest_queue() . . . 37 4.2.3 calculate_imbalance() . . . 40 4.2.4 move_tasks() . . . 42

5 Tools, Experimentation and Testing 44 5.1 Linsched . . . 44 5.1.1 LinSched Tasks . . . 45 5.1.2 Simulator Limitations . . . 46 5.1.3 Workarounds . . . 46 5.2 Simulated Workloads . . . 47 5.3 Power Consumption . . . 49 6 Results 50 7 Further Work 63 8 Conclusion 65

(11)

Bibliography 67

Swedish Summary 69

A Linux Kernel Change Log 76

(12)

CHAPTER

ONE

INTRODUCTION

A rapidly growing population, both online and offline, is creating an ever increasing demand for electricity. At the same time, an increasing climate awareness has caused a demand for higher and higher power efficiency in, among other things, computers. A lower power consumption is also more economical.

With the Internet and networking slowly becoming ubiquitous server clusters receive more and more requests, which means they have to be expanded in order to handle the increase. The expansion again will lead to more power being consumed, both due to there being more electronics to be powered and because said electronics require cooling and ventilation. Solving this problem simply by increasing the amount of servers in the farms is not sustainable in the long run, as said farms dissipate a sizeable amount of power even when not under heavy loads [1] [10]. Currently these server centres, generally speaking, consume more power than they have to. This is partly because the power consumption scaling is not proportional to the current load over all the boards and processors in the system [17]. That is to say, they have no way of turning parts of the system off if the part in question, based on the systemwide load, is not currently needed. This is where this thesis comes in.

Another student at Åbo Akademi University recently proposed a system where a Proportional-Integral-Derivative (PID) Controller would be used to calculate and control the number of processors (CPU) needed to meet the current demand on the system [11]. While this thesis does not actually concern itself with control theory, there is however a relationship between the

(13)

aforementioned controller and the performed work. This relationship as well as some basic controller theory will be explained in more detail in Chapter 3. At the time of writing it is quite common to use components engineered specifically for server systems. These components are thus built to be able to process a sizeable number of tasks which, since CPU power is proportional to the amount of energy dissipated, means they require quite a lot of energy even when not fully loaded. The proposed solution would utilise CPUs from the ARM family instead of the conventional server-grade CPUs. The ARM CPUs have been engineered with smartphones and embedded systems, which are both environments with inherently scarce resources, in mind. Subsequently their performance is not as good as that of a server-grade CPU, but on the other hand they require a lot less in terms of power. Thus using a large amount of ARM CPUs would give us a better granularity in terms of energy dissipated versus demand for CPU power [21]. If the system’s energy dissipation matches the demand for CPU power then we have a very energy efficient system [19].

1.1 Cloud Software project

The Cloud Software Program (2010-2013) is a SHOK-program financed through TEKES and coordinated by Tivit Oy. Its aim is to improve the competitive position of the Finnish software intensive industry in the global market. The content of this thesis is part of the project.

The research focus for the project in the Embedded Systems Laboratory at The Department of Information Technologies at Åbo Akademi is to evaluate the potential gain for energy efficiency by using low power nodes to provide services. In addition to energy efficiency the total cost of ownership for the cloud server infrastructure is in a central role.

1.2 Purpose of this thesis

This thesis will investigate the Linux kernel’s functionality in terms of scheduling and power-saving. The available functionality’s suitability for use in a Cloud Service with Controlled Power Management will be evaluated and some potential changes will be proposed.

The evaluation will herein be based on simulation results provided by Linsched which are visualised and analysed in Excel, as well as available kernel documentation.

(14)

1.3 Thesis Structure

Chapter 2 will cover the scheduling and power-saving functionality currently available in the Linux kernel starting with the Completely Fair Scheduler (CFS) and its design. This is then followed by a quick look at Real-Time Scheduling (RTS) and Load Balancing (LB). Lastly this chapter will cover the Governors, their purpose and how they accomplish it. Chapter 3 explains how the aforementioned Controlled Power Management is designed and will also clarify where this thesis fits in. Chapter 4 will explain how the kernel could be changed to improve the results achieved with the Power Management (PM), starting with Runqueue (RQ) selection. The LB is not quite ideal either and related change proposals will be mentioned next. As the PM will use a feedback loop, we will need the kernel to report a metric to the Controller and a few ways of accomplishing this will be explained next. The PM will also let the system know how many CPUs are needed to meet the current demand and thus the last part of this chapter will be a look at how this could be done. In chapter 5 we will look at the tools that were used for evaluating performance, as well as how the tests were constructed. Chapter 6 will cover the results and conclusion, while chapter 7 will list potential further work.

(15)

CHAPTER

TWO

SCHEDULING AND POWER MANAGEMENT

IN THE CURRENT LINUX KERNEL

One obvious place to start, when investigating the kernel’s scheduling and power-saving functionality, is the available schedulers. We are not currently interested in older versions of the kernel, so the old O(1) scheduler will not be taken into consideration, hence the focus on the Completely Fair Scheduler (CFS).

2.1 The Completely Fair Scheduler

While the O(1) scheduler solved many of the problems plaguing the pre-2.6 kernels - it eliminated the need to iterate through the entire task list to identify the next task to schedule, which vastly improved scalability and efficiency -it did require a large mass of code to calculate heuristics and was difficult to manage. These issues and other external pressures led to Ingo Molnar developing the CFS, which he based around some previous work by Con Kolivas [14]. CFS has also done away with the time slices which were used by previous Linux schedulers [16].

(16)

2.1.1 CFS Runqueue

Like its name implies, the main purpose of the CFS is to give all tasks a fair and balanced amount of CPU time. To determine the amount of time each task should be allocated, the CFS keeps track of the amount of time each task has already spent on a CPU in a per-task variable called virtual runtime. Thus the smaller the virtual runtime, the less time a task has spent on a CPU and, subsequently, the higher its need for CPU time. CFS is also capable of ensuring that tasks that are not currently runnable, e.g. tasks waiting for user input, will receive a comparable share of CPU time when they actually need it [14]. CFS does not use the same type of queue structure the previous Linux schedulers have though, but rather maintains a time-ordered red-black tree (RBT). RBTs are self-balancing, which means that there is never a path in the tree that is more than twice as long as any other, and operations on the tree occur in O(log(n)) time, where n is the number of nodes in the tree. This means that tasks can be inserted or deleted quickly and efficiently. Figure 2.1 below is an example of an RBT as it may look when used in CFS. The further to the left in the tree a task is stored, the greater its need for CPU time. Conversely, the further to the right, the lesser the need. To maintain the aforementioned CPU time balance CFS chooses the leftmost task as the next task to be scheduled. When the task is done, it adds the time it spent on the CPU to the virtual runtime and, if still runnable, is then inserted back into the tree. The self-ordering attribute combined with the scheduler always updating the virtual runtime according to the amount of time spent on a CPU and always choosing the leftmost task in the tree will ensure we have a balanced CPU time across the whole set of runnable tasks [14].

All tasks in Linux are represented by a task structure called task_struct, wherein the tasks’ current state, its stack, process flags, static and dynamic priorities are stored. This structure does not include anything CFS-related however, so with the introduction of CFS a new structure for tracking schedul-ing information called sched_entity was also introduced. Hierarchically speaking, as can be seen in Figure 2.2, the task_struct sits on top and fully describes the task. It includes the sched_entity structure. The sched_entity then, as mentioned, stores CFS-related scheduling information and what is probably the most important element; the vruntime. It serves to keep track of the time a task has spent on a CPU and also, by extension, as the RBT index. On the next level down, we then find the cf s_rq struct of which there is one per RQ and holds information about its associated RBT [14] [16].

(17)

Figure 2.1: Example of a red-black tree [14]

2.1.2 Runqueue Selection

When a new task is created, the scheduler has to select a RQ onto which to schedule it. This is done by calling the select_task_rq function in sched.c. The function call is actually scheduling class-specific and the algorithm that is actually used to select the RQ will depend on the task’s type. Tasks classified as RT tasks will thus use a different algorithm from the ones that are not. We are currently not interested in RT tasks however, and will focus on the task classification handled by the CFS task selection algorithm [5].

The select_task_rq call in sched.c will call a function with the same name located in the file sched_f air.c, where the concrete RQ selection decisions are made. We begin by iterating through the scheduling domains looking for a suitable target domain. The chosen target domain will be the lowest domain fulfilling all the selection criteria - i.e. the last domain in the for-loop to fulfill said criteria. The aforementioned criteria are dependent on the flags set for the current scheduling domain and will thus change slightly depending on SD conditions outside this function’s control [6].

As is illustrated in Figure 2.3, once a target SD has been found, we enter a while-loop and start out by double-checking the suitability flags. If the checks fail we move to a lower domain and try again. When the checks

(18)

Figure 2.2: Structure hierarchy for tasks and the red-black tree [14] have been cleared we look for the idlest group in the domain by calling f ind_idlest_group(). This function is quite simple and essentially just tallies the load for each scheduling group (SG) in the current domain, selects the idlest group and returns it. If N U LL is returned, i.e. if no suitable SG was found, we move on to a lower SD and try again. When a SG is returned we then want to find the idlest CPU, which is done by calling f ind_idlest_cpu(). This function is also quite simple, as it just iterates through the CPUs in the SG and selects the one with the lowest weighted_cpuload, which it then returns. Again, if no suitable candidate is returned, we move on to a lower SD and try again from the top of the while-loop. Then, before returning the CPU onto which the new task is to be scheduled, we do one last thing, namely make sure there is no lower level on the CPU which would be a more suitable candidate. If we do find one, then we iterate through the while-loop again. If we don’t; the while-loop breaks, the chosen CPU is returned and we are done. This being the core of CFS’ RQ selection thus means that all CPUs will utilised equally, which, when considering how we would like to save power, is not ideal. The

(19)

functionality is actually the exact opposite of our intentions [6].

Figure 2.3: Illustrating the CFS RQ selection process.

2.1.3 Load Balancing

Since the set of runnable tasks is not constant while the system is running, the conditions for an ideal task distribution are bound to change. When it does we will need some way of redefining the task distribution, which is where the load balancer (LB) comes in. Naturally the use of a load balancer is redundant for systems with only one CPU.

The Linux scheduler is built to tick periodically, with the length of the tick period being dependent on the HZ value used. The HZ defines how frequently we get a timer interrupt. Traditionally, up until kernel version 2.4, the HZ value has been 100, which equals one interrupt every 10ms. Nowadays

(20)

the default value is on the order of 1000, which equates to an interrupt interval of 1ms [8]. On every interrupt the timer code calls scheduler_tick() in sched.c, at which point we perform some scheduling-related maintenance. We will focus on the parts related to LB, since the other parts are not relevant for what we want to accomplish. As mentioned before, LB is not necessary if we are not using an SMP system. In the CFS case the LB process begins with trigger_load_balance() in sched_f air.c being called from scheduler_tick() in sched.c.

Tickless Scheduling

Linux has since at least Kernel version 2.6.21 been able to do tickless scheduling. Using tickless will currently mean the disabling of periodic ticks for idle CPUs [18].

Normally, the CPU will receive a periodic interrupt, which the scheduler can use as a wake-up call and move from having been idle if need be. Consequently all CPUs in the system will be woken up or interrupted periodically, even if they are idle. This is not ideal, as idle CPUs should be allowed to stay idle for as long as they are not needed, hence the concept of tickless scheduling. With the interrupts, which are essentially the same as ticks in this context, turned off a CPU will be able to stay idle for a lot longer. This scheme does mean the idle CPUs have no way of waking up on their own, which is a problem. The solution, since it is relatively closely related to LB, will be explained to some extent in the following section [18].

The Load Balancing Procedure

1. scheduler_tick() in sched.c is called periodically by the timer code as mentioned above.

2. trigger_load_balance() in sched_f air.c is called from #1. The functionality herein depends on whether or not we are using tickless scheduling. Having to concern ourselves with whether or not we use tickless scheduling when the function call which brought us into this function depends on ticks may sound counter-intuitive. This is simply due to tickless’ having turned off interrupts for idle CPUs, and their need for being looked after. If we are indeed using tickless scheduling, then idle CPUs rely on a specific and herein defined CPU, called "idle load balancer" (ILB), for their LB. At the top of this function we do a check to see if the ILB CPU should change. We also make sure the CPU is not idle,

(21)

as we may not need it and thus return if we do find it to be idle. If we pass all these checks, or if we simply are not using tickless scheduling, we check to see if it is time to perform LB for this RQ and CPU. If it is, then a soft IRQ is raised after which the next function in the process is called.

3. run_rebalance_domains() in sched_f air.c is called from #2. This func-tion’s functionality is also, to a degree, dependent on whether or not we use tickless scheduling. First off we will do LB for the current CPU. However, this function does not really do anything in that regard. We will simply find out the CPU ID and get the RQ and idle type of the CPU in question, after which rebalance_domains() is called. When we return from that function - this is either when the LB is finished, or if we fail one of the checks on the way - we then move on to do ILB for the idle CPUs if this CPU is responsible for it.

4. rebalance_domains() in sched_f air.c is called from #3. This function will check each SD to see if it is due to be balanced and will initiate balancing procedure. Thus most of what this function does is to calculate how long it’s been since the last LB and check if we have exceeded the threshold value. It will also verify the need for serialisation, or lack thereof. We then call load_balance() for the CPUs and SD for which it is required. 5. load_balance() in sched_f air.c is called from #4. Here is where we

actually start looking at how balanced the CPU is within its domain. Quite a bit of the functionality here is related to an obscure branch of the scheduling statistics which do not seem to have any impact on the actual LB. The parts we are interested in are f ind_busiest_group(), f ind_busiest_queue() and move_tasks(), which are called in that order. 6. f ind_busiest_group() is called from #5. The result is returned to #5. It

does pretty much what the name implies. It finds the busiest group in the SD. To start with we need to make sure we have up to date statistics to work with. These are not the same statistics as in #5, but rather a set that lets us know which SG is busiest, which SG is idlest as well as overall SD load and power. Next we perform a number of checks, which, if true, tell us there is no imbalance from the point of view of the current CPU and we jump to out_balanced. These cases are quite logical and intuitive. We will explain these in further detail in Chapter 4.2. If none of the cases apply we move on to calculate_imbalance(), which does what the name implies. Based on the average load per task in the SG, it calculates the amount of weighted load we have to move in order to equalise the perceived imbalance and stores the value. This value is

(22)

stored in imbalance before returning to f ind_busiest_group, which then returns the busiest SG to #5.

7. f ind_busiest_queue() is called from #5 using the result from #6. The result is returned to #5. Before this function is called, we will make sure that the value returned from #6 actually points to a SG. Provided it does, we then proceed to look for the busiest RQ in the SG. This is quite simply done by iterating through all the RQs in the SG and returning the one with the highest weighted load. The RQ is then returned to #5.

8. move_tasks() is called from #5 using the information given by #6 and #7. If the busiest RQ has only one task running, then attempting to move it is pointless and this step is skipped. Otherwise both the target and the busiest RQs will be locked and we attempt to equalise the imbalance by pulling tasks to the target RQ. The actual pulling is done through another sequence of calls starting from this function by calling load_balance_f air(), but they only make sure we move tasks that are actually moveable and thus do not affect the choice of RQ and SG. Due to this they are not of any immediate interest here. We then iterate the do-while loop in move_tasks(), with each iteration reporting the amount of weighted load that was moved, until we have moved the amount specified in imbalance by #6.

When #8 is finished it returns to #5, where the previously locked RQs are now unlocked and we then proceed to check whether or not the move was successful. If it was not then we check whether or not it was because all tasks were pinned - pinned tasks are flagged during task pulling - or if it is because active load balancing (ALB) is required. ALB is the more aggressive and less often used alternative to LB. It requires at least one task to be running on each physical CPU where possible and will pull running tasks off the busiest CPU in order to accomplish this. Other than being more aggressive, only moving one task to an idle CPU and looking at different flags, it does not really differ from normal LB [7].

If we did not need to perform ALB and simply failed to move tasks, then a negative value is returned to #4. The only thing the return value affects at this point is whether or not the CPU’s idle type is changed. If we successfully moved one or more tasks, then the target CPU is obviously no longer idle and should be flagged as such. If we were unsuccessful, then we leave the target CPU flagged as being in the state it was at the beginning of the LB procedure. We then proceed to update the time when the next LB is due for the current SD. With this the LB procedure is finished.

(23)

The end result of the CFS’ default LB procedure, detailed above, is normally an even task distribution over the CPUs available in the system. Therefore using the default CFS will not save us any power and as such it is unsuitable for our purposes.

The CONFIG_SCHED_MC Setting

As was mentioned previously; when used in its default mode, the CFS will attempt to essentially optimise performance by distributing tasks equally over all available CPUs. While this, combined with the next task being chosen based on its need for CPU time, does help overall performance, it does nothing to improve power-saving. Attempting to use all resources equally is essentially the opposite of what we are trying to achieve. This, however, does not mean that CFS is completely incapable of saving power, just that it by default does not.

When building a kernel we are given a large number of options so as to let us tailor the build to our needs. One of these options is called SCHED_M C and appears in the Processor Type and Features section of the Kconf ig file. This means that the option has to be chosen at build-time in order for the setting to be included and useable. To paraphrase the help print in the Kconf ig file it is used for: Multi-core scheduler support improving the CPU scheduler’s decision making when dealing with multi-core CPU chips at a cost of slightly increased overhead in some places [2]. What SCHED_M C then does is to attempt to consolidate our running tasks to as few cores as possible in order to save power [13].

Figure ?? shows the utilisation rate of a system with 8 CPUs with SCHED_M C turned off. We can see that the CPUs are on average utilised to 10%. All the CPUs are used to some extent here even though it is clear we could make do with only a subset of them. Figure ?? is an example of the same system and the same conditions, but this time with SCHED_M C turned on. Now we can see that we only use half of the available CPUs in the system and have let the other half go idle. Although based on the Kernel code, and Figures ?? and ??, the SCHED_M C setting only seems to work for relatively low loads and only in a few cases. By selecting the appropriate option during build-time, we activate the functionality of a number of scheduling-related functions. Apart from adding elements necessary to keep track of the busiest SGs, we also activate the functionality in check_power_save_busiest_group(), which is called from f ind_busiest_group() - i.e. #6 in the LB procedure covered above - in four of the five cases when no imbalance exists from the point of view of the current

(24)

Figure 2.4: SCHED_MC not in use [13]

Figure 2.5: SCHED_MC in use [13]

CPU. It is not called if the current CPU is not the appropriate CPU to perform LB at this level. The call to check_power_save_busiest_group() then results in the idlest SG being returned as the busiest, which, as it will then be used in the normal LB procedure as the busiest group, will result in our ending up pulling tasks from a RQ with a low load. In short then; by using the idlest SG’s busiest RQ and pull tasks from there to the target RQ, we will in the long run siphon off all tasks from barely loaded CPUs and can then let them go idle as was

(25)

seen in Figure ?? [13]. It is however unclear as to how well this works, and due to the non-existent load weight documentation it is also, due to the tight time constraints of this thesis, difficult to design any improvements. We will attempt to measure SCHED_M C’s performance using LinSched. The results will be covered in Chapter 6.

The CONFIG_SCHED_SMT Setting

The CON F IG_SCHED_SM T setting is quite similar to the SCHED_M C, but where SCHED_M C concerns itself with tasks, the SCHED_SM T will attempt to consolidate hyperthreads onto as few CPUs as possible. The same conditions apply to its usage as for SCHED_M C. Namely that it has to be selected when the Kernel is built in order for it to be included. It appears just before SCHED_M C in the same section of the Kconf ig file. To quote the help print from Kconf ig: SMT scheduler support improves the CPU scheduler’s decision making when dealing with Intel Pentium 4 chips with HyperThreading at a cost of slightly increased overhead in some places [2] [13].

(26)

Figure 2.7: SCHED_SMT in use [13]

Figure ?? illustrates how a system with 16 CPUs might be utilised with the SCHED_SM T option turned off. Again we have an average utilisation of 10% at this stage. Then in figure ?? we have turned on the SCHED_SM T option and can now see - or should have been able to see, had the figure’s colours not been so poorly chosen - that almost all of the threads have been successfully consolidated to four of the CPUs. The other 12 are barely utilised. Using SCHED_SM T does not seem to have much of an impact on the scheduling code.Unlike SCHED_M C, and based on what we have been able to find in the Kernel code, it has the biggest impact on which SD flags are set and also plays a small role when choosing an ILB.

It is also in this case unclear how well the functionality performs and how to potentially improve on it due to lacking documentation. Both of these settings also seem to be exclusive for the X86 architectures, and since we intend to use ARM CPUs we will not be able to use either the SCHED_M C option or the SCHED_SM T option. They can, however be used to measure power-saving performance against, as they seem to be the only energy efficiency-related scheduling functionality of currently available.

(27)

2.2 Governors

Another thing available in the current Kernels is the so called CPUfreq subsystem. Since its introduction in the 2.6.0 Kernel it has made it possible to dynamically scale CPU frequencies. By making use of Governors to detect how much power a CPU actually needs and adjusting the frequencies to match, the subsystem can allow for increased power- saving with a negligible sacrifice in performance [12].

Like the SCHED_M C and SCHED_SM T settings, we have to include the subsystem at build-time to be able to use it. The choice will appear in the CPU Frequency scaling section of the Kconf ig. If we choose to include the CPUfreq subsystem, we should also specify which of the five available Governors we wish to include. It is possible to include all five and switch between them depending on your need after the Kernel has been built. These five governors are [12]:

• The Performance Governor, which is statically set to the highest available frequency. The only available tunable here is what the Governor perceives as the maximum.

• The Powersave Governor, which is statically set to the lowest available frequency. It is essentially the inverse of the Performance Governor. Using it is normally not a good idea, as it can actually lead to an increased power consumption. All tasks take longer to complete at lower frequencies, so while keeping the frequency at a minimum will seemingly save us power, it will actually just make the CPU work slower, and may in the long run dissipate more power than it saves.

• The Userspace Governor, which lets us manually select and set frequen-cies. It is a slightly trickier Governor to use, as it bases all its decisions on what we as users have told it to do. Hence the name. It does have a number of daemons - essentially pre-made sets of rules - to make using it easier however.Although looking at this from another angle, we can see that this is the most flexible and customisable governor available. Where the other governors only react to and work with the current load, the Userspace Governor can be programmed to react to environmental changes. If you pull your laptop’s wall plug, the Userspace Governor can detect this and switch to a more aggressive power-saving policy for instance. This type of functionality is more complex than we need [12]. • The Ondemand Governor, which looks at the CPU utilisation and

(28)

than a threshold value, the governor steps down the frequency until the frequency which corresponds best to the current utilisation is found. When the utilisation exceeds the threshold value, the governor sets the highest available frequency. Using this governor lets us define the range of available frequencies, the governor’s sampling rate, as well as the utilisation threshold. This governor would be a good choice for power-saving. It is both autonomous, as opposed to the Userspace Governor, and the fastest to react to changes. However, jumping straight to the highest frequency when the threshold value is exceeded could lead to unnecessary spikes in power dissipation [12].

• The Conservative Governor is quite similar to the Ondemand Governor. The main difference between the two is that the Conservative Governor does not jump straight to the highest frequency when its threshold value is exceeded, but rather steps the frequency up gradually. This gives us both a finer granularity and fewer power dissipation spikes, although it does have a slight adverse effect on performance in some cases. If we quickly go from being barely utilised to fully loaded, then the Conservative Governor will iterate through all frequency steps before it gets to the highest frequency, for instance. Whereas the Ondemand Govenror will reach the highest frequency in one leap and thus have less of an impact on performance. This governor would be another fine choice for our power-saving needs [12].

Based on attributes and descriptions, it would seem that our choice stands between the Ondemand and the Conservative Governor. It is impossible to say which one of the two strikes a better balance between power-saving and performance simply based on their descriptions however, so it would be most prudent to run extensive tests before making the ultimate decision. A governor would be recommended in either case, as it is a completely autonomous function, does not affect scheduling and gives us more a graunlar control over delivered and demanded frequency. Either way we look at it; using a governor will help us save power.

2.3 CPU Load Average

The CPU Load Average is a decaying moving average calculated in the Kernel to give us an idea of the average CPU Load over different intervals. From the shortest to the longest, we have a 1-minute, a 5-minute and a 15-minute average. They measure the trend in CPU utilisation, instead of taking snapshots like the CPU percentage, and include all demand for the CPU, not

(29)

only how much the CPU was active at the time of measurement. Their values should be taken in context of the number of CPUs in the system; meaning that a CPU Load Average of 1 is an ideal load for one CPU, where a Load Average of 2.7 is more than two, but less than three CPUs can handle.

The averages then illustrate how loaded the CPU has been over their respective intervals. Obviously the 1-minute average will be the most responsive and look quite spiky compared to the slower ones. Looking at this average, we should be able to pick out bursts in the CPU Load, although the spikes won’t be as pronounced as in a snapshot. The 5-minute average’s curve is already a lot smoother and a lot less sensitive to bursty loads. Where it may not be prudent to panic when the 1-minute average is showing an upward trend, an upward trend in the 5-minute average should be taken more seriously. As it works over a longer time, only an actual overall load increase will cause the 5-minute average to move upward. Similarly the 15-minute average’s pointing up is cause for concern. At least when we start approaching the upper limit of our system. The CPU Load has to increase quite steadily for the 15-minute average to react. As can be seen in Figure 2.8, the 15-minute average is quite steady over the whole measurement period, but the 1-minute average demonstrates quite dramatic fluctuations. The 5-minute average, albeit somewhat difficult to see in the figure, follows the 15-minute average more closely than its 1-minute counterpart. This can be interpreted as the system being quite capable of handling the loads it was subjected to over the measurement period. Had the 15-minute and 5-minute averages looked more like the 1-minute average, or if the longer averages had displayed plateaus at an arbitrary point greater than 1 over the measurement period, we could have concluded the average load was larger than one CPU in hte monitored system could handle.

In short then; the Linux scheduler does not schedule tasks in a way that saves power. Turning the power-saving functionality, SCHED_M C and SCHED_SM T , on will help the situation somewhat, but it is unclear to what degree it helps, it will affect the LB rather than RQ selection for new tasks, and the documentation is lacking. We will thus have to measure the performance empirically. Also, the CPU Load Average will not inherently help us save any power, but it could assist the Controlled PM by acting as a feedback value. We will cover how in more detail in Chapter 3.

(30)

(31)

CHAPTER

THREE

CONTROLLED POWER MANAGEMENT

As was previously mentioned, a student here at Åbo Akademi University has suggested a power management scheme based on control theory, and more specifically PID controllers. While the work in this thesis does not concern itself with any actual control theory, it will be an important part of the actual implementation. Hence, to be able to accurately explain the role this work will play, it is of some import to briefly cover how PID controllers work. We will gloss over quite a few of the details, so the control theory covered here will be very basic, simply explaining what a PID controller consists of and what the different parts do. We will not cover any of the mathematics involved in making it work.

3.1 Basic PID Control Theory

PID control is a concept we normally associate with machinery, cruise control, or similar concepts where there is something concrete and obvious to actually control. In a cruise controller, the PID makes sure the car keeps a certain speed, whether it is going up a hill or down, by controlling the throttle. If we use the cruise controller as an example here, then it essentially accomplishes its task by feeding the difference between the expected speed (input) and the current speed (output) into the PID, which then calculates how much we potentially have to correct our speed by to achieve the target speed.

(32)

Figure 3.1: A very simple example of a PID controller.

Figure 3.1 shows us a simplified block diagram of the aforementioned PID controller setup. The input block is assumed to contain all sensors and hardware necessary to produce an input value, i.e. the target speed, from which we then subtract the fed back value, i.e. the current speed, to get the error value. The error value is then sent to the PID. The resulting equation might look something like:

e = r − y (3.1)

where e (error) is the PID input value, r is the target value, and y is the output and feedback value. From the PID block we then get the sum of three calculations.

• Proportional calculation

Kpe(t) (3.2)

where Kp is the proportional gain, and e(t) is the error at time t. Increasing the Kp value will result in a larger error value and therefore a more drastic correction. Conversely, if Kp is small, we will see a less drastic correction to an error.

• Integral calculation

Ki Z t

0

e(τ )dτ (3.3)

where Ki is the integral gain, and e(τ ) the error integrated from time 0 to now. The integral will contain the sum of all previous instantaneous errors and will therefore help us reach the target value quicker, provided we have chosen a proper Ki value. If the value is too large, we will oscillate around the target value, and if it is too small, we will approach the target value very slowly.

• Derivative calculation

Kd d

dte(t) (3.4)

where Kd is the derivative gain, and e(t) will be differentiated with respect to time. This calculation will help us slow the output’s rate of

(33)

change. A large Kdmeans a slower change, where a smaller value means a quicker change. As opposed to the proportional calculation, which lets the controller react quicker or slower, the derivative calculation dampens overshoot and oscillation around the target value.

The sum of these three calculations is then fed into the device we wish to control. Using the same example as earlier, this would be a cruise controller. The cruise controller then uses the newly acquired value to adjust the throttle and therefore the current speed, and the resulting speed is again fed back and lets us calculate a new error to use as an input.

3.2 Controlling a Web Cluster using a PID

To paraphrase [19]: a system level controller can make a cluster of low-power servers power proportional. The controller will use sleep states to switch CPUs on or off in order to continuously match the current workload with the system’s capacity. It uses methods from control theory to drive the CPUs from and into sleep states. Results from system simulation show that this approach can achieve a up to 80% reduction in energy consumption compared to a cluster using only DVFS.

In the previous section we used a cruise controller as an example to explain a PID controller. While applying the same concept to a web cluster’s power manager may seem different at first, it is actually fairly similar. We use the same formula - eq. 3.1 - to calculate our input, output and error values, they just refer to different things. Here we will use the system’s work load - in terms of number of requests per second - as our target value. System capacity is measured in maximum number of requests per second the system can handle, which, when we know the number of CPUs in the system, translates into the number of CPUs we need to meet the demand, or target value if you will. This is also our feedback value. The difference between the work load and number of active CPUs is then used to calculate the error value [19].

This scheme will not work if no one takes responsibility for turning CPUs on and off, which is why one core will be kept statically active. The static core will also be used to handle the low number of requests trickling in when the system is otherwise almost idle between peak hours. Currently, as the system in our reference was built around ARM Cortex-A8 CPUs, once a CPU is activated by the power manager (PM), it will run at the highest possible frequency. This is not, from a power consumption point of view, ideal, even though the ARM CPUs are designed to consume very little power [19][20].

(34)

Also, because the only scheduler currently available is the CFS, which by design uses all resources equally, we run into trouble whenever the workload drops. While the workload increases the PID will keep turning on CPUs to meet the demand, so they will all be fully loaded per definition. When the demand starts dropping off, however, CFS will start spreading tasks out over all the available CPUs. This in turn will mean the PID will be unable to switch the CPUs off. Therefore, as the PID is unable to influence scheduling, we will need a scheduler capable of reliably consolidating tasks on as few CPUs as possible, ideally with a negligible impact on performance. It is currently unclear as to whether or not CFS, even with _M C or _SM T , is capable of this. We will thus change the behaviour of the CFS to fill CPUs sequentially and its LB to always keep as many CPUs as possible fully loaded. These changes will then be compared against the original CFS both with and without its power-saving options turned on. This is how the work in this thesis relates to the PID Controlled Power Manager. We have been looking at ways to schedule tasks in a way that is beneficial both for the energy efficiency and performance.

(35)

CHAPTER

FOUR

PROPOSED SCHEDULER CHANGES

In Chapter 3 we concluded an ideal scheduler for the PID PM would fill CPUs with tasks sequentially and its LB would ensure that we use as few CPUs as possible. When we herein say "sequentially" we mean; scheduling tasks onto one CPU until it has reached its maximum capacity. CFS does not currently do this - see Chapter 2. Even with its supposed power-saving options turned on, it will first schedule tasks onto all available CPUs and then during LB attempt to consolidate tasks. This is based on an impression gotten from reading the Kernel code however, so it will be tested in LinSched. LinSched as a tool is covered in Chapter 5.1.

In this chapter we will explain the changes made to the CFS’ code as an attempt to improve its power-saving capabilities. The scheduler we get from using these changes will henceforth be referred to as the Overlord Guided Scheduler (OGS). The name is a play at the fact that the PID, our Overlord, might decide to turn almost everything off at any time. Since we will only change the selection algorithms we will not have to concern ourselves with any locking mechanisms, which is quite fortunate since LinSched does not help verify locking. Still, to be on the safe side we will verify that changing the selection of RQs does not somehow circumvent the current locking mechanisms will. There was, however, no time to do this during this thesis project and it will thus be recommended as Further Work.

The changes suggested below, disregarding the proposed LB changes, have been spliced into the original CFS code at the appropriate places. The OGS code has been put behind #if def declarations, and will thus only be included

(36)

and used when expressly chosen at the Kernel build stage. If we do not select OGS, then the CFS scheduler is used in its original state. Building a Kernel with the tweaked code and using only the default build options will result in the Kernel using CFS. We have attempted to splice the proposed scheduler changes into the existing scheduler instead of writing it from scratch because of the time constraints imposed on this thesis project. Had we decided to write one from scratch, we would, for instance, have had to understand and implement locking, which in itself would have easily used more than the amount of time we had available.

4.1 Runqueue Selection

Every new task has to be scheduled on a RQ on some CPU. For CFS this is always the idlest RQ, as was explained in Chapter 2.1.2. To make good use of the PID PM we want to always schedule tasks on the busiest non-overloaded RQ.

Using LinSched and a number of well placed text prints, as well as by looking at the Kernel code we identified select_task_rq() as the function making all the important decisions regarding RQ selection. This function then makes a task class-specific call to another function which makes the ultimate selection and returns it. We have not bothered with RT tasks due to the time constraints of this project, and thus the selection function we are interested in is select_task_rq_f air(). This function is called whenever a RQ is to be selected for a task falling under the influence of the CFS and we then proceed to make a RQ selection as was specified in Chapter 2.1.2. Looking at this function we can come to the conclusion that in order to change the RQ returned by this function, we need to change what f ind_idlest_group() and f ind_idlest_cpu() decide to return.

4.1.1 find_idlest_group() vs find_busy_group()

The current implementation of this function will, as the name implies, return the idlest SG from a given SD based on SGs’ average weighted load. The code below is copied from f ind_idlest_group() in the 2.6.35 version of the Linux Kernel. It is the part of that function upon which all decisions are based. The lines have been numbered after copying to make referencing it easier.

1 /* Tally up the load of all CPUs in the group */ 2 avg_load = 0;

(37)

3

4 for_each_cpu(i, sched_group_cpus(group)) {

5 _{/* Bias balancing toward cpus of our domain */} 6 if (local_group)

7 load = source_load(i, load_idx); 8 else

9 load = target_load(i, load_idx); 10

11 avg_load += load; 12 }

13

14 /* Adjust by relative CPU power of the group */

15 avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power; 16

17 if (local_group) {

18 this_load = avg_load; 19 this = group;

20 } else if (avg_load < min_load) { 21 min_load = avg_load;

22 idlest = group; 23 }

It is easy to see that the for-loop starting on Line4 iterates through all members of the SG and tallies the total load. What is unclear is how the perceived load changes depending on whether the group is local or not - Lines6 − 9. Based on the code comments we do know it is supposed to bias the selection towards a local group. Looking at the definitions of source_load() and target_load() in sched.cwe can see that the bias lies in the size of the weighted_cpuload() that is added to avg_load on Line11. This again takes us back to the problem of load weights and their lacking documentation. Sorting out the exact way the load weights work and are assigned was, as we mentioned earlier, deemed to exceed the scope of this project. Although the general impression is that assigning weights to tasks may not result in a, for our purposes, ideal task distribution. This will affect our ability to detect the current load as well, what with the current upper limit also being based on load weights. The PID will need a feedback value telling us whether or not we are likely to need to turn on more CPUs, therefore we have another problem.

One solution to these problems would be to make these decisions based on how idle a CPU has been since we last scheduled a task on it. This is essentially already done by the Governors - covered in Chapter 2.2 - as they look at how idle a CPU is and adjusts the CPU’s frequency accordingly. As all the functionality is already in place, it would then be a short step to use

(38)

the same information when making decisions regarding RQ selection. For the Governors to work, there needs to be a structure storing all the different contributions to the CPUs utilisation. This structure is already defined in include_linux_kernel_stat.h, so there is no need to define another. The update functions for the CPUs’ busy times are also already in place. Using the information gathered on behalf of the Governors to make our RQ selections is then a trivial matter.

25 _{static struct sched_group *}

26 _{find_busy_group(struct sched_domain *sd, struct} task_struct *p, int this_cpu, cputime64_t time_now)

27 {

28 _{struct sched_group *busy_group = NULL,} *group = sd->groups; 29 // To keep track of the most loaded group 30 cputime64_t min_idle = time_now;

31 cputime64_t idle_time; // Time spent idle

32 cputime64_t SAFETY_MARGIN = 100000; //Arbitrary number 33

34 do {

35 int i; 36

37 _{/* Skip groups with no CPUs allowed */}

38 if (!cpumask_intersects(sched_group_cpus(group), &p->cpus_allowed))

39 continue; 40

41 _{/* Reset idle_time and safety for each group */} 42 idle_time = 0; 43 safety = 0; 44 45 for_each_cpu(i, sched_group_cpus(group)) 46 { 47 idle_time = cputime64_add(idle_time, what_is_idle_time(i, time_now)); 48 safety += SAFETY_MARGIN; 49 } 50

51 if( (idle_time < min_idle) && (idle_time > safety) )

52 {

53 min_idle = idle_time; 54 busy_group = group;

(39)

55 } 56

57 }while(group = group->next, group != sd->groups); 58 59 if(!busy_group) 60 { 61 return NULL; 62 } 63 64 return busy_group; 65 }

Above we can see the code for the suggested replacement function without most of the comments, but otherwise in its entirety. When the new function is called, it takes essentially the same inputs as the old one, except we need to get the current time instead of load_idx. Another difference is that currently we do not get a specific SD to look in, as we have not yet changed the flags to be set appropriately for our purposes. Due to this we iterate through the available SDs looking for a suitable SG. To find a suitable SG we use the for-loop on Line45, which sums up the total idle time in the group, as well as an, at this point, arbitrary safety margin. The function what_is_idle_time() was written specifically for the new RQ selection, and it’s functionality will be covered later in this chapter. When all the members of the SG have been accounted for, we move on to the conditional on Line51, where we check if the current SG is more suitable than the SG currently perceived as the best candidate. An SG is perceived as a better candidate if:

• If the current SG’s idle time is smaller than the current best. During the first complete iteration, the current best idle time will be equal to the current system uptime. It is not possible for the SGs to have been idle longer than the system has been running. There also has to be at least one SG which has not been completely idle, since the OS has to run somewhere.

• If the idle time is greater than one safety margin per SG member, then the SG is deemed to not be overloaded. The safety margin is used because we currently have no idea about what the new task we want to schedule requires in terms of CPU time.

When this is done for all SDs and SGs available with the above conditions, we will end up with the SG which best matches the conditions at the time the function is called. Then, provided we have found a suitable

(40)

candidate SG, we return it to select_task_rq_f air() and used as an input in f ind_busy_sched_cpu().

4.1.2 find_idlest_cpu() vs find_busy_sched_cpu()

The current implementation of this function suffers from the same prob-lems as f ind_idlest_group(). As can be seen on Line72 below, it is based around weighted_cpuload(). Using the SG reported as the idlest by f ind_idlest_group(), we iterate through the group members looking for the CPU with the smallest load. The found CPU is then returned to select_task_rq_f air() and will be selected as the target RQ for the new task. This, again, is not in line with our aims. We would like to select the RQ with the highest load available, while not overloading its CPU.

70 /* Traverse only the allowed CPUs */

71 for_each_cpu_and(i, sched_group_cpus(group), &p->cpus_allowed) {

72 load = weighted_cpuload(i); 73

74 if (load < min_load || (load == min_load && i == this_cpu)) { 75 min_load = load;

76 idlest = i; 77 }

78 }

79 return idlest;

Below we have the suggested replacement function, which would work together with f ind_busy_group() in finding a suitable RQ to schedule on, in its entirety. Again without most of the comments.

The replacement function here is quite similar to the original in that it does the same kind of iteration - starting on Line94 - through all the CPUs in the group. The main difference here is that, again, we look at how idle the CPU has been, instead of its weighted_cpuload(). Then, on Line99 we get the RQ struct for the current CPU, which contains information about the number of tasks running on this CPU. We do this because we will, on Line104, calculate how much execution time the currently running tasks require on average. This average will then be used as an estimate of how much execution time the new task is likely to require. Estimating task requirements using an average is not an ideal solution, as it is quite likely to over- or underestimate the amount of time the new task will require. Although, since we currently have no way of knowing

(41)

anything about new tasks, this solution will at least give us an estimate of how much busier another task might make the CPU.

85 static int

86 _{find_busy_sched_cpu(struct sched_group *group,}

struct task_struct *p, int this_cpu, cputime64_t time_now) 87 { 88 int busy_sched = -1; 89 int i; 90 cputime64_t idle_time = 0; 91 cputime64_t avg_busy_time_per_task; 92 cputime64_t min_idle = time_now; 93

94 for_each_cpu(i, sched_group_cpus(group)) 95 {

96 _{/* Get the time cpu(i) has spent idle */} 97 idle_time = what_is_idle_time(i, time_now); 98

99 _{struct rq *arr_qu = cpu_rq(i);} 100 101 if (arr_qu->nr_running < 1) 102 avg_busy_time_per_task = 0; 103 else 104 avg_busy_time_per_task = cputime64_sub(time_now, idle_time) /(arr_qu->nr_running); 105

106 if ( (idle_time < min_idle) && (idle_time > (SAFETY_MARGIN+avg_busy_time_per_task)) )

107 {

108 min_idle = idle_time; //refresh the reference value 109 busy_sched = i; //record which cpu this is

110 }

111 } 112

113 return busy_sched; 114 }

Before calculating said average - on Line102 - we make sure we have at least one task running before calculating the average. If no tasks are running on the CPU, then the average busy time per task is set to naught. Otherwise we

(42)

calculate the average normally. This is to avoid a potential division by zero. Once we have an average, we compare the idle time against the current best. We conclude we have found a new target CPU if:

• The current idle time is lower than the current best. The initial best value is, again, the current system uptime.

• The current idle time is higher than the sum of the safety margin and the average busy time per task. This is again a safety mechanism, which is used due to the uncertainty around the new task’s requirements.

Once we have found a suitable target CPU, we return it to select_task_rq_f air() and schedule the new task on it. If this function returns a null value, then we obviously need an alternative target CPU. This goes for the default CFS as well. Fortunately select_task_rq() - i.e. the function calling the task-specific RQ selection function from sched.c - already has a function taking care of this in select_f allback_rq(). This function has currently been left as is. Suffice to say it makes sure we find somewhere to schedule our task, even if our selection method found all CPUs to be unsuitable. At a later stage this function should also be looked at in order to find out whether or not it has to be changed. The accuracy of our selection procedure will have to be evaluated, as it is quite heavily dependent on the average busy time calculation. If a CPU has a few tasks taking up a large portion of the available execution time, then the consequently high average may cause us to choose another CPU, even if the aforementioned CPU would be capable of handling a few low load tasks without being overloaded. It is possible that this problem is best solved during LB, and a possible, although not yet investigated solution will be mentioned in Section 4.2.

4.1.3 what_is_idle_time()

This function was put in sched_f air.c for the express purpose of getting and reporting a CPU’s idle time to OGS’ RQ selection functions. It is essentially identical to the equivalent function - get_cpu_idle_time_jif f y() - used by, e.g., the Ondemand Governor. The only differences being that the current system uptime here, time_now, is not a pointer and that we, on Line132, make sure we have no negative idle time values. It should not be possible to have a busy time that is longer than the system uptime, but due to some of the workarounds used in LinSched, detailed in Section 5.1.3, it did happen occasionally. The condition has been left in as a safety mechanism.

(43)

What this function then needs to work properly is a CPU, and the current system uptime. Which CPU we’ll be looking at is normally given by an unsigned int, which is basically the CPU’s ID number as far as the Kernel is concerned. The current system uptime is here stored in the variable time_now. Its value is gotten from a check in select_task_rq_f air() and given as an input to the f ind_busy_∗ functions, which pass it on to this function. We are currently doing it this way as it keeps both the number of time checks down, and time_now constant for the duration of the RQ selection process. There is a small chance that changing time_now mid-selection will warp the result, hence its being constant while we are looking for a RQ.

Lines127 − 130 are the ones where we actually add the busy times. Judging from the structure in kernel_stat.h, the Kernel distinguishes between five different types of business, and these types are stored as elements in said structure. The total business is then the sum of the five business types, with the idle time idling being the difference between this sum i.e. busy_time -and time_now. The difference is then returned to the calling function.

120 static inline cputime64_t

what_is_idle_time(unsigned int cpu,

cputime64_t time_now) { 121

122 _{/* Variables for tallying idle time */} 123 cputime64_t busy_time; // Time spent busy 124 cputime64_t idling; // Temp variable

125

126 _{/* Tallying the idle time */}

127 busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user, kstat_cpu(cpu).cpustat.system); 128 busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq); 129 busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal); 130 busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice); 131 132 if (unlikely(busy_time > time_now)) 133 idling = 0; 134 else

135 idling = cputime64_sub(time_now, busy_time); 136

137

(44)

139 }

It is possible we will not even need this function in a full Kernel adaptation, however the experimentation herein was done using LinSched, which only uses the parts necessary to emulate the normal scheduling subsystem, and no functionality from the CP U f req subsystem was therefore available.

To reiterate then; the changes mentioned above, called OGS, will move us away from selecting the idlest RQ and to selecting the busiest RQ with some remaining capacity. The capacity is determined by summing up the different types of business stored in the cpu_usage_stat struct in kernel_stat.h. We currently go through all the SDs to find the SG with the most suitable idle time, and then move on to find the CPU with the most suitable idle time in the previously detected SG. It is our feeling that making RQ selection decisions based on how idle a CPU has been may lead to a load distribution that is closer to the, for our purposes, ideal distribution compared to CFS’ scheme with weighted tasks. Said ideal distribution would have the tasks scheduled on as few CPUs as possible. It is also a lot easier to detect when a CPU is about to become fully loaded when using OGS, which is extremely important in order for the PID controller to be able to work properly, as we can simply look at how close to naught the idle time is. Neither do we need to store any information relating specifically to the capacity of the available CPUs, since, again, the information is already available in the idle time and its approaching naught. We therefore only need the current system uptime, the number of tasks belonging to the RQ, and how busy the CPU has been in order to make our RQ selection. All of these are already available in the kernel. The system uptime is stored in a variable called jif f ies, the number of tasks in a RQ is an externally visible scheduling statistic, while the Governors keep track of a CPU’s busy time.

Figure 4.1 above illustrates how CFS and the weighted CPU load is assumed to affect the perceived business of a CPU compared to OGS and idle time. Each L-shaped block represents one task and its execution time requirements, while the space it takes up in the figure is equivalent to how the scheduler perceives it. Both schedulers have one CPU at their disposal in this illustration. Our assumption then is that the load weights may cause the CFS RQ selection to perceive each task as requiring an execution time equivalent to a 2-by-3 block instead of its actual requirement. Due to this, it might see a CPU as being fully loaded after 16 tasks have been scheduled on the CPU in question. The OGS on the other hand, is capable of detecting how much time the scheduled tasks have actually been using and is therefore able to schedule tasks in such a way that we use our CPUs more efficiently in terms of execution time. Again; this is

(45)

Figure 4.1: Illustration of how the weighted CPU load is assumed to affect perceived business

an assumption about the CFS and our aim for the OGS, and we will therefore attempt to validate said assumption and aim empirically.

The changes detailed in this section (4.1) only affect the RQ selection process. The function charged with finding a RQ for new tasks is the only one we have changed. Therefore, as CFS should work quite well for our purposes when using the OGS RQ selection, we use the original CFS code for, e.g., locking, and actually giving tasks CPU time.

4.2 Load Balancing

As was explained in Section 2.1.3, the CFS LB will by default attempt to achieve an even task distribution. This means the LB will always pull tasks from the busiest to this RQ. While CFS does have the SCHED_M C and SCHED_SM T options for pulling tasks from almost idle RQs, they only seem to work in cases where we have a as-good-as idle RQ, and pulling tasks from it would not upset the otherwise even task distribution, whereas we would like to minimise the number of CPUs in use by keeping the load as high as possible without overloading. We will attempt to show this in Chapter 6. Although we do have a relatively clear idea of what we would like to implement, we currently only have a superficial plan for affecting this change. The herein proposed changes have therefore not been tested.

(46)

The CFS LB works according to our needs as far as Step 5, by the reckoning used in Chapter 2.1.3. When we get as far as Step 5, we start looking for source and target RQs. We will refer to the RQ from which we pull a task as the source RQ, and to the RQ performing the pull as the target RQ here. At this stage CFS finds the busiest queue - f ind_busiest_queue() - in the busiest group - f ind_busiest_group(). When finding the group, CFS uses a struct called sd_lb_stats. This struct is updated every time the group-finding function is called, and contains information relevant to LB at this level. The elements are used at various points by different functions to achieve a balanced load.

4.2.1 find_busiest_group()

The elements used by f ind_busiest_group() to decide if we are imbalanced are: • sds.this_load (struct sd_lb_stats sds) - Contains the perceived load of the

we are a member of.

• sds.busiest - The currently busiest group

• sds.busiest_nr_running - The number of tasks running in the busiest group

• sds.max_load - The highest group load in this SD • sds.avg_load - The average load in this SD

• balance - An integer value, used as a boolean, which indicates whether this is the appropriate CPU to perform LB at this SD level or not.

We previously - in Chapter 2.1.3 - mentioned that we check a number of cases which tell us if there is no imbalance from this CPU’s point of view if. These cases are:

• #1 This CPU is not the appropriate CPU to perform LB at this level. We check the value in balance and if it is 0 this CPU is not appropriate and we jump to ret, otherwise we move on.

• #2 There is no busy sibling group to pull from. Here we look at the busiest and busiest_nr_running elements in sds. If we either have no busiest group, or if there are no tasks running, i.e. busiest_nr_running == 0, we jump to out_balanced. In all other cases we move on.

POWER EFFICIENT SCHEDULING FOR

P

OWER

E

FFICIENT

SCHEDULING FOR

A

C

LOUD

S

YSTEM

Joachim Sjöblom

ABSTRACT

ABBREVIATIONS

GLOSSARY

LIST OF FIGURES

CONTENTS

CHAPTER

ONE

INTRODUCTION

1.1

Cloud Software project

1.2

Purpose of this thesis

1.3

Thesis Structure

CHAPTER

TWO

SCHEDULING AND POWER MANAGEMENT

IN THE CURRENT LINUX KERNEL

2.1

The Completely Fair Scheduler

2.1.1

CFS Runqueue

2.1.2

Runqueue Selection

2.1.3

Load Balancing

2.2

Governors

2.3

CPU Load Average

CHAPTER

THREE

CONTROLLED POWER MANAGEMENT

3.1

Basic PID Control Theory

3.2

Controlling a Web Cluster using a PID

CHAPTER

FOUR

PROPOSED SCHEDULER CHANGES

4.1

Runqueue Selection

4.1.1

find_idlest_group() vs find_busy_group()

4.1.2

find_idlest_cpu() vs find_busy_sched_cpu()

4.1.3

what_is_idle_time()

4.2

Load Balancing

4.2.1

find_busiest_group()