The Gate of the AOSP #4: Gerrit, Memory & Performance
Process Scheduling in Linux
2013. 3. 29
Outline
1 Process scheduling
2 SMP scheduling
Process scheduling
• scheduler basics
• O(1) scheduler
• CFS
Terminology
• UTS
• Unix Time-sharing System
• task
• process or thread
• runqueue (rq)
• per-cpu list containing runnable tasks
• latency
• time delay between stimulus and response
• throughput
Scheduler basics
• scheduler
• CPU resource manager • central part of kernel • controls time slices • schedules tasks (processes)
• decision maker
• when? who? how long? • latency/throughput • fairness
Task types
• interactive task
• I/O bound (ex. editor) • sleeps most of the time • wants minimal latency
• batch task
• CPU bound (ex. compiler) • wants maximal throughput
Task state
Context switch
• happens when a task is scheduled • voluntary (sleep) • involuntary (preemption)
• save/restore task information • hw registers
• memory address space
• overhead
• cache eviction • TLB flush
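Both kinds of context switch can be observed from user space. The following is a minimal sketch, assuming a Linux (or other Unix) system where Python's `resource` module reports the `ru_nvcsw` (voluntary) and `ru_nivcsw` (involuntary) counters; the helper name `context_switches` is made up for illustration.

```python
import resource
import time


def context_switches():
    """Return (voluntary, involuntary) context-switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


before = context_switches()

# Sleeping gives up the CPU, which is normally counted as a voluntary switch.
for _ in range(5):
    time.sleep(0.01)

after = context_switches()
print("voluntary:  ", after[0] - before[0])
print("involuntary:", after[1] - before[1])
```

The exact numbers depend on the kernel and on what else is running, so no particular output should be expected.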
Preemption
• schedule a (running) task (at any time?)
• configurable at compile time
• CONFIG_PREEMPT{,_NONE,_VOLUNTARY}
Kernel preemption points
• return to user (syscall, irq, exception) • PREEMPT_NONE
• might_sleep()
• PREEMPT_VOLUNTARY
• return from irq (preempt_count == 0)
• preempt_enable()/spin_unlock() • PREEMPT
Time slice
• round-robin time sharing system (UTS)
• a time unit allowed for a task at a given time
• can be affected by timer freq. (HZ)
• hard to optimize since
• it should be small for less latency • it should be large for better throughput
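The trade-off can be made concrete with a toy round-robin simulation. This is only a sketch under simple assumptions: all tasks are CPU bound, time is in arbitrary units, and every switch between two different tasks costs a fixed `switch_cost`. The function `round_robin` and its parameters are illustrative, not kernel code.

```python
from collections import deque


def round_robin(bursts, time_slice, switch_cost=1):
    """Simulate round-robin scheduling.

    bursts: CPU time each task needs (arbitrary time units).
    Returns (average wait before a task first runs, total elapsed time).
    """
    queue = deque(enumerate(bursts))
    first_run = {}
    clock = 0
    while queue:
        task, remaining = queue.popleft()
        first_run.setdefault(task, clock)
        run = min(time_slice, remaining)
        clock += run
        if remaining > run:
            queue.append((task, remaining - run))
        # Overhead is paid only when the next task differs from this one.
        if queue and queue[0][0] != task:
            clock += switch_cost
    avg_wait = sum(first_run.values()) / len(bursts)
    return avg_wait, clock


for time_slice in (1, 10, 100):
    wait, total = round_robin([100, 100, 100], time_slice)
    print(f"slice={time_slice:3d}  avg first-run wait={wait:6.1f}  total time={total}")
```

A small slice lets every task start early (low latency) but spends more time switching; a large slice does the opposite.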
POSIX scheduling policy
• SCHED_FIFO • SCHED_RR • SCHED_OTHER • Linux-specific • SCHED_NORMAL • SCHED_BATCH • SCHED_IDLE
Linux scheduling class
• RT class • SCHED_FIFO • SCHED_RR • FAIR class • SCHED_NORMAL • SCHED_BATCH • SCHED_IDLE
Scheduling priority
• Root of Evil(tm) • but we need it anyway
• for less latency and higher throughput
Changing priority
$ chrt -m
SCHED_OTHER min/max priority: 0/0
SCHED_FIFO min/max priority: 1/99
SCHED_RR min/max priority: 1/99
SCHED_BATCH min/max priority: 0/0
SCHED_IDLE min/max priority: 0/0
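The same ranges can be queried programmatically. A minimal sketch, assuming a platform where Python's `os` module exposes the `SCHED_*` constants and `os.sched_get_priority_min()`/`os.sched_get_priority_max()` (this is the case on Linux); the helper `priority_ranges` is a made-up name.

```python
import os

# Policies exposed by Python's os module on Linux (availability varies by platform).
POLICY_NAMES = ["SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE"]


def priority_ranges():
    """Return {policy name: (min, max)} for each policy this platform exposes."""
    ranges = {}
    for name in POLICY_NAMES:
        policy = getattr(os, name, None)
        if policy is None:
            continue  # policy constant not available here
        ranges[name] = (
            os.sched_get_priority_min(policy),
            os.sched_get_priority_max(policy),
        )
    return ranges


for name, (low, high) in priority_ranges().items():
    print(f"{name} min/max priority: {low}/{high}")
```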
Ideal CPU model
• share cpu resources among all tasks • run tasks simultaneously • each task owns its share • currently impossible
O(1) Scheduler
• used for RT tasks
• used for normal tasks (prior to 2.6.23)
• simple and fast algorithm
• fixed-size bitmap and arrays • statically assigned time slice
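The bitmap-plus-array idea can be sketched as follows. This is an illustration, not the kernel implementation: the class `PriorityArray` is hypothetical, 140 priority levels are assumed (0 being the highest), and a Python integer stands in for the fixed-size bitmap that the kernel scans with a find-first-bit instruction.

```python
from collections import deque

NUM_PRIOS = 140  # assumed number of priority levels (0 = highest)


class PriorityArray:
    """One queue per priority plus a bitmap of non-empty queues."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIOS)]
        self.bitmap = 0  # bit p is set when queues[p] is non-empty

    def enqueue(self, task, prio):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        """Remove and return the first task of the best non-empty priority."""
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit; its index is the best priority.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)
        return task


rq = PriorityArray()
rq.enqueue("compiler", 120)
rq.enqueue("audio", 10)
rq.enqueue("editor", 120)
print([rq.pick_next() for _ in range(4)])  # ['audio', 'compiler', 'editor', None]
```

Because the number of priorities is fixed, finding the next task costs the same no matter how many tasks are queued.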
Completely-Fair Scheduler
• used for normal tasks • treats all tasks fairly
• unless they have different priorities • vruntime accounts for the priority
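How vruntime folds priority into fairness can be shown with a small simulation. This is a sketch under stated assumptions: a binary heap stands in for the kernel's red-black tree, each task runs for a fixed slice, and a default-priority task is given weight 1024. The function `simulate_cfs` and the task names are hypothetical.

```python
import heapq

NICE_0_WEIGHT = 1024  # weight of a default-priority task


def simulate_cfs(weights, slice_ns=1_000_000, steps=3000):
    """Run the task with the smallest vruntime for one slice, `steps` times.

    weights: {task name: load weight}. Returns {task name: real runtime in ns}.
    """
    # Heap entries are (vruntime, name); every task starts at vruntime 0.
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    runtime = dict.fromkeys(weights, 0)
    for _ in range(steps):
        vruntime, name = heapq.heappop(heap)
        runtime[name] += slice_ns
        # Heavier tasks accumulate vruntime more slowly, so they run more often.
        vruntime += slice_ns * NICE_0_WEIGHT / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return runtime


result = simulate_cfs({"normal": 1024, "favored": 2048})
total = sum(result.values())
for name, ns in result.items():
    print(f"{name}: {100 * ns / total:.1f}% of CPU time")
```

With twice the weight, the "favored" task ends up with twice the real runtime while both vruntimes advance at the same pace.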
SMP scheduling
• cpu load tracking
• scheduler domain
CPU affinity
• set of cpus allowed to run a given task • all cpus are allowed by default
• must be considered when scheduling
• each task remembers the cpu it is running on
Setting cpu affinity
# taskset 0xff <command>
# taskset -c 0-7 <command>
# taskset -p [mask] <pid>
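The same can be done from a program. A minimal sketch, assuming Linux, where Python provides `os.sched_getaffinity()` and `os.sched_setaffinity()`; the helper `mask_to_cpus` is a made-up name that decodes a taskset-style bitmask.

```python
import os


def mask_to_cpus(mask):
    """Convert a taskset-style bitmask (e.g. 0xff) into a set of cpu numbers."""
    return {cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)}


print(sorted(mask_to_cpus(0xFF)))  # cpus 0 through 7

# os.sched_getaffinity/os.sched_setaffinity are only available on some
# platforms (notably Linux); pid 0 means "the calling process".
if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)
    print("currently allowed cpus:", sorted(allowed))
    # Restrict this process to a single cpu it is already allowed to use.
    os.sched_setaffinity(0, {min(allowed)})
    print("after pinning:", sorted(os.sched_getaffinity(0)))
    os.sched_setaffinity(0, allowed)  # restore the original mask
```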
CPU load tracking
• CPU load = number of tasks running
• TASK_RUNNING + TASK_UNINTERRUPTIBLE
• global load average
• system load information (1, 5, 15 min)
• cpu load average • for load balancing
Global load average
$ uptime
18:19:26 up 1 day, 7:51, 2 users, load average: 0.02, 0.03, 0.05
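The three numbers are also available to programs. A minimal sketch using Python's `os.getloadavg()`; the wrapper `load_averages` is a made-up helper that returns `None` where the values cannot be obtained.

```python
import os


def load_averages():
    """Return the (1, 5, 15 minute) load averages, or None if unavailable."""
    try:
        return os.getloadavg()
    except (AttributeError, OSError):
        return None


averages = load_averages()
if averages is not None:
    one, five, fifteen = averages
    print(f"load average: {one:.2f}, {five:.2f}, {fifteen:.2f}")
```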
Moving average
• calc average of continuous data stream • smooth out short-term fluctuations • highlight longer-term trends
• kernel uses EMA for cpu load tracking • past term decreases exponentially • this process is called “decay”
• http://en.wikipedia.org/wiki/Moving_average
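An exponential moving average fits in a few lines. This is a generic sketch, not the kernel's fixed-point code: the function `ema` and the smoothing factor `alpha` are illustrative choices.

```python
def ema(samples, alpha):
    """Exponential moving average.

    alpha in (0, 1]: weight of the newest sample; the accumulated past is
    scaled by (1 - alpha) at every step, which is the "decay".
    """
    average = samples[0]
    history = [average]
    for sample in samples[1:]:
        average = alpha * sample + (1 - alpha) * average
        history.append(average)
    return history


# A short burst of load (10 runnable tasks) in an otherwise idle stream.
stream = [0, 0, 10, 10, 0, 0, 0, 0]
print([round(value, 2) for value in ema(stream, alpha=0.5)])
```

The burst raises the average gradually, and once it ends its contribution is halved at every step instead of vanishing at once.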
Scheduler domain
• abstraction layer of hardware topology • a domain consists of groups
• a cpu resides in multiple hierarchies • a domain is a group in a higher domain
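The domain/group relationship can be modeled with a small data structure. This is a simplified sketch: the class `SchedDomain`, the domain names, and the 2-core, 2-thread topology are all hypothetical, and the kernel actually keeps a separate domain hierarchy per cpu.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class SchedDomain:
    name: str
    groups: List[FrozenSet[int]]          # each group is a set of cpus
    parent: Optional["SchedDomain"] = None

    @property
    def span(self):
        """All cpus covered by this domain (the union of its groups)."""
        return frozenset().union(*self.groups)


# Hypothetical machine: 2 cores x 2 hardware threads, cpus 0-3.
package = SchedDomain("package", [frozenset({0, 1}), frozenset({2, 3})])
core0 = SchedDomain("core0", [frozenset({0}), frozenset({1})], parent=package)
core1 = SchedDomain("core1", [frozenset({2}), frozenset({3})], parent=package)

# The span of a lower-level domain appears as one group of its parent.
for core in (core0, core1):
    print(core.name, sorted(core.span), core.span in core.parent.groups)
```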
Load balancing
• migrate task according to cpu’s load • spread or pack
• chances to balance • fork & exec • wake up • idle • periodic
Load balancing strategy
• fork & exec
• spread to the idlest cpu
• wake up
• keep prev cpu or migrate to current cpu • or its idle sibling
• idle
• migrate to current cpu
• periodic
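The fork/exec and wake-up choices can be sketched as two small selection functions. This is a heavily simplified illustration of the strategy listed above, not the kernel's logic: load is just a task count per cpu, and the function names and the `siblings` map are hypothetical.

```python
def select_cpu_fork(loads, allowed):
    """fork & exec: pick the idlest cpu among those the task may run on."""
    return min(allowed, key=lambda cpu: (loads[cpu], cpu))


def select_cpu_wakeup(loads, prev_cpu, this_cpu, siblings):
    """wake up: prefer prev cpu, the waking cpu, or an idle sibling of either.

    siblings: {cpu: set of cpus sharing a cache with it} (hypothetical input).
    """
    for cpu in (prev_cpu, this_cpu):
        if loads[cpu] == 0:
            return cpu
    for cpu in (prev_cpu, this_cpu):
        for sibling in sorted(siblings.get(cpu, ())):
            if loads[sibling] == 0:
                return sibling
    return prev_cpu  # nothing idle nearby: stay where the cache is warm


loads = {0: 2, 1: 0, 2: 1, 3: 3}
print(select_cpu_fork(loads, allowed={0, 2, 3}))  # 2
print(select_cpu_wakeup(loads, prev_cpu=0, this_cpu=3, siblings={0: {1}, 3: {2}}))  # 1
```

Note that the fork case respects the task's cpu affinity by only considering `allowed` cpus.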
Scheduler domain fields
• for fine-tuning scheduler • SD_BALANCE_* flags • various indexes
• balance interval • imbalance percent
(H)MP scheduling
• started with per-entity load tracking patchset • calculate each sched entity’s load
• based on runnable time
• Linaro is trying to upstream the code for ARM big.LITTLE
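Load based on runnable time can be sketched as a decaying sum over fixed periods. The constants here are assumptions taken from published descriptions of the patchset (1024 µs periods, a contribution halving after 32 periods) and the function `update_load` is a made-up name; the real code uses fixed-point arithmetic.

```python
DECAY = 0.5 ** (1 / 32)  # y, chosen so that y**32 == 0.5 (assumed half-life)


def update_load(load_sum, runnable_us, period_us=1024):
    """Age the running sum by one period, then add the newest period's runnable time."""
    return load_sum * DECAY + min(runnable_us, period_us)


load = 0.0
for _ in range(100):          # entity runnable for 100 full periods
    load = update_load(load, 1024)
busy = load
for _ in range(32):           # then not runnable for 32 periods
    load = update_load(load, 0)
print(f"after busy phase: {busy:.0f}, after 32 idle periods: {load:.0f}")
```

An entity that was busy recently therefore weighs more than one that was busy long ago, which is what lets the balancer judge each entity separately.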
Group scheduling
• control groups (cgroups)
• task group (cpu cgroup)
• group scheduling
Control groups
• way of grouping arbitrary tasks
• each group can implement what they want • cpu, memcg, block, device, perf, . . . • usually used for resource management
Cgroup filesystem
• maintain group hierarchy in a pseudo fs
Using cgroupfs
# mount -t cgroup -o cpu nodev /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu
# ls
cgroup.clone_children cgroup.event_control cgroup.procs cpu.shares notify_on_release release_agent tasks
# mkdir aaa
# echo $$ > aaa/tasks
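The last two commands translate directly into file operations. A minimal sketch, assuming a cgroup hierarchy mounted as in the transcript above, where each group directory has a `tasks` file; the function `move_to_group` is a made-up helper, and running it against the real hierarchy needs root.

```python
import os


def move_to_group(cgroup_root, group, pid):
    """Create `group` under `cgroup_root` and move task `pid` into it."""
    group_dir = os.path.join(cgroup_root, group)
    os.makedirs(group_dir, exist_ok=True)      # like: mkdir aaa
    with open(os.path.join(group_dir, "tasks"), "w") as tasks:
        tasks.write(f"{pid}\n")                # like: echo $$ > aaa/tasks
    return group_dir


# Example (needs root and a mounted cpu cgroup hierarchy):
# move_to_group("/sys/fs/cgroup/cpu", "aaa", os.getpid())
```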
Scheduling entity
• abstraction of scheduling unit
• traditionally same as a task
• with group scheduling, it can be a task group
Task group
• abstraction of group of tasks
• viewed as a sched entity
Task group + SMP
• tasks in a group can be distributed on multiple CPUs • each cpu sees a portion of a task group
• need to distribute group’s share also • proportional to the number of tasks in a cpu/group
• non-group entity (task) uses its share/load solely
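Distributing a group's share across cpus is simple arithmetic. A sketch assuming all tasks in the group have the same load, so the split depends only on the task count per cpu; `per_cpu_shares` is a hypothetical helper.

```python
def per_cpu_shares(group_shares, tasks_per_cpu):
    """Split a group's shares across cpus in proportion to its runnable tasks there."""
    total = sum(tasks_per_cpu.values())
    if total == 0:
        return {cpu: 0.0 for cpu in tasks_per_cpu}
    return {cpu: group_shares * n / total for cpu, n in tasks_per_cpu.items()}


# A group with shares 1024 and four equally loaded tasks: three on cpu0, one on cpu1.
print(per_cpu_shares(1024, {0: 3, 1: 1}))  # {0: 768.0, 1: 256.0}
```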
Example of group scheduling (1)
• please refer to page 30
• all tasks have the same load
• “professors” group • shares: 1024
• two running tasks: A and B
• “students” group • shares: 512 • no running task
• “system_tasks” group • shares: 512
CPU load in example 1
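The split for example 1 can be worked out as below. This is a sketch with one labeled assumption: the "system_tasks" group is taken to have at least one running task, which the example does not state. The helper `cpu_split` is a made-up name.

```python
def cpu_split(shares):
    """Divide the cpu among runnable groups in proportion to their shares."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# "students" has no running task, so it takes no part in the split.
# Assumption: "system_tasks" has at least one running task.
runnable_groups = {"professors": 1024, "system_tasks": 512}
split = cpu_split(runnable_groups)

for name, fraction in split.items():
    print(f"{name}: {fraction:.1%}")
# Tasks A and B have the same load, so each gets half of the professors' part.
print(f"A and B: {split['professors'] / 2:.1%} each")
```

Under that assumption the professors' group gets two thirds of the cpu and each of A and B gets one third.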
Example of group scheduling (2)
• same conditions as example 1, but . . .
• in the “students” group • “Electronics students” group
• shares: 3072
• three running tasks: f, g, h
• “Computers students” group • shares: 3072
• two running tasks: u, v
• “Other students” group • shares: 1024
CPU load in example 2
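With nested groups the share is divided level by level. The sketch below walks a group tree recursively; several things are assumptions for illustration: the tree layout and the `distribute`/`has_tasks` helpers are hypothetical, "system_tasks" is given one running task named `sys`, the "Other students" group is taken to be idle, and a task sharing a level with groups is given weight 1024.

```python
def has_tasks(node):
    """True if the group or any of its descendants has a running task."""
    return bool(node.get("tasks")) or any(
        has_tasks(child) for child in node.get("groups", {}).values()
    )


def distribute(node, fraction=1.0, out=None):
    """Walk a group tree and return {task: fraction of the cpu}.

    A node is {"shares": int, "tasks": [names], "groups": {name: node}}.
    Child groups without any running task are skipped.
    """
    out = {} if out is None else out
    entities = [(name, 1024, None) for name in node.get("tasks", [])]
    for name, child in node.get("groups", {}).items():
        if has_tasks(child):
            entities.append((name, child["shares"], child))
    total = sum(weight for _, weight, _ in entities)
    for name, weight, child in entities:
        part = fraction * weight / total
        if child is None:
            out[name] = part
        else:
            distribute(child, part, out)
    return out


students = {"shares": 512, "groups": {
    "electronics": {"shares": 3072, "tasks": ["f", "g", "h"]},
    "computers": {"shares": 3072, "tasks": ["u", "v"]},
    "other": {"shares": 1024, "tasks": []},            # assumed idle
}}
root = {"shares": 1024, "groups": {
    "professors": {"shares": 1024, "tasks": ["A", "B"]},
    "students": students,
    "system_tasks": {"shares": 512, "tasks": ["sys"]},  # assumed one task
}}

for task, part in distribute(root).items():
    print(f"{task}: {part:.2%}")
```

Under these assumptions the students' group gets a quarter of the cpu, split evenly between its two active subgroups, so u and v each receive more than f, g, or h.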
Auto group
• applies when a task sits in the (default) root group • not used if it moves to another group
• create a new group for new session • new terminal/login
Check autogroup enabled
# cat /proc/sys/kernel/sched_autogroup_enabled
1
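The same check from a program is a one-file read. A minimal sketch; the helper `autogroup_enabled` is a made-up name, and it returns `None` when the file is missing (for example on a kernel built without autogroup support).

```python
from pathlib import Path

AUTOGROUP_PATH = Path("/proc/sys/kernel/sched_autogroup_enabled")


def autogroup_enabled(path=AUTOGROUP_PATH):
    """Return True/False for the sysctl value, or None if the file is unavailable."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        return None  # not Linux, or kernel built without autogroup support


print("autogroup enabled:", autogroup_enabled())
```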
Group scheduling in Android
• root group • system tasks
• foreground group
• currently running (user-visible) app • shares 95% of cpu
• background group • inactive apps • shares 5% of cpu
Thanks!
Q & A