Scheduling in operating systems


Scheduling (computing)

Context switch

Evolution of the x86 context switch in Linux
Al Viro’s new execve/kernel_thread design
commit 0100301bfdf56a2a370c7157b5ab0fbf9313e1cd
Author: Brian Gerst
Date: Sat Aug 13 12:38:19 2016 -0400
sched/x86: Rewrite the switch_to() code
Why does switch_to use push+jmp+ret to change EIP, instead of jmp directly?


Voluntary Kernel Preemption, 2.6.12-rc4-mm2

Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check. It is lighter
than CONFIG_PREEMPT - at the cost of not having as tight latencies. It
represents a different latency/complexity/overhead tradeoff.
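The might_sleep()/cond_resched() pairing can be sketched with stubs (illustrative stand-ins, not the kernel API; the real kernel tests TIF_NEED_RESCHED and context-switches):

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel's need_resched()/schedule(). */
static bool need_resched_flag;
static int schedule_calls;

static bool need_resched(void) { return need_resched_flag; }
static void schedule(void) { schedule_calls++; need_resched_flag = false; }

/* With CONFIG_PREEMPT_VOLUNTARY every might_sleep() debug check doubles
 * as an explicit preemption point: reschedule only if it is needed. */
static int cond_resched(void)
{
	if (need_resched()) {
		schedule();
		return 1;
	}
	return 0;
}

static void might_sleep(void)
{
	cond_resched();
}
```

This is why voluntary preemption is cheaper than CONFIG_PREEMPT: rescheduling can only happen at these explicit points, not at any instruction.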

voluntary preemption
Optimizing preemption
commit 41719b03091911028116155deddc5eedf8c45e37
Refs: v2.6.29-rc1-226-g41719b030919
Author: Peter Zijlstra
AuthorDate: Wed Jan 14 15:36:26 2009 +0100
Commit: Ingo Molnar
CommitDate: Wed Jan 14 18:09:00 2009 +0100
mutex: preemption fixes
The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.
Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicitly disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.
Firo: spin_unlock_mutex
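A stubbed sketch of the fix's ordering (all functions here are illustrative stand-ins, not the real kernel API): keep preemption disabled from the unlock until schedule() runs, so dropping the wait_lock is no longer a voluntary preemption point that can trigger a second schedule.

```c
static int preempt_count;
static int schedules;

static void preempt_disable(void) { preempt_count++; }
static void spin_unlock_wait_lock(void) { /* no resched check: preemption is off */ }

/* schedule() variant that assumes preemption is already disabled */
static void schedule_preempt_disabled(void)
{
	schedules++;		/* __schedule() */
	preempt_count--;	/* preemption re-enabled on the way out */
}

static void mutex_sleep_path(void)
{
	preempt_disable();		/* 1. close the preemption window  */
	spin_unlock_wait_lock();	/* 2. unlock cannot reschedule now */
	schedule_preempt_disabled();	/* 3. exactly one schedule         */
}
```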

User preemption - user mode in the Linux kernel is always preemptible.

system call returns to user mode: syscall_return_slowpath
interrupt handler returns to user mode: retint_user -> prepare_exit_to_usermode

The Linux kernel's kernel mode is cooperative when CONFIG_PREEMPT is not set.

blocked (which results in a call to schedule())
If a task in the kernel explicitly calls schedule(), it's involuntary!

The Linux kernel's kernel mode is cooperative + preemptive when CONFIG_PREEMPT is set.

  • When an interrupt handler exits, before returning to kernel-space.
  • __local_bh_enable_ip -> preempt_check_resched

    The following also relates to preemption; it's PREEMPT_VOLUNTARY.

    For example, in might_resched(). The task willingly yields the CPU, but it stays on the runqueue.
    bool “Voluntary Kernel Preemption (Desktop)”
    This option reduces the latency of the kernel by adding more
    “explicit preemption points” to the kernel code. These new
    preemption points have been selected to reduce the maximum
    latency of rescheduling, providing faster application reactions,
    at the cost of slightly lower throughput.

  • need_resched - When kernel code becomes preemptible again.

  • set_tsk_need_resched() in resched_curr
    tick: check_preempt_tick or entity_tick
    fork: wake_up_new_task->check_preempt_curr->check_preempt_wakeup
    wakeup: check_preempt_wakeup

  • if (need_resched()) cond_resched();


  • if (!preempt && prev->state) in __schedule(); why prev->state?
    A non-zero prev->state means the task is blocking, so it is deactivated (dequeued).


Process scheduling in Linux – Volker Seeker from University of Edinburgh
A complete guide to Linux process scheduling
JINKYU KOO’s Linux kernel scheduler

TIF_NEED_RESCHED: why is it needed

General runqueues

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
activate_task - move a task to the runqueue.


Improving scheduler latency

CFS runqueues

sched_entity->on_rq, check enqueue_entity

CFS runqueue and sched entity



on_rq should be the same as task->on_rq. It doesn't mean the sched_entity is on a cfs_rq, but on the rq.
commit fd2f4419b4cbe8fe90796df9617c355762afd6a4
Author: Peter Zijlstra
Date: Tue Apr 5 17:23:44 2011 +0200
sched: Provide p->on_rq
p->on_rq on any rq.
se->on_rq on specific rq.

CFS runqueue and task group

sched_create_group -> alloc_fair_sched_group -> init_tg_cfs_entry

CFS core codes

git log 20b8a59f2461e

Group scheduling


Two trees

task_group->parent; task_group->css.cgroup
cgroup->parent and cgroup_tg: container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id), struct task_group, css);

Task group and cgroup are 1:1

System bootup

struct task_group root_task_group; and cpu_cgroup_create;

Creating task_group

task_group 1 : per-CPU 'group sched_entity'
group sched_entity 1 : 1 group cfs_rq
gse_CPUx's load = grq_CPUx's total se load * task_group->shares / the sum over all CPUs of grq se load
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
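The per-CPU share split above can be sketched as (illustrative; the kernel's actual computation adds clamping and approximation on top of this proportion):

```c
/* Rough per-CPU weight of a group's sched_entity: the group's shares
 * are split across CPUs in proportion to how much of the group's load
 * sits on each CPU's group cfs_rq. */
static unsigned long group_se_weight(unsigned long shares,
				     unsigned long grq_cpu_load,
				     unsigned long grq_total_load)
{
	if (!grq_total_load)
		return shares;	/* no load anywhere: keep full shares */
	return shares * grq_cpu_load / grq_total_load;
}
```

For example, with shares = 1024 and half of the group's load on CPU0, CPU0's group entity gets weight 512.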

CFS group scheduling
commit 29f59db3a74b0bdf78a1f5b53ef773caa82692dc
Author: Srivatsa Vaddagiri
Date: Mon Oct 15 17:00:07 2007 +0200
sched: group-scheduler core

Why double for_each_sched_entity

commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
Author: Peter Zijlstra
Date: Mon Nov 15 15:47:00 2010 -0800
sched: Rewrite tg_shares_up

371fd7e7a56a5 (Peter Zijlstra 2010-03-24 16:38:48 +0100 1129) enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1134) for_each_sched_entity(se) {
62fb185130e4d (Peter Zijlstra 2008-02-25 17:34:02 +0100 1135) if (se->on_rq)
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1136) break;
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1137) cfs_rq = cfs_rq_of(se);
88ec22d3edb72 (Peter Zijlstra 2009-12-16 18:04:41 +0100 1138) enqueue_entity(cfs_rq, se, flags);
88ec22d3edb72 (Peter Zijlstra 2009-12-16 18:04:41 +0100 1139) flags = ENQUEUE_WAKEUP;
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1140) }
8f4d37ec073c1 (Peter Zijlstra 2008-01-25 21:08:29 +0100 1141)
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1142) for_each_sched_entity(se) {
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1143) struct cfs_rq *cfs_rq = cfs_rq_of(se);
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1144)
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1145) update_cfs_load(cfs_rq);
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1146) update_cfs_shares(cfs_rq);

Wake up

sched: lockless wake-queues
Futex Scaling for Multi-core SystemsSlides

Program-Order guarantees

commit 8643cda549ca49a403160892db68504569ac9052
Author: Peter Zijlstra
Date: Tue Nov 17 19:01:11 2015 +0100
sched/core, locking: Document Program-Order guarantees

LKML discussions

scheduler ordering bits
scheduler ordering bits -v2


commit b29739f902ee76a05493fb7d2303490fc75364f4
Author: Ingo Molnar
Date: Tue Jun 27 02:54:51 2006 -0700
[PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

rq->lock in schedule() and context_switch()

commit 3a5f5e488ceee9e08df3dff3f01b12fafc9e7e68
Author: Ingo Molnar
Date: Fri Jul 14 00:24:27 2006 -0700
[PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSW
+ * Since the runqueue lock will be released by the next
+ * task

Running time



Why scheduling?

Users demand multitasking/concurrency
Processes get blocked


Unit: /proc/sys/kernel/sched_min_granularity_ns


CPU capacity

Running Compensator records the running process

task_tick_fair -> entity_tick
sum_exec_runtime - total runtime
cfs_rq->exec_clock - cfs_rq runtime
vruntime - advances in inverse proportion to the weight or priority
cfs_rq->curr, leftmost, min_vruntime, who is min?
cpuacct - cpu sys/user time
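The vruntime weighting above can be sketched as (simplified; the kernel multiplies by a precomputed inverse weight instead of dividing):

```c
/* Weighted vruntime advance, in the spirit of the kernel's
 * calc_delta_fair(): a nice-0 task (weight 1024) accrues vruntime at
 * wall-clock speed, a heavier task accrues it proportionally slower. */
#define NICE_0_LOAD 1024UL

static unsigned long long calc_delta_fair(unsigned long long delta_exec,
					  unsigned long weight)
{
	return delta_exec * NICE_0_LOAD / weight;
}
```

So after 1 ms of CPU, a nice-0 task's vruntime grows by 1 ms, while a task with twice the weight grows by only 0.5 ms, letting it run longer before losing the leftmost position.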

Next -> pick_next_task_fair

put_prev_entity: update_curr; insert into rb-tree;
pick_next_entity: left most of rb-tree.
set_next_entity: remove next from the tree, since its constantly updated vruntime would disturb insertion and deletion.
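A toy sketch of this cycle, with a sorted array standing in for the vruntime-ordered rb-tree ("timeline"): the running entity is taken out of the tree so the tick can keep bumping its vruntime without re-sorting, and is re-inserted when it is put back.

```c
#define MAX_SE 8
static long long timeline[MAX_SE];	/* queued vruntimes, sorted */
static int nr_queued;

/* put_prev_entity: re-insert the previous task at its final vruntime */
static void put_prev_entity(long long vruntime)
{
	int i = nr_queued++;

	while (i > 0 && timeline[i - 1] > vruntime) {	/* sorted insert */
		timeline[i] = timeline[i - 1];
		i--;
	}
	timeline[i] = vruntime;
}

/* pick_next_entity: leftmost = minimum vruntime */
static long long pick_next_entity(void)
{
	return timeline[0];
}

/* set_next_entity: remove the picked entity from the tree */
static long long set_next_entity(void)
{
	long long v = timeline[0];
	int i;

	for (i = 1; i < nr_queued; i++)
		timeline[i - 1] = timeline[i];
	nr_queued--;
	return v;
}
```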




try_to_wake_up->ttwu_queue->ttwu_do_activate->, or local wakeup: schedule->try_to_wake_up_local->
ttwu_activate #=== special compensation and enqueue on rq
p->on_rq = TASK_ON_RQ_QUEUED #=== 1) rq for task; 2)
ttwu_do_wakeup #=== normal compensation
p->state = TASK_RUNNING;

enqueue_task -> place_entity: compensation for the woken-up process

wake up a sleeping task

se->on_rq & TASK_ON_RQ_QUEUED; deactivate_task sets on_rq to 0;
enqueue_task_fair handles the group stuff
enqueue_entity deals with the sched_entity - updates the vruntime, load average, load accounting, NUMA preference
sysctl_sched_latency: CFS pledges to pre-existing tasks that they get 6 ms to run before a new task runs.
try_to_wake_up_local for a local task
try_to_wake_up for any task

New task

special debit compensation: sched_fork -> task_fork_fair -> place_entity - compensation for a new process
normal compensation: wake_up_new_task
activate_task #=== special compensation
check_preempt_curr #=== normal compensation


  • weight
  • priority


  1. sched_nr_latency = /proc/sys/kernel/sched_latency_ns / /proc/sys/kernel/sched_min_granularity_ns
  2. if the number of running processes > sched_nr_latency, the latency target cannot be kept; just focus on min granularity


  3. is the difference between leftmost and rightmost smaller than sched_min_granularity_ns?

  4. sched_slice


    blocked & schedule
    check preempt & schedule
    check_preempt_tick # the new task preempts curr
    curr's running time > sched_slice # it has had enough time; yield.
    curr - leftmost > sched_slice # be nice to others.
    check_preempt_wakeup # the woken task preempts curr
    curr - woken > sysctl_sched_wakeup_granularity; # passed the wakeup-preempt delay

    io wait

    Load avg

    Check External links
    Etymology of avenrun: average nr. of running processes during


  5. h_nr_running and throttled
    sched: Implement hierarchical task accounting for SCHED_OTHER - 953bfcd10e6f3697233e8e5128c611d275da39c1
      root
      /  \
     t1  t2
    root.nr_running := 2
    root.h_nr_running := 2
    Check enqueue_task_fair()

  6. idle
    improve SMP reschedule and idle routines
    TIF_POLLING_NRFLAG -> Need-Resched-Flag?

  7. process migration
    Introduce cpu_active_map and redo sched domain managment
    When to migrate
    sched_setaffinity -> __set_cpus_allowed_ptr, manually
    Selecting a new CPU when waking up a sleeper
    For balancing, selecting a CPU when waking up a new process in _do_fork
    execve's sched_exec

  8. scheduler clock
    rq->clock is in nanoseconds?

  9. clock_task and wraps

    no standalone commit
  • why ahead?
    /*
     * Place new tasks ahead so that they do not starve already running
     * tasks
     */
    the tree is named timeline
  • Improving scheduler latency
  • skip next last buddy
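Items 1-4 above can be sketched as follows (classic default constants in nanoseconds; the real kernel additionally scales them by the number of CPUs):

```c
/* Simplified sched_slice() / check_preempt_tick(). */
#define SCHED_LATENCY_NS	6000000ULL	/* 6 ms latency target */
#define MIN_GRANULARITY_NS	 750000ULL	/* 0.75 ms             */
#define SCHED_NR_LATENCY	(SCHED_LATENCY_NS / MIN_GRANULARITY_NS)

static unsigned long long sched_period(unsigned int nr_running)
{
	/* with more runnable tasks than sched_nr_latency the 6 ms
	 * target cannot hold; fall back to min granularity per task */
	if (nr_running > SCHED_NR_LATENCY)
		return nr_running * MIN_GRANULARITY_NS;
	return SCHED_LATENCY_NS;
}

/* each task's slice is its weighted share of the period */
static unsigned long long sched_slice(unsigned int nr_running,
				      unsigned long weight,
				      unsigned long total_weight)
{
	return sched_period(nr_running) * weight / total_weight;
}

static int preempt_at_tick(unsigned long long ran_ns,
			   unsigned long long slice_ns,
			   long long curr_vruntime,
			   long long leftmost_vruntime)
{
	if (ran_ns > slice_ns)			/* slice used up: yield */
		return 1;
	if (ran_ns < MIN_GRANULARITY_NS)	/* too soon to preempt  */
		return 0;
	/* be nice to others: far ahead of the leftmost task? */
	return curr_vruntime - leftmost_vruntime > (long long)slice_ns;
}
```

With two equal-weight tasks, each gets a 3 ms slice out of the 6 ms period; with 16 runnable tasks the period stretches to 16 * 0.75 ms = 12 ms.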

Git log

TASK_WAKING; see migrate_task_rq_fair and try_to_wake_up
In order to remove the cfs_rq dependency from set_task_cpu() we need to ensure the task is cfs_rq invariant for all callsites.
sched/fair: Fix fairness issue on migration
Migrated CFS task getting an unfair advantage
curr was not kept in rb-tree

Load balancing

Scheduling domains
set sd
visit_domain_allocation_hell() -> sdt_alloc() allocates sdd->sg, which is later used to build the groups;
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), ...) also covers the size of the cpumask
/* Build the groups for the domains */

static noinline struct sched_domain *
sd_init_##type(struct sched_domain_topology_level *tl, int cpu)
{
	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
	*sd = SD_##type##_INIT;
	SD_INIT_NAME(sd, type);
	sd->private = &tl->data;
	return sd;
}

static struct sched_domain_topology_level default_topology[] = {
	{ sd_init_SIBLING, cpu_smt_mask, },
	{ sd_init_MC, cpu_coregroup_mask, },
	{ sd_init_BOOK, cpu_book_mask, },
	{ sd_init_CPU, cpu_cpu_mask, },
	{ NULL, },
};

Leaf CFS runqueues leaf_cfs_rq


commit 6aa645ea5f7a246702e07f29edc7075d487ae4a3
Refs: v2.6.22-14-g6aa645ea5f7a
Author: Ingo Molnar
AuthorDate: Mon Jul 9 18:51:58 2007 +0200
Commit: Ingo Molnar
CommitDate: Mon Jul 9 18:51:58 2007 +0200
sched: cfs rq data types
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
* (like users, containers etc.)
* leaf_cfs_rq_list ties together list of leaf cfs_rq’s in a cpu. This
* list is used during load balance.
Head of list: rq->leaf_cfs_rq_list

Core load_balance_fair

commit bf0f6f24a1ece8988b243aefe84ee613099a9245
Refs: v2.6.22-10-gbf0f6f24a1ec
Author: Ingo Molnar
AuthorDate: Mon Jul 9 18:51:58 2007 +0200
Commit: Ingo Molnar
CommitDate: Mon Jul 9 18:51:58 2007 +0200
sched: cfs core, kernel/sched_fair.c
add kernel/sched_fair.c - which implements the bulk of CFS’s
behavioral changes for SCHED_OTHER tasks.
+load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {

make parent appear after us.

commit 67e86250f8ea7b8f7da53ac25ea73c6bd71f5cd9
Author: Paul Turner
Date: Mon Nov 15 15:47:05 2010 -0800
sched: Introduce hierarchal order on shares update list
Avoid duplicate shares update calls by ensuring children always appear before
parents in rq->leaf_cfs_rq_list. # leaf's meaning is changed
This allows us to do a single in-order traversal for update_shares().
Since we always enqueue in bottom-up order this reduces to 2 cases:
1) Our parent is already in the list, e.g.

root
  \
   b
  / \
 c   d* (root->b->c already enqueued)
Since d’s parent is enqueued we push it to the head of the list, implicitly ahead of b.
2) Our parent does not appear in the list (or we have no parent)
In this case we enqueue to the tail of the list, if our parent is subsequently enqueued
(bottom-up) it will appear to our right by the same rule.
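The two cases can be sketched with an array standing in for rq->leaf_cfs_rq_list (illustrative, not the kernel's linked-list implementation): enqueue bottom-up; if the parent is already listed, push to the head (implicitly ahead of it), otherwise append to the tail so a later parent lands behind.

```c
#include <string.h>

#define MAX_RQ 8
static const char *leaf_list[MAX_RQ];
static int leaf_nr;

static int on_list(const char *name)
{
	int i;

	for (i = 0; i < leaf_nr; i++)
		if (strcmp(leaf_list[i], name) == 0)
			return 1;
	return 0;
}

static void list_add_leaf_cfs_rq(const char *name, const char *parent)
{
	if (on_list(name))
		return;
	if (parent && on_list(parent)) {
		/* case 1: parent already listed -> head of the list */
		memmove(&leaf_list[1], &leaf_list[0],
			leaf_nr * sizeof(leaf_list[0]));
		leaf_list[0] = name;
	} else {
		/* case 2: no listed parent -> tail of the list */
		leaf_list[leaf_nr] = name;
	}
	leaf_nr++;
}
```

Enqueuing a task in c (root->b->c, bottom-up) gives [c, b]; enqueuing d afterwards, whose parent b is already listed, gives [d, c, b] - every child ahead of its parent, so one in-order traversal updates shares correctly.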


commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647
Refs: v4.9-rc5-195-g9c2791f936ef
Author: Vincent Guittot
AuthorDate: Tue Nov 8 10:53:43 2016 +0100
Commit: Ingo Molnar
CommitDate: Wed Nov 16 10:29:08 2016 +0100
sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
67e86250f8ea (“sched: Introduce hierarchal order on shares update list”)

commit 5d299eabea5a251fbf66e8277704b874bbba92dc
Author: Peter Zijlstra
Date: Wed Jan 30 14:41:04 2019 +0100
sched/fair: Add tmp_alone_branch assertion
The magic in list_add_leaf_cfs_rq() requires that at the end of
rq->tmp_alone_branch == &rq->leaf_cfs_rq_list

load_balance_fair - removed

commit 9763b67fb9f3050c6da739105888327587c30c4d
Refs: v3.0-rc7-197-g9763b67fb9f3
Author: Peter Zijlstra
AuthorDate: Wed Jul 13 13:09:25 2011 +0200
Commit: Ingo Molnar
CommitDate: Thu Jul 21 18:01:46 2011 +0200
sched, cgroup: Optimize load_balance_fair()
Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this
achieves that load_balance_fair() only iterates those task_groups that
actually have tasks on busiest, and that we iterate bottom-up, trying to
move light groups before the heavier ones.

Throttling entities

commit 85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Author: Paul Turner
Date: Thu Jul 21 09:43:33 2011 -0700
sched: Add support for throttling group entities
Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where there is no
run-time remaining.