Scheduling in operating systems


Scheduling (computing)

Context switch

Evolution of the x86 context switch in Linux
Al Viro’s new execve/kernel_thread design
commit 0100301bfdf56a2a370c7157b5ab0fbf9313e1cd
Author: Brian Gerst
Date: Sat Aug 13 12:38:19 2016 -0400
sched/x86: Rewrite the switch_to() code
Why does switch_to use push+jmp+ret to change EIP, instead of jmp directly?


Process scheduling in Linux – Volker Seeker from University of Edinburgh
A complete guide to Linux process scheduling
JINKYU KOO’s Linux kernel scheduler

TIF_NEED_RESCHED: why is it needed


Improving scheduler latency

General runqueues

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
activate_task - move a task to the runqueue.
wake_up_new_task ttwu_do_activate


commit fd2f4419b4cbe8fe90796df9617c355762afd6a4
Author: Peter Zijlstra
Date: Tue Apr 5 17:23:44 2011 +0200
sched: Provide p->on_rq

CFS core code

git log 20b8a59f2461e
sched_create_group -> alloc_fair_sched_group -> init_tg_cfs_entry

Wake up

sched: lockless wake-queues
Futex Scaling for Multi-core Systems (slides)

Program-Order guarantees

commit 8643cda549ca49a403160892db68504569ac9052
Author: Peter Zijlstra
Date: Tue Nov 17 19:01:11 2015 +0100
sched/core, locking: Document Program-Order guarantees

LKML discussions

scheduler ordering bits
scheduler ordering bits -v2


commit b29739f902ee76a05493fb7d2303490fc75364f4
Author: Ingo Molnar
Date: Tue Jun 27 02:54:51 2006 -0700
[PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

rq->lock in schedule() and context_switch()

commit 3a5f5e488ceee9e08df3dff3f01b12fafc9e7e68
Author: Ingo Molnar
Date: Fri Jul 14 00:24:27 2006 -0700
[PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSW
+ * Since the runqueue lock will be released by the next
+ * task

Runtime accounting - recording the running process

task_tick_fair -> entity_tick
sum_exec_runtime - total runtime
cfs_rq->exec_clock - cfs_rq runtime
vruntime - advances in inverse proportion to the weight (priority)
cfs_rq->curr vs. leftmost: which one holds min_vruntime?
cpuacct - cpu sys/user time

Next -> pick_next_task_fair

put_prev_entity: update_curr; insert back into the rb-tree.
pick_next_entity: take the leftmost entity of the rb-tree.
set_next_entity: remove next from the tree, since keeping it there while its vruntime is being updated would disturb insertion and deletion.




try_to_wake_up -> ttwu_queue -> ttwu_do_activate ->; or for a local wakeup: schedule -> try_to_wake_up_local ->
ttwu_activate #=== special compensation and enqueue onto the rq
p->on_rq = TASK_ON_RQ_QUEUED #=== 1) rq for task; 2)
ttwu_do_wakeup #=== normal compensation
p->state = TASK_RUNNING;

enqueue_task -> place_entity: compensation for the woken-up process

Waking up a sleeping task

se->on_rq and TASK_ON_RQ_QUEUED; deactivate_task sets on_rq to 0;
enqueue_task_fair handles the group stuff
enqueue_entity deals with the sched_entity - updates the vruntime, load average, load accounting, NUMA preferences
sysctl_sched_latency: CFS pledges to the pre-existing tasks that they get 6 ms to run before a new task runs.
try_to_wake_up_local for a local task
try_to_wake_up for any task

New task

special debit compensation: sched_fork -> task_fork_fair -> place_entity - compensation for a new process
normal compensation: wake_up_new_task
activate_task #=== special compensation
check_preempt_curr #=== normal compensation


  • weight
  • priority


  1. sched_nr_latency = /proc/sys/kernel/sched_latency_ns / /proc/sys/kernel/sched_min_granularity_ns
  2. if the number of running processes exceeds sched_nr_latency, the latency target cannot be ensured; just focus on the min granularity


  3. is the difference between the leftmost and rightmost vruntime smaller than sched_min_granularity_ns?

  4. sched_slice


    blocked & schedule
    check preempt & schedule
    check_preempt_tick # new preempts curr
    curr running time > sched_slice # curr had enough time; yield.
    curr - leftmost > sched_slice # be nice to others.
    check_preempt_wakeup # the woken task preempts curr
    curr - woken > sysctl_sched_wakeup_granularity; # passes the wakeup-preempt delay

    io wait

    Load avg

    Check External links
    Etymology of avenrun: average nr. of running processes during


    improve SMP reschedule and idle routines
    TIF_POLLING_NRFLAG -> Need-Resched-Flag?

  5. process migration
    Introduce cpu_active_map and redo sched domain managment
    When migration happens:
    sched_setaffinity -> __set_cpus_allowed_ptr, manually
    Selecting a new CPU while waking up a sleeper
    For balancing, selecting a CPU while waking up a new process in _do_fork
    execve's sched_exec

  6. scheduler clock
    rq->clock is in nanoseconds

  7. clock_task and wraps

    no standalone commit
  • why ahead?
    /*
     * Place new tasks ahead so that they do not starve already running
     * tasks
     */
    the tree is named timeline
  • Improving scheduler latency
  • skip, next and last buddies

Load balancing
Scheduling domains
sched_init_domains (or init_sched_domains) -> build_sched_domains
visit_domain_allocation_hell() -> sdt_alloc() allocates sdd->sg, which is used when building the groups,
with sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), ...); the allocation also covers the size of the cpumask
/* Build the groups for the domains */

static noinline struct sched_domain *
sd_init_##type(struct sched_domain_topology_level *tl, int cpu)
{
	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
	*sd = SD_##type##_INIT;
	SD_INIT_NAME(sd, type);
	sd->private = &tl->data;
	return sd;
}
sched_domain_topology_level default_topology

Throttling entities

commit 85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Author: Paul Turner
Date: Thu Jul 21 09:43:33 2011 -0700
sched: Add support for throttling group entities
Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where this is no
run-time remaining.

Load tracking - PELT

Per-entity load tracking
commit 5b51f2f80b3b906ce59bd4dce6eca3c7f34cb1b9
Author: Paul Turner
Date: Thu Oct 4 13:18:32 2012 +0200
sched: Make __update_entity_runnable_avg() fast
commit a481db34b9beb7a9647c23f2320dd38a2b1d681f
Refs: v4.11-rc2-229-ga481db34b9be
Author: Yuyang Du
AuthorDate: Mon Feb 13 05:44:23 2017 +0800
sched/fair: Optimize ___update_sched_avg()