Scheduling in operating systems

scheduling

Scheduling (computing)

Context switch

Evolution of the x86 context switch in Linux
Al Viro’s new execve/kernel_thread design
commit 0100301bfdf56a2a370c7157b5ab0fbf9313e1cd
Author: Brian Gerst brgerst@gmail.com
Date: Sat Aug 13 12:38:19 2016 -0400
sched/x86: Rewrite the switch_to() code
Why does switch_to use push+jmp+ret to change EIP, instead of jmp directly?

Preemption

Voluntary Kernel Preemption, 2.6.12-rc4-mm2

Voluntary preemption works by adding a cond_resched()
(reschedule-if-needed) call to every might_sleep() check. It is lighter
than CONFIG_PREEMPT - at the cost of not having as tight latencies. It
represents a different latency/complexity/overhead tradeoff.
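
The mechanism is small enough to show; a simplified excerpt (details differ between kernel versions, this roughly follows the 4.x sources):

/* PREEMPT_VOLUNTARY: every might_sleep() debug check doubles as a
 * rescheduling point. Simplified from include/linux/kernel.h and
 * kernel/sched/core.c. */
#ifdef CONFIG_PREEMPT_VOLUNTARY
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

#define might_sleep() \
        do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)

int __sched _cond_resched(void)
{
        if (should_resched(0)) {        /* need_resched set and preemptible? */
                preempt_schedule_common();
                return 1;
        }
        return 0;
}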

voluntary preemption
Optimizing preemption
commit 41719b03091911028116155deddc5eedf8c45e37
Refs: v2.6.29-rc1-226-g41719b030919
Author: Peter Zijlstra a.p.zijlstra@chello.nl
AuthorDate: Wed Jan 14 15:36:26 2009 +0100
Commit: Ingo Molnar mingo@elte.hu
CommitDate: Wed Jan 14 18:09:00 2009 +0100
mutex: preemption fixes
The problem is that dropping the spinlock right before schedule is a voluntary
preemption point and can cause a schedule, right after which we schedule again.
Fix this inefficiency by keeping preemption disabled until we schedule, do this
by explicitly disabling preemption and providing a schedule() variant that
assumes preemption is already disabled.
Firo: spin_unlock_mutex

User preemption - code running in user mode is always preemptible (user preemption). It happens when:

a system call returns to user mode: syscall_return_slowpath
an interrupt handler returns to user mode: retint_user -> prepare_exit_to_usermode

Linux kernel mode is cooperative when CONFIG_PREEMPT is not set.

blocked (which results in a call to schedule())
If a task in the kernel explicitly calls schedule() while still runnable, the switch is counted as involuntary!!!

Linux kernel mode is cooperative + preemptive when CONFIG_PREEMPT is set.

  • When an interrupt handler exits, before returning to kernel-space.
    retint_kernel->preempt_schedule_irq->cond_resched
  • __local_bh_enable_ip -> preempt_check_resched

    The following also relates to preemption; it's PREEMPT_VOLUNTARY.

    For example, in might_resched(). The task willingly yields the CPU, but it stays on the rq.
    config PREEMPT_VOLUNTARY
    bool “Voluntary Kernel Preemption (Desktop)”
    help
    This option reduces the latency of the kernel by adding more
    “explicit preemption points” to the kernel code. These new
    preemption points have been selected to reduce the maximum
    latency of rescheduling, providing faster application reactions,
    at the cost of slightly lower throughput.

  • need_resched - When kernel code becomes preemptible again.

  • set_tsk_need_resched() in resched_curr
    tick: check_preempt_tick or entity_tick
    fork: wake_up_new_task->check_preempt_curr->check_preempt_wakeup
    wakeup: check_preempt_wakeup

  • if (need_resched()) cond_resched();

LQO

  • if (!preempt && prev->state) in __schedule; why prev->state?
    prev->state means deactivate.

Reference

Process scheduling in Linux – Volker Seeker from University of Edinburgh
A complete guide to Linux process scheduling
https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt
JINKYU KOO’s Linux kernel scheduler

TIF_NEED_RESCHED: why is it needed

General runqueues

static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
activate_task - move a task to the runqueue.
wake_up_new_task
ttwu_do_activate
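
For reference, the per-CPU runqueue above is reached through these accessors (approximately as in kernel/sched/sched.h; older kernels spell this_rq() with __get_cpu_var()):

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

#define cpu_rq(cpu)     (&per_cpu(runqueues, (cpu)))
#define this_rq()       this_cpu_ptr(&runqueues)
#define task_rq(p)      cpu_rq(task_cpu(p))
#define cpu_curr(cpu)   (cpu_rq(cpu)->curr)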

Latency

Improving scheduler latency

CFS runqueues

cfs_rq
on_list
sched_entity->on_rq, check enqueue_entity

CFS runqueue and sched entity

set_task_rq

on_rq

task->on_rq normally agrees with se->on_rq, but it means the task is queued on an rq; it does not say which cfs_rq the sched_entity is on.
commit fd2f4419b4cbe8fe90796df9617c355762afd6a4
Author: Peter Zijlstra a.p.zijlstra@chello.nl
Date: Tue Apr 5 17:23:44 2011 +0200
sched: Provide p->on_rq
p->on_rq on any rq.
se->on_rq on specific rq.

CFS runqueue and task group

sched_create_group -> alloc_fair_sched_group -> init_tg_cfs_entry

CFS core codes

git log 20b8a59f2461e

Group scheduling

GROUP SCHEDULER EXTENSIONS TO CFS
CFS scheduler (3): group scheduling (CFS调度器(3)-组调度)
Analysis of the Linux process group scheduling mechanism (Linux进程组调度机制分析)

Two trees

task_group->parent; task_group->css.cgroup
cgroup->parent and cgroup_tg: container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id), struct task_group, css);

Task groups and cgroups are 1:1

System bootup

struct task_group root_task_group; and cpu_cgroup_create;

Creating task_group

sched_create_group
A task_group has one ‘group sched_entity’ per CPU.
Each group sched_entity owns exactly one group cfs_rq (1:1).
gse_CPUx’s weight = task_group->shares * grq_CPUx’s load / (sum of grq load over all CPUs)
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq “owned” by this entity/group: */
struct cfs_rq *my_q;
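
A worked example (numbers made up for illustration): with task_group->shares = 1024, grq_CPU0 load = 2048 and grq_CPU1 load = 1024, the total group load is 3072, so gse_CPU0’s weight = 1024 * 2048 / 3072 ≈ 683 and gse_CPU1’s weight = 1024 * 1024 / 3072 ≈ 341; together the group still competes like a single nice-0 task.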

CFS group scheduling
commit 29f59db3a74b0bdf78a1f5b53ef773caa82692dc
Author: Srivatsa Vaddagiri vatsa@linux.vnet.ibm.com
Date: Mon Oct 15 17:00:07 2007 +0200
sched: group-scheduler core

Why double for_each_sched_entity

commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
Author: Peter Zijlstra a.p.zijlstra@chello.nl
Date: Mon Nov 15 15:47:00 2010 -0800
sched: Rewrite tg_shares_up()

371fd7e7a56a5 (Peter Zijlstra 2010-03-24 16:38:48 +0100 1129) enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1134) for_each_sched_entity(se) {
62fb185130e4d (Peter Zijlstra 2008-02-25 17:34:02 +0100 1135) if (se->on_rq)
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1136) break;
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1137) cfs_rq = cfs_rq_of(se);
88ec22d3edb72 (Peter Zijlstra 2009-12-16 18:04:41 +0100 1138) enqueue_entity(cfs_rq, se, flags);
88ec22d3edb72 (Peter Zijlstra 2009-12-16 18:04:41 +0100 1139) flags = ENQUEUE_WAKEUP;
bf0f6f24a1ece (Ingo Molnar 2007-07-09 18:51:58 +0200 1140) }
8f4d37ec073c1 (Peter Zijlstra 2008-01-25 21:08:29 +0100 1141)
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1142) for_each_sched_entity(se) {
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1143) struct cfs_rq *cfs_rq = cfs_rq_of(se);
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1144)
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1145) update_cfs_load(cfs_rq);
2069dd75c7d0f (Peter Zijlstra 2010-11-15 15:47:00 -0800 1146) update_cfs_shares(cfs_rq);

In short: the first loop enqueues se and its ancestors until it hits an entity that is already on its cfs_rq; the second loop (added by the tg_shares_up rewrite) keeps walking up the hierarchy so that the load and shares of the already-enqueued ancestors are refreshed too.

Wake up

sched: lockless wake-queues
Futex Scaling for Multi-core Systems - Slides
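
A minimal usage sketch of the lockless wake-queue API from that patch set (example_unlock_and_wake() is an illustrative name; DEFINE_WAKE_Q is spelled WAKE_Q in the original series):

/* Queue wakeups while holding a lock, perform them after dropping it,
 * so the woken tasks never immediately contend on that lock. */
static void example_unlock_and_wake(spinlock_t *lock, struct task_struct *waiter)
{
        DEFINE_WAKE_Q(wake_q);          /* on-stack queue, no extra locking */

        spin_lock(lock);
        wake_q_add(&wake_q, waiter);    /* grabs a task reference, no wakeup yet */
        spin_unlock(lock);

        wake_up_q(&wake_q);             /* the deferred try_to_wake_up() calls */
}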

Program-Order guarantees

commit 8643cda549ca49a403160892db68504569ac9052
Author: Peter Zijlstra peterz@infradead.org
Date: Tue Nov 17 19:01:11 2015 +0100
sched/core, locking: Document Program-Order guarantees

LKML discussions

scheduler ordering bits
scheduler ordering bits -v2

pi_lock

commit b29739f902ee76a05493fb7d2303490fc75364f4
Author: Ingo Molnar mingo@elte.hu
Date: Tue Jun 27 02:54:51 2006 -0700
[PATCH] pi-futex: scheduler support for pi
Add framework to boost/unboost the priority of RT tasks.

rq->lock in schedule() and context_switch()

commit 3a5f5e488ceee9e08df3dff3f01b12fafc9e7e68
Author: Ingo Molnar mingo@elte.hu
Date: Fri Jul 14 00:24:27 2006 -0700
[PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSW
+ * Since the runqueue lock will be released by the next
+ * task

Running time

proc_sched_show_task

Problems

Why scheduling?

Users demand multitasking/concurrency.
Processes block (e.g. waiting for I/O), so the CPU should be given to something else.

Fairness

Unit: /proc/sys/kernel/sched_min_granularity_ns

Concepts

CPU capacity

Running Compensator records the running process

scheduler_tick
{
    update_rq_clock
    task_tick_fair -> entity_tick
    {
        update_curr
        {
            sum_exec_runtime - total runtime
            cfs_rq->exec_clock - cfs_rq runtime
            vruntime - grows inversely proportionally to the weight (priority)
            update_min_vruntime
            {
                cfs_rq->curr, leftmost, min_vruntime - who is min?
            }
            cpuacct - cpu sys/user time
        }
    }
}
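
The weighting step inside update_curr(), as a sketch (assumes NICE_0_LOAD == 1024 and ignores the kernel's fixed-point inverse-weight optimization):

/* Sketch: wall-clock delta_exec is scaled by NICE_0_LOAD / weight before
 * being added to vruntime, so heavier (higher priority) entities gain
 * vruntime more slowly. */
static u64 calc_delta_fair_sketch(u64 delta_exec, unsigned long weight)
{
        if (weight != 1024 /* NICE_0_LOAD */)
                delta_exec = delta_exec * 1024 / weight;
        return delta_exec;
}

/* in update_curr():
 *   delta_exec              = now - curr->exec_start;
 *   curr->sum_exec_runtime += delta_exec;
 *   curr->vruntime          += calc_delta_fair_sketch(delta_exec, curr->load.weight);
 *   update_min_vruntime(cfs_rq);
 */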

Next -> pick_next_task_fair

put_prev_entity: update_curr; insert prev back into the rb-tree;
pick_next_entity: pick the leftmost entity of the rb-tree.
set_next_entity: remove the picked entity from the rb-tree, since keeping it there while its stats are being updated would disturb insertion and deletion.
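
Roughly what the leftmost pick looks like (assuming the older layout where tasks_timeline is a plain rb_root; newer kernels cache the leftmost node via rb_first_cached):

/* The 'timeline' rb-tree is keyed by vruntime, so the next entity to run
 * is simply the leftmost node. */
static struct sched_entity *pick_first_entity_sketch(struct cfs_rq *cfs_rq)
{
        struct rb_node *left = rb_first(&cfs_rq->tasks_timeline);

        if (!left)
                return NULL;
        return rb_entry(left, struct sched_entity, run_node);
}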

Unrunnable

dequeue_task

Resuming

try_to_wake_up->ttwu_queue->ttwu_do_activate-> or local wakeup: schedule->try_to_wake_up_local->
{
    ttwu_activate #=== special compensation and enqueue on the rq
    {
        activate_task
        p->on_rq = TASK_ON_RQ_QUEUED #=== 1) rq for task; 2)
    }
    ttwu_do_wakeup #=== normal compensation
    {
        check_preempt_curr
        p->state = TASK_RUNNING;
    }
}

enqueue_task -> place_entity: compensation for the woken-up process (see the place_entity sketch after the 'New task' section)

Wake up a sleeping task

se->on_rq & TASK_ON_RQ_QUEUED; deactivate_task sets on_rq to 0;
enqueue_task_fair handles the group stuff
enqueue_entity deals with the sched_entity - updates the vruntime, load average, load accounting and NUMA preference
sysctl_sched_latency: CFS pledges to the pre-existing tasks that they get 6ms to run before the new task runs.
try_to_wake_up_local for a task on the local rq
try_to_wake_up for any task

New task

special debit compensation: sched_fork->task_fork_fair->place_entity - compensation for a new process
normal compensation: wake_up_new_task
{
    activate_task #=== special compensation
    check_preempt_curr #=== normal compensation
}
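
A minimal sketch of the two place_entity() cases used above and in the wakeup section (pre-EEVDF CFS, assuming START_DEBIT and GENTLE_FAIR_SLEEPERS are enabled; not the exact kernel code):

static void place_entity_sketch(struct cfs_rq *cfs_rq,
                                struct sched_entity *se, int initial)
{
        u64 vruntime = cfs_rq->min_vruntime;

        if (initial)
                /* new task: debit a full vslice so it cannot starve others */
                vruntime += sched_vslice(cfs_rq, se);
        else
                /* waking sleeper: credit up to half a latency period */
                vruntime -= sysctl_sched_latency / 2;

        /* never move an entity's vruntime backwards */
        se->vruntime = max_vruntime(se->vruntime, vruntime);
}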

Priority

  • weight
  • priority
    DEFAULT_PRIO
    fs/proc/array.c
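
For reference, an excerpt of the nice-to-weight table CFS uses (prio_to_weight[] / sched_prio_to_weight[] in the kernel); each nice step is roughly a factor of 1.25, i.e. about 10% CPU:

/* NICE_0_LOAD = 1024; weight(nice) is roughly 1024 / 1.25^nice. */
static const int prio_to_weight_excerpt[] = {
        /* nice -20 */ 88761,
        /* nice -10 */  9548,
        /* nice   0 */  1024,
        /* nice  10 */   110,
        /* nice  19 */    15,
};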

Latency

  1. sched_nr_latency = /proc/sys/kernel/sched_latency_ns / /proc/sys/kernel/sched_min_granularity_ns
  2. if the number of running tasks > sched_nr_latency, the latency target cannot be kept; the period then just grows to nr_running * sched_min_granularity_ns

    LQO

  3. is the difference between the leftmost and the rightmost smaller than sched_min_granularity_ns??

  4. sched_slice (see the sketch after the check_preempt notes below)

    Energy

    blocked & schedule
    check preempt & schedule
    check_preempt_tick # does the next (leftmost) entity preempt curr at the tick?
    {
        curr's running time > sched_slice # curr has had its slice; yield.
        curr->vruntime - leftmost->vruntime > sched_slice # be nice to others.
    }
    check_preempt_wakeup # does the woken task preempt curr?
    {
        curr->vruntime - woken->vruntime > sysctl_sched_wakeup_granularity # past the wakeup-preemption delay
    }
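
    A sketch of the sched_slice() idea referenced in item 4 and in check_preempt_tick above (classic CFS, names as in kernel/sched/fair.c):

    /* Each runnable entity gets a weight-proportional share of the latency
     * period; with more than sched_nr_latency tasks the period stretches so
     * nobody gets less than the minimum granularity. */
    static u64 sched_period_sketch(unsigned long nr_running)
    {
            if (nr_running > sched_nr_latency)
                    return nr_running * sysctl_sched_min_granularity;
            return sysctl_sched_latency;
    }

    /* slice for se = sched_period_sketch(cfs_rq->nr_running)
     *                * se->load.weight / cfs_rq->load.weight */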

    io wait

    https://lwn.net/Articles/342378/

    Load avg

    update_loadavg
    https://en.wikipedia.org/wiki/Load_(computing)
    Check External links
    calc_load_fold_active
    Etymology of avenrun: average nr. of running processes during the last 1, 5 and 15 minutes
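
    A sketch of the fixed-point exponential average behind avenrun (constants are the kernel's from include/linux/sched/loadavg.h; rounding is omitted):

    #define FSHIFT  11
    #define FIXED_1 (1 << FSHIFT)          /* 2048 */
    #define EXP_1   1884                   /* FIXED_1 / exp(5s/1min) */
    /* called roughly every 5 seconds with the folded count of running +
     * uninterruptible tasks, in fixed point */
    static unsigned long calc_load_sketch(unsigned long load,
                                          unsigned long exp,
                                          unsigned long active)
    {
            load *= exp;
            load += active * (FIXED_1 - exp);
            return load >> FSHIFT;
    }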

    LQO

  5. h_nr_running and throttled
    sched: Implement hierarchical task accounting for SCHED_OTHER - 953bfcd10e6f3697233e8e5128c611d275da39c1
    https://groups.google.com/forum/#!topic/linux.kernel/gRzxHclMy50
    ‘root’
        \
        ‘A’
       /  \
      t1   t2
    root.nr_running := 2
    root.h_nr_running := 2
    Check enqueue_task_fair()

  6. idle
    https://www.kernel.org/doc/Documentation/scheduler/sched-arch.txt
    improve SMP reschedule and idle routines
    TIF_POLLING_NRFLAG -> Need-Resched-Flag?

  7. process migration
    e761b7725234276a802322549cee5255305a0930
    Introduce cpu_active_map and redo sched domain managment
    When migration happens:
    sched_setaffinity / __set_cpus_allowed_ptr manually
    Selecting a new CPU when waking up a sleeper
    For balancing, selecting a CPU when waking up a new process in _do_fork
    execve’s sched_exec

  8. scheduler clock
    rq->clock is in nanoseconds?

  9. clock_task and wraps
    fe44d62122829959e960bc699318d58966922a69

  • START_DEBIT
    no standalone commit
    bf0f6f24a1ece8988b243aefe84ee613099a9245
  • why ahead?
    /*
     * Place new tasks ahead so that they do not starve already running
     * tasks
     */
    SCHED_FEAT(START_DEBIT, true)
    the tree is named timeline
  • Improving scheduler latency
  • skip next last buddy


Git log

e9c8431185d6c406887190519f6dbdd112641686
TASK_WAKING; see migrate_task_rq_fair and try_to_wake_up
88ec22d3edb72b261f8628226cd543589a6d5e1b
In order to remove the cfs_rq dependency from set_task_cpu() we need to ensure the task is cfs_rq invariant for all callsites.
2f950354e6d535b892f133d20bd6a8b09430424c
sched/fair: Fix fairness issue on migration
Migrated CFS task getting an unfair advantage
30cfdcfc5f180fc21a3dad6ae3b7b2a9ee112186
curr was not kept in rb-tree

Load balancing

Scheduling domains
set sd
kernel_init_freeable->
sched_init_smp->
init_sched_domains->build_sched_domains->
visit_domain_allocation_hell()->sdt_alloc() allocates sdd->sg, which is later used when building the groups;
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(), ...) - the allocation also covers the trailing cpumask
/* Build the groups for the domains */
detach_destroy_domains
cpu_attach_domain

CONFIG_SCHED_MC=y
static noinline struct sched_domain *
sd_init_##type(struct sched_domain_topology_level *tl, int cpu)
{
        struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
        *sd = SD_##type##_INIT;
        SD_INIT_NAME(sd, type);
        sd->private = &tl->data;
        return sd;
}
tl->mask(cpu)
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
{ sd_init_SIBLING, cpu_smt_mask, },
#endif
#ifdef CONFIG_SCHED_MC
{ sd_init_MC, cpu_coregroup_mask, },
#endif
#ifdef CONFIG_SCHED_BOOK
{ sd_init_BOOK, cpu_book_mask, },
#endif
{ sd_init_CPU, cpu_cpu_mask, },
{ NULL, },
};
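
Once cpu_attach_domain() has installed the hierarchy, it can be walked bottom-up per CPU, which is what the load balancer does. A sketch (dump_sched_domains_sketch() is an illustrative name; sd->name needs CONFIG_SCHED_DEBUG):

static void dump_sched_domains_sketch(int cpu)
{
        struct sched_domain *sd;

        rcu_read_lock();
        for_each_domain(cpu, sd)        /* child -> parent, e.g. SMT -> MC -> CPU */
                pr_info("cpu%d: %s span=%*pbl\n", cpu, sd->name,
                        cpumask_pr_args(sched_domain_span(sd)));
        rcu_read_unlock();
}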

Leaf CFS runqueues leaf_cfs_rq

First

commit 6aa645ea5f7a246702e07f29edc7075d487ae4a3
Refs: v2.6.22-14-g6aa645ea5f7a
Author: Ingo Molnar mingo@elte.hu
AuthorDate: Mon Jul 9 18:51:58 2007 +0200
Commit: Ingo Molnar mingo@elte.hu
CommitDate: Mon Jul 9 18:51:58 2007 +0200
sched: cfs rq data types
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
* (like users, containers etc.)
* leaf_cfs_rq_list ties together list of leaf cfs_rq’s in a cpu. This
* list is used during load balance.
Head of list: rq->leaf_cfs_rq_list

Core load_balance_fair

commit bf0f6f24a1ece8988b243aefe84ee613099a9245
Refs: v2.6.22-10-gbf0f6f24a1ec
Author: Ingo Molnar mingo@elte.hu
AuthorDate: Mon Jul 9 18:51:58 2007 +0200
Commit: Ingo Molnar mingo@elte.hu
CommitDate: Mon Jul 9 18:51:58 2007 +0200
sched: cfs core, kernel/sched_fair.c
add kernel/sched_fair.c - which implements the bulk of CFS’s
behavioral changes for SCHED_OTHER tasks.
+load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+ for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {

make parent appear after us.

commit 67e86250f8ea7b8f7da53ac25ea73c6bd71f5cd9
Author: Paul Turner pjt@google.com
Date: Mon Nov 15 15:47:05 2010 -0800
sched: Introduce hierarchal order on shares update list
Avoid duplicate shares update calls by ensuring children always appear before
parents in rq->leaf_cfs_rq_list. # leaf's meaning is changed
This allows us to do a single in-order traversal for update_shares().
Since we always enqueue in bottom-up order this reduces to 2 cases:
1) Our parent is already in the list, e.g.
root
  \
   b
  / \
 c   d* (root->b->c already enqueued)
Since d’s parent is enqueued we push it to the head of the list, implicitly ahead of b.
2) Our parent does not appear in the list (or we have no parent)
In this case we enqueue to the tail of the list, if our parent is subsequently enqueued
(bottom-up) it will appear to our right by the same rule.

tmp_alone_branch

commit 9c2791f936ef5fd04a118b5c284f2c9a95f4a647
Refs: v4.9-rc5-195-g9c2791f936ef
Author: Vincent Guittot vincent.guittot@linaro.org
AuthorDate: Tue Nov 8 10:53:43 2016 +0100
Commit: Ingo Molnar mingo@kernel.org
CommitDate: Wed Nov 16 10:29:08 2016 +0100
sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list
Fix the insertion of cfs_rq in rq->leaf_cfs_rq_list to ensure that a
child will always be called before its parent.
The hierarchical order in shares update list has been introduced by
commit:
67e86250f8ea (“sched: Introduce hierarchal order on shares update list”)

commit 5d299eabea5a251fbf66e8277704b874bbba92dc
Author: Peter Zijlstra peterz@infradead.org
Date: Wed Jan 30 14:41:04 2019 +0100
sched/fair: Add tmp_alone_branch assertion
The magic in list_add_leaf_cfs_rq() requires that at the end of
enqueue_task_fair():
rq->tmp_alone_branch == &rq->leaf_cfs_rq_list

load_balance_fair - removed

commit 9763b67fb9f3050c6da739105888327587c30c4d
Refs: v3.0-rc7-197-g9763b67fb9f3
Author: Peter Zijlstra a.p.zijlstra@chello.nl
AuthorDate: Wed Jul 13 13:09:25 2011 +0200
Commit: Ingo Molnar mingo@elte.hu
CommitDate: Thu Jul 21 18:01:46 2011 +0200
sched, cgroup: Optimize load_balance_fair()
Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this
achieves that load_balance_fair() only iterates those task_groups that
actually have tasks on busiest, and that we iterate bottom-up, trying to
move light groups before the heavier ones.

Throttling entities

commit 85dac906bec3bb41bfaa7ccaa65c4706de5cfdf8
Author: Paul Turner pjt@google.com
Date: Thu Jul 21 09:43:33 2011 -0700
sched: Add support for throttling group entities
Now that consumption is tracked (via update_curr()) we add support to throttle
group entities (and their corresponding cfs_rqs) in the case where this is no
run-time remaining.
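
The core of the idea, as a sketch (simplified; the real __account_cfs_rq_runtime() also tries to refill from the task group's quota pool before forcing a reschedule):

/* update_curr() charges the group's cfs_rq for the time it just ran;
 * once the local runtime is exhausted, curr is resched'ed and the group
 * entity gets dequeued (throttled) until the period timer refills it. */
static void account_cfs_rq_runtime_sketch(struct cfs_rq *cfs_rq, u64 delta_exec)
{
        cfs_rq->runtime_remaining -= delta_exec;
        if (cfs_rq->runtime_remaining <= 0 && cfs_rq->curr)
                resched_curr(rq_of(cfs_rq));    /* leads to throttle_cfs_rq() */
}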