linux-kernel - Re: [RFC PATCH v2 2/3] docs: scheduler: Add scheduler overview documentation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJz2qX=P7b_Yv6yBhaqQmW7bsP0uEscC-mZRON5s30Cbgya7OA@mail.gmail.com>
Date:   Thu, 7 May 2020 21:01:16 +0300
From:   John Mathew <john.mathew@...kie.com>
To:     Randy Dunlap <rdunlap@...radead.org>
Cc:     linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        corbet@....net, mingo@...hat.com, peterz@...radead.org,
        juri.lelli@...hat.com,
        Vincent Guittot <vincent.guittot@...aro.org>,
        dietmar.eggemann@....com, rostedt@...dmis.org,
        Benjamin Segall <bsegall@...gle.com>, mgorman@...e.de,
        bristot@...hat.com, tsbogend@...ha.franken.de,
        Lukas Bulwahn <lukas.bulwahn@...il.com>, x86@...nel.org,
        linux-mips@...r.kernel.org, tglx@...utronix.de,
        mostafa.chamanara@...il.com,
        Oleg Tsymbal <oleg.tsymbal@...kie.com>, willy@...radead.org,
        valentin.schneider@....com,
        Mostafa Chamanara <mostafa.chamanara@...emark.com>
Subject: Re: [RFC PATCH v2 2/3] docs: scheduler: Add scheduler overview documentation

On Thu, May 7, 2020 at 6:41 AM Randy Dunlap <rdunlap@...radead.org> wrote:
>
> Hi--
>
> On 5/6/20 7:39 AM, john mathew wrote:
> > From: John Mathew <john.mathew@...kie.com>
> >
> > Add documentation for
> >  -scheduler overview
> >  -scheduler state transtion
> >  -CFS overview
> >  -scheduler data structs
> >
> > Add rst for scheduler APIs and modify sched/core.c
> > to add kernel-doc comments.
> >
> > Suggested-by: Lukas Bulwahn <lukas.bulwahn@...il.com>
> > Co-developed-by: Mostafa Chamanara <mostafa.chamanara@...emark.com>
> > Signed-off-by: Mostafa Chamanara <mostafa.chamanara@...emark.com>
> > Co-developed-by: Oleg Tsymbal <oleg.tsymbal@...kie.com>
> > Signed-off-by: Oleg Tsymbal <oleg.tsymbal@...kie.com>
> > Signed-off-by: John Mathew <john.mathew@...kie.com>
> > ---
> >  Documentation/scheduler/cfs-overview.rst      | 110 +++++++
> >  Documentation/scheduler/index.rst             |   3 +
> >  Documentation/scheduler/overview.rst          | 269 ++++++++++++++++++
> >  .../scheduler/sched-data-structs.rst          | 253 ++++++++++++++++
> >  Documentation/scheduler/scheduler-api.rst     |  30 ++
> >  kernel/sched/core.c                           |  28 +-
> >  kernel/sched/sched.h                          | 169 ++++++++++-
> >  7 files changed, 855 insertions(+), 7 deletions(-)
> >  create mode 100644 Documentation/scheduler/cfs-overview.rst
> >  create mode 100644 Documentation/scheduler/sched-data-structs.rst
> >  create mode 100644 Documentation/scheduler/scheduler-api.rst
> >
> > Request review from Valentin Schneider <valentin.schneider@....com>
> > for the section describing Scheduler classes in:
> > .../scheduler/sched-data-structs.rst
> >
> > diff --git a/Documentation/scheduler/cfs-overview.rst b/Documentation/scheduler/cfs-overview.rst
> > new file mode 100644
> > index 000000000000..50d94b9bb60e
> > --- /dev/null
> > +++ b/Documentation/scheduler/cfs-overview.rst
> > @@ -0,0 +1,110 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +=============
> > +CFS Overview
> > +=============
> > +
> > +Linux 2.6.23 introduced a modular scheduler core and a Completely Fair
> > +Scheduler (CFS) implemented as a scheduling module. A brief overview of the
> > +CFS design is provided in :doc:`sched-design-CFS`
> > +
> > +In addition there have been many improvements to the CFS, a few of which are
> > +
> > +**Thermal Pressure**:
> > +cpu_capacity initially reflects the maximum possible capacity of a CPU.
> > +Thermal pressure on a CPU means this maximum possible capacity is
> > +unavailable due to thermal events. Average thermal pressure for a CPU
> > +is now subtracted from its maximum possible capacity so that cpu_capacity
> > +reflects the remaining maximum capacity.
> > +
> > +**Use Idle CPU for NUMA balancing**:
> > +Idle CPU is used as a migration target instead of comparing tasks.
> > +Information on an idle core is cached while gathering statistics
> > +and this is used to avoid a second scan of the node runqueues if load is
> > +not imbalanced. Preference is given to an idle core rather than an
> > +idle SMT sibling to avoid packing HT siblings due to linearly scanning
> > +the node cpumask.  Multiple tasks can attempt to select and idle CPU but
> > +fail, in this case instead of failing, an alternative idle CPU scanned.
>
> I'm having problems parsing that last sentence above.
Fixed as follows in v3:
Multiple tasks can attempt to select an idle CPU but
fail because a NUMA balance is active on that CPU, in this case instead of
failing, an alternative idle CPU scanned.
>
> > +
> > +**Asymmetric CPU capacity wakeup scan**:
> > +Previous assumption that CPU capacities within an SD_SHARE_PKG_RESOURCES
> > +domain (sd_llc) are homogeneous didn't hold for newer generations of big.LITTLE
> > +systems (DynamIQ) which can accommodate CPUs of different compute capacity
> > +within a single LLC domain. A new idle sibling helper function was added
> > +which took CPU capacity in to account. The policy is to pick the first idle
>
>                            into
Fixed in v3.
>
> > +CPU which is big enough for the task (task_util * margin < cpu_capacity).
>
> why not <= ?
This is how it is implemented in fair.c
/*
* The margin used when comparing utilization with CPU capacity.
*
* (default: ~20%)
*/
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>
> > +If no idle CPU is big enough, the idle CPU with the highest capacity was
>
> s/was/is/
Fixed in v3.
>
> > +picked.
> > +
> > +**Optimized idle core selection**:
> > +Previously all threads of a core were looped through to evaluate if the
> > +core is idle or not. This was unnecessary. If a thread of a core is not
> > +idle, skip evaluating other threads of a core. Also while clearing the
> > +cpumask, bits of all CPUs of a core can be cleared in one-shot.
>
>                                                       in one shot.
Fixed in v3.
>
> > +
> > +**Load balance aggressively for SCHED_IDLE CPUs**:
> > +The fair scheduler performs periodic load balance on every CPU to check
> > +if it can pull some tasks from other busy CPUs. The duration of this
> > +periodic load balance is set to scheduler domain's balance_interval and
> > +multiplied by a busy_factor (set to 32 by default) for the busy CPUs. This
> > +multiplication is done for busy CPUs to avoid doing load balance too
> > +often and rather spend more time executing actual task. While that is
> > +the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH
> > +tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE
> > +tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE
> > +CPUs, it is now preferred to enqueue a newly-woken task to a SCHED_IDLE
> > +CPU instead of other busy or idle CPUs. The same reasoning is applied
> > +to the load balancer as well to make it migrate tasks more aggressively
> > +to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the
> > +migrated (SCHED_OTHER) tasks. Fair scheduler now does the next
> > +load balance soon after the last non SCHED_IDLE task is dequeued from a
>
>                                     non-SCHED_IDLE
Fixed in v3.
>
> > +runqueue, i.e. making the CPU SCHED_IDLE.
> > +
> > +**Load balancing algorithm Reworked**:
> > +The load balancing algorithm contained some heuristics which became
> > +meaningless since the rework of the scheduler's metrics like the
> > +introduction of PELT. The new load balancing algorithm fixes several
> > +pending wrong tasks placement
> > +- the 1 task per CPU case with asymmetric system
> > +- the case of cfs task preempted by other class
>
> s/cfs/CFS/
Fixed in v3.
>
> > +- the case of tasks not evenly spread on groups with spare capacity
>
> Can you make that (above) a proper ReST list?
>
> > +Also the load balance decisions have been consolidated in the 3 separate
> > +functions
>
> end with '.' period.
Fixed in v3.
>
> > +
> > +**Energy-aware wake-ups speeded up**:
> > +EAS computes the energy impact of migrating a waking task when deciding
> > +on which CPU it should run. However, the previous approach had high algorithmic
> > +complexity, which can resulted in prohibitively high wake-up latencies on
>
>                drop: can
> or say         which can result
Fixed in v3.
>
> > +systems with complex energy models, such as systems with per-CPU DVFS. On
> > +such systems, the algorithm complexity was O(n^2). To address this,
> > +the EAS wake-up path was re-factored to compute the energy 'delta' on a
> > +per-performance domain basis, rather than system-wide, which brings the
> > +complexity down to O(n).
> > +
> > +**Selection of an energy-efficient CPU on task wake-up**:
> > +If an Energy Model (EM) is available and if the system isn't overutilized,
> > +waking tasks are re-routed into an energy-aware placement algorithm.
> > +The selection of an energy-efficient CPU for a task is achieved by estimating
> > +the impact on system-level active energy resulting from the placement of the
> > +task on the CPU with the highest spare capacity in each performance domain.
> > +This strategy spreads tasks in a performance domain and avoids overly
> > +aggressive task packing. The best CPU energy-wise is then selected if it
> > +saves a large enough amount of energy with respect to prev_cpu.
> > +
> > +**Consider misfit tasks when load-balancing**:
> > +On asymmetric CPU capacity systems load intensive tasks can end up on
> > +CPUs that don't suit their compute demand. In this scenarios 'misfit'
>
>                                                       scenario
Fixed in v3.
>
> > +tasks are migrated to CPUs with higher compute capacity to ensure better
> > +throughput. A new group_type: group_misfit_task is added and indicates this
> > +scenario. Tweaks to the load-balance code are done to make the migrations
> > +happen. Misfit balancing is done between a source group of lower per-CPU
> > +capacity and destination group of higher compute capacity. Otherwise, misfit
> > +balancing is ignored.
> > +
> > +**Make schedstats a runtime tunable that is disabled by default**:
> > +schedstats is very useful during debugging and performance tuning but it
> > +incurred overhead to calculate the stats. A kernel command-line and sysctl
> > +tunable was added to enable or disable schedstats on demand (when it's built in).
> > +It is disabled by default. The benefits are dependent on how
> > +scheduler-intensive the workload is.
> > +
> > diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> > index ede1a30a6894..b952970d3565 100644
> > --- a/Documentation/scheduler/index.rst
> > +++ b/Documentation/scheduler/index.rst
> > @@ -17,10 +17,13 @@ specific implementation differences.
> >      :maxdepth: 2
> >
> >      overview
> > +    sched-data-structs
> > +    cfs-overview
> >      sched-design-CFS
> >      sched-features
> >      arch-specific.rst
> >      sched-debugging.rst
> > +    scheduler-api.rst
>
> Why do some of these end with ".rst" and others don't?
Removed the .rst for all the files in the index in v3.
>
> >
> >  .. only::  subproject and html
> >
> > diff --git a/Documentation/scheduler/overview.rst b/Documentation/scheduler/overview.rst
> > index aee16feefc61..284d6cf0b2f8 100644
> > --- a/Documentation/scheduler/overview.rst
> > +++ b/Documentation/scheduler/overview.rst
> > @@ -3,3 +3,272 @@
> >  ====================
> >  Scheduler overview
> >  ====================
> > +
> > +Linux kernel implements priority based scheduling. More than one process are
>
>                            priority-based
Fixed in v3.
>
> > +allowed to run at any given time and each process is allowed to run as if it
> > +were the only process on the system. The process scheduler coordinates which
> > +process runs when. In that context, it has the following tasks:
> > +
> > +- share CPU cores equally among all currently running processes
> > +- pick appropriate process to run next if required, considering scheduling
> > +  class/policy and process priorities
> > +- balance processes between multiple cores in SMP systems
> > +
> > +The scheduler attempts to be responsive for I/O bound processes and efficient
> > +for CPU bound processes. The scheduler also applies different scheduling
> > +policies for real time and normal processes based on their respective
> > +priorities.  Higher priorities in the kernel have a numerical smaller
> > +value. Real time priorities range from 1 (highest) – 99 whereas normal
> > +priorities range from 100 – 139 (lowest). SCHED_DEADLINE tasks has negative
>
>
Fixed in v3.
        have
>
> > +priorities, reflecting the fact that any of them has higher priority than
> > +RT and NORMAL/BATCH tasks.
> > +
> > +Process Management
> > +==================
> > +
> > +Each process in the system is represented by :c:type:`struct task_struct
> > +<task_struct>`. When a process/thread is created, the kernel allocates a
> > +new task_struct for it. The kernel then stores this task_struct in a RCU
>
>                                                                       an RCU
Fixed in v3.
>
> > +list.  Macro next_task() allow a process to obtain its next task and
>
>                             allows
>
> > +for_each_process() macro enables traversal of the list.
> > +
> > +Frequently used fields of the task struct are:
> > +
> > +| *state:*  The running state of the task. The possible states are:
> > +
> > +- TASK_RUNNING: The task is currently running or in a run queue waiting
> > +  to run.
> > +- TASK_INTERRUPTIBLE: The task is sleeping waiting for some event to occur.
> > +  This task can be interrupted by signals. On waking up the task transitions
> > +  to TASK_RUNNING.
> > +- TASK_UNINTERRUPTIBLE: Similar to TASK_INTERRUPTIBLE but does not wake
> > +  up on signals. Needs an explicit wake-up call to be woken up.  Contributes
> > +  to loadavg.
> > +- __TASK_TRACED: Task is being traced by another task like a debugger.
> > +- __TASK_STOPPED: Task execution has stopped and not eligible to run.
> > +  SIGSTOP, SIGTSTP etc causes this state.  The task can be continued by
> > +  the signal SIGCONT.
> > +- TASK_PARKED: State to support kthread parking/unparking.
> > +- TASK_DEAD: If a task dies, then it sets TASK_DEAD in tsk->state and calls
> > +  schedule one last time. The schedule call will never return.
> > +- TASK_WAKEKILL: It works like TASK_UNINTERRUPTIBLE with the bonus that it
> > +  can respond to fatal signals.
> > +- TASK_WAKING: To handle concurrent waking of the same task for SMP.
> > +  Indicates that someone is already waking the task.
> > +- TASK_NOLOAD:  To be used along with TASK_UNINTERRUPTIBLE to indicate
> > +  an idle task which does not contribute to loadavg.
> > +- TASK_NEW: Set during fork(), to guarantee that no one will run the task,
> > +  a signal or any other wake event cannot wake it up and insert it on
> > +  the runqueue.
> > +
> > +| *exit_state* : The exiting state of the task. The possible states are:
> > +
> > +- EXIT_ZOMBIE:  The task is terminated and waiting for parent to collect
> > +  the exit information of the task.
> > +- EXIT_DEAD:  After collecting the exit information the task is put to
> > +  this state and removed from the system.
> > +
> > +| *static_prio:*  Nice value of a task. The value of this field does
> > +   not change.  Value ranges from -20 to 19.  This value is mapped
> > +   to nice value and used in the scheduler.
> > +
> > +| *prio:*  Dynamic priority of a task. Previously a function of static
> > +   priority and tasks interactivity. Value not used by CFS scheduler but used
> > +   by the rt scheduler. Might be boosted by interactivity modifiers.  Changes
>
>              RT
>
> > +   upon fork, setprio syscalls, and whenever the interactivity estimator
> > +   recalculates.
> > +
> > +| *normal_prio:* Expected priority of a task.  The value of static_prio
> > +   and normal_prio are the same for non real time processes.  For real time
>
>                                        non-real-time
>
> > +   processes value of prio is used.
> > +
> > +| *rt_priority:* Field used by real time tasks. Real time tasks are
> > +   prioritized based on this value.
> > +
> > +| *sched_class:* Pointer to sched_class CFS structure.
> > +
> > +| *sched_entity:* Pointer to sched_entity CFS structure.
> > +
> > +| *policy:* Value for scheduling policy.  The possible values are:
> > +
> > +* SCHED_NORMAL: Regular tasks use this policy.
> > +
> > +* SCHED_BATCH: Tasks which need to run longer without pre-emption
>
> overwhelmingly the kernel spells this as                 preemption
Fixed in all places in v3.
>
> > +  use this policy. Suitable for batch jobs.
> > +
> > +* SCHED_IDLE:  Policy used by background tasks.
> > +
> > +* SCHED_FIFO & SCHED_RR: These policies for real time tasks.  Handled
> > +  by real time scheduler.
> > +
> > +* SCHED_DEADLINE: Tasks which are activated on a periodic or sporadic fashion
> > +  use this policy. This policy implements the Earliest Deadline First (EDF)
> > +  scheduling algorithm. This policy is explained in detail in the
> > +  :doc:`sched-deadline` documentation.
> > +
> > +| *nr_cpus_allowed:*  Bit field containing tasks affinity towards a set of
> > +   cpu cores.  Set using sched_setaffinity() system call.
>
>       CPU
Fixed in all places in v3.
>
> > +
> > +New processes are created using the fork() system call which is described
> > +at manpage :manpage:`FORK(2)` or the clone system call described at
> > +:manpage:`CLONE(2)`.
> > +Users can create threads within a process to achieve parallelism. Threads
> > +share address space, open files and other resources of the process. Threads
> > +are created like normal tasks with their unique task_struct, but the clone()
>
>                                                                 but clone()
>
> > +is provided with flags that enable the sharing of resources such as address
> > +space ::
> > +
> > +     clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
> > +
> > +The scheduler schedules task_structs so from scheduler perspective there is
> > +no difference between threads and processes. Threads are created using
> > +the system call pthread_create described at :manpage:`PTHREAD_CREATE(3)`
> > +POSIX threads creation is described at :manpage:`PTHREADS(7)`
> > +
> > +The Scheduler Entry Point
> > +=========================
> > +
> > +The main scheduler entry point is an architecture independent schedule()
> > +function defined in kernel/sched.c. Its objective is to find a process in
> > +the runqueue list and then assign the CPU to it. It is invoked, directly
> > +or in a lazy(deferred) way from many different places in the kernel. A lazy
>
>            lazy (deferred)
Fixed in v3.
>
> > +invocation does not call the function by its name, but gives the kernel a
> > +hint by setting a flag TIF_NEED_RESCHED. The flag is a message to the kernel
> > +that the scheduler should be invoked as soon as possible because another
> > +process deserves to run.
> > +
> > +Following are some places that notify the kernel to schedule:
> > +
> > +* scheduler_tick()
> > +
> > +* Running task goes to sleep state : Right before a task goes to sleep,
> > +  schedule() will be called to pick the next task to run and the change
> > +  its state to either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.  For
> > +  instance, prepare_to_wait() is one of the functions that makes the
> > +  task go to the sleep state.
> > +
> > +* try_to_wake_up()
> > +
> > +* yield()
> > +
> > +* wait_event()
> > +
> > +* cond_resched() : It gives the scheduler a chance to run a
> > +  higher-priority process
>
> end with '.' period.
Fixed in v3.
>
> > +
> > +* cond_resched_lock() : If a reschedule is pending, drop the given lock,
> > +  call schedule, and on return reacquire the lock.
> > +
> > +* do_task_dead()
> > +
> > +* preempt_schedule() : The function checks whether local interrupts are
> > +  enabled and the preempt_count field of current is zero; if both
> > +  conditions are true, it invokes schedule() to select another process
> > +  to run.
> > +
> > +* preempt_schedule_irq()
> > +
> > +Calling functions mentioned above leads to a call to __schedule(), note
>
>                                                         __schedule(). Note
>
> > +that preemption must be disabled before it is called and enabled after
> > +the call using preempt_disable and preempt_enable functions family.
> > +
> > +
> > +The steps during invocation are:
> > +--------------------------------
> > +1. Disable pre-emption to avoid another task pre-empting the scheduling
>
>               preemption                        preempting
>
> > +   thread itself.
> > +2. Retrieve the runqueue of current processor and its lock is obtained to
> > +   allow only one thread to modify the runqueue at a time.
> > +3. The state of the previously executed task when the schedule()
> > +   was called is examined.  If it is not runnable and has not been
> > +   pre-empted in kernel mode, it is removed from the runqueue. If the
>
>       preempted
>
> > +   previous task has non-blocked pending signals, its state is set to
> > +   TASK_RUNNING and left in the runqueue.
> > +4. Scheduler classes are iterated and the corresponding class hook to
> > +   pick the next suitable task to be scheduled on the CPU is called.
> > +   Since most tasks are handled by the sched_fair class, a short cut to this
>
>                                                               shortcut
>
> > +   class is implemented in the beginning of the function.
> > +5. TIF_NEED_RESCHED and architecture specific need_resched flags are cleared.
> > +6. If the scheduler class picks a different task from what was running
> > +   before, a context switch is performed by calling context_switch().
> > +   Internally, context_switch() switches to the new task's memory map and
> > +   swaps the register state and stack. If scheduler class picked the same
> > +   task as the previous task, no task switch is performed and the current
> > +   task keeps running.
> > +7. Balance callback list is processed. Each scheduling class can migrate tasks
> > +   between CPU's to balance load. These load balancing operations are queued
>
>               CPUs
>
> > +   on a Balance callback list which get executed when the balance_callback()
>
> either                                              when balance_callback()
> or                                                  when the balanace_callback() function
Fixed in v3.
>
> > +   is called.
> > +8. The runqueue is unlocked and pre-emption is re-enabled. In case
>
>                                    preemption
>
> > +   pre-emption was requested during the time in which it was disabled,
>
>       preemption
>
> > +   schedule() is run again right away.
> > +
> > +Scheduler State Transition
> > +==========================
> > +
> > +A very high level scheduler state transition flow with a few states can
> > +be depicted as follows. ::
> > +
> > +                                       *
> > +                                       |
> > +                                       | task
> > +                                       | forks
> > +                                       v
> > +                        +------------------------------+
> > +                        |           TASK_NEW           |
> > +                        |        (Ready to run)        |
> > +                        +------------------------------+
> > +                                       |
> > +                                       |
> > +                                       v
> > +                     +------------------------------------+
> > +                     |            TASK_RUNNING            |
> > +   +---------------> |           (Ready to run)           | <--+
> > +   |                 +------------------------------------+    |
> > +   |                   |                                       |
> > +   |                   | schedule() calls context_switch()     | task is pre-empted
>
>                                                                             preempted
Fixed in v3.
>
> > +   |                   v                                       |
> > +   |                 +------------------------------------+    |
> > +   |                 |            TASK_RUNNING            |    |
> > +   |                 |             (Running)              | ---+
> > +   | event occurred  +------------------------------------+
> > +   |                   |
> > +   |                   | task needs to wait for event
> > +   |                   v
> > +   |                 +------------------------------------+
> > +   |                 |         TASK_INTERRUPTIBLE         |
> > +   |                 |        TASK_UNINTERRUPTIBLE        |
> > +   +-----------------|           TASK_WAKEKILL            |
> > +                     +------------------------------------+
> > +                                       |
> > +                                       | task exits via do_exit()
> > +                                       v
> > +                        +------------------------------+
> > +                        |          TASK_DEAD           |
> > +                        |         EXIT_ZOMBIE          |
> > +                        +------------------------------+
> > +
> > +
> > +Scheduler provides trace points tracing all major events of the scheduler.
> > +The tracepoints are defined in ::
>
> Can the document be consistent with (2 lines above:) "trace points" and
> (1 line above) "tracepoints"?
Fixed to tracepoints in v3.
>
> > +
> > +  include/trace/events/sched.h
> > +
> > +Using these treacepoints it is possible to model the scheduler state transition
>
>                spello
>
> > +in an automata model. The following journal paper discusses such modeling:
> > +
> > +Daniel B. de Oliveira, Rômulo S. de Oliveira, Tommaso Cucinotta, **A thread
> > +synchronization model for the PREEMPT_RT Linux kernel**, *Journal of Systems
> > +Architecture*, Volume 107, 2020, 101729, ISSN 1383-7621,
> > +https://doi.org/10.1016/j.sysarc.2020.101729.
> > +
> > +To model the scheduler efficiently the system was divided in to generators
> > +and specifications. Some of the generators used were "need_resched",
> > +"sleepable" and "runnable", "thread_context" and "scheduling context".
> > +The specifications are the necessary and sufficient conditions to call
> > +the scheduler.       New trace events were added to specify the generators
>
> Change tab above to space.
Fixed in v3
>
> > +and specifications.  In case a kernel event referred to more then one
> > +event,extra fields of the kernel event was used to distinguish between
>
>    event, extra
>
> > +automation events.  The final model was done parallel composition of all
>
> eh? parse error.
Fixed in v3.
>
> > +generators and specifications composed of 15 events, 7 generators and
> > +10 specifications.  This resulted in 149 states and 327 transitions.
> > diff --git a/Documentation/scheduler/sched-data-structs.rst b/Documentation/scheduler/sched-data-structs.rst
> > new file mode 100644
> > index 000000000000..52fe95140a8f
> > --- /dev/null
> > +++ b/Documentation/scheduler/sched-data-structs.rst
> > @@ -0,0 +1,253 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +=========================
> > +Scheduler Data Structures
> > +=========================
> > +
> > +The main parts of the Linux scheduler are:
> > +
> > +Runqueue
> > +~~~~~~~~
> > +
> > +:c:type:`struct rq <rq>` is the central data structure of process
> > +scheduling. It keeps track of tasks that are in a runnable state assigned
> > +for a particular processor. Each CPU has its own run queue and stored in a
> > +per CPU array::
> > +
> > +    DEFINE_PER_CPU(struct rq, runqueues);
> > +
> > +Access to the queue requires locking and lock acquire operations must be
> > +ordered by ascending runqueue. Macros for accessing and locking the runqueue
> > +is provided in::
>
>    are provided
>
> > +
> > +    kernel/sched/sched.h
> > +
> > +The runqueue contains scheduling class specific queues and several scheduling
> > +statistics.
> > +
> > +Scheduling entity
> > +~~~~~~~~~~~~~~~~~
> > +Scheduler uses scheduling entities which contain
> > +sufficient information to actually accomplish the scheduling job of a
> > +task or a task-group. The scheduling entity may be a group of tasks or a
> > +single task.  Every task is associated with a sched_entity structure. CFS
> > +adds support for nesting of tasks and task groups. Each scheduling entity
> > +may be run from its parents runqueue. The scheduler traverses the
> > +sched_entity hierarchy to pick the next task to run on
> > +the cpu.  The entity gets picked up from the cfs_rq on which it is queued
>
>        CPU.
>
> > +and its time slice is divided among all the tasks on its my_q.
> > +
> > +Virtual Runtime
> > +~~~~~~~~~~~~~~~~~
> > +Virtual Run Time or vruntime is the amount of time a task has spent running
> > +on the cpu. It is updated periodically by scheduler_tick(). Tasks are stored
>
>           CPU.
>
> > +in the CFS scheduling class rbtree sorted by vruntime. scheduler_tick() calls
> > +corresponding hook of CFS which first updates the runtime statistics of the
> > +currently running task and checks if the current task needs to be pre-empted.
>
>                                                                      preempted.
>
> > +vruntime of the task based on the formula ::
> > +
> > +    vruntime += delta_exec * (NICE_0_LOAD/curr->load.weight);
> > +
> > +where:
> > +
> > +* delta_exec is the time spent by the task since the last time vruntime
> > +  was updated.
>
> What unit is the time in?
Fixed to nanoseconds in v3.
>
> > +* NICE_0_LOAD is the load of a task with normal priority.
> > +* curr is the shed_entity instance of the cfs_rq struct of the currently
> > +  running task.
> > +* load.weight: sched_entity load_weight.  load_weight is the encoding of
> > +  the tasks priority and vruntime. The load of a task is the metri
>
>                                                                 metric
>
> > +  indicating the number of CPUs needed to make satisfactory progress on its
> > +  job. Load of a task influences the time a task spends on the cpu and also
>
>                                                                   CPU
>
> > +  helps to estimate the overall cpu load which is needed for load balancing.
>
>                                    CPU
>
> > +  Priority of the task is not enough for the scheduler to estimate the
> > +  vruntime of a process. So priority value must be mapped to the capacity of
> > +  the standard cpu which is done in the array :c:type:`sched_prio_to_weight[]`.
>
>                   CPU
>
> > +  The array contains mappings for the nice values from -20 to 19. Nice value
> > +  0 is mapped to 1024. Each entry advances by ~1.25 which means if for every
>
> Please use "about" or "approximately" etc. instead of "~" (if that is what is meant here).
Fixed to approximately in v3.
>
> > +  increment in nice value the task gets 10% less cpu and vice versa.
>
>                                                     CPU
>
> > +
> > +Scheduler classes
> > +~~~~~~~~~~~~~~~~~
> > +It is an extensible hierarchy of scheduler modules. The
> > +modules encapsulate scheduling policy details.
> > +They are called from the core code which is independent. Scheduling classes
> > +are implemented through the sched_class structure. dl_sched_class,
> > +fair_sched_class and rt_sched_class class are implementations of this class.
> > +
> > +The important methods of scheduler class are:
> > +
> > +enqueue_task and dequeue_task
> > +    These functions are used to put and remove tasks from the runqueue
> > +    respectively. The function takes the runqueue, the task which needs to
> > +    be enqueued/dequeued and a bit mask of flags. The main purpose of the
> > +    flags describe why the enqueue or dequeue is being called.
>
>        flags is to describe why
>
> > +    The different flags used are described in ::
> > +
> > +        kernel/sched/sched.h
> > +
> > +    enqueue_task and dequeue_task is called for following purposes.
>
>                                      are called
>
> > +Fixed in v3.
> > +    - When waking a newly created task for the first time. Called with
> > +      ENQUEUE_NOCLOCK
> > +    - When migrating a task from one CPU's runqueue to another. Task will be
> > +      first dequeued from its old runqueue, new cpu will be added to the
>
>                                                    CPU
>
> > +      task struct,  runqueue of the new CPU will be retrieved and task is
> > +      then enqueued on this new runqueue.
> > +    - When do_set_cpus_allowed() is called to change a tasks CPU affinity. If
> > +      the task is queued on a runqueue, it is first dequeued with the
> > +      DEQUEUE_SAVE and DEQUEUE_NOCLOCK flags set. The set_cpus_allowed()
> > +      function of the corresponding scheduling class will be called.
> > +      enqueue_task() is then called with ENQUEUE_RESTORE and ENQUEUE_NOCLOCK
> > +      flags set.
> > +    - When changing the priority of a task using rt_mutex_setprio(). This
> > +      function implements the priority inheritance logic of the rt mutex
>
>                                                        preferably: RT
>
> > +      code. This function changes the effective priority of a task which may
> > +      inturn change the scheduling class of the task. If so enqueue_task is
>
>          in turn
>
> > +      called with flags corresponding to each class.
> > +    - When user changes the nice value of the task. If the task is queued on
> > +      a runqueue, it first needs to be dequeued, then its load weight and
> > +      effective priority needs to be set. Following which the task is
> > +      enqueued with ENQUEUE_RESTORE and ENQUEUE_NOCLOCK flags set.
> > +    - When __sched_setscheduler() is called. This function enables changing
> > +      the scheduling policy and/or RT priority of a thread. If the task is
> > +      on a runqueue, it will be first dequeued, changes will be made and
> > +      then enqueued.
> > +    - When moving tasks between scheduling groups. The runqueue of the tasks
> > +      is changed when moving between groups. For this purpose if the task
> > +      is running on a queue, it is first dequeued with DEQUEUE_SAVE, DEQUEUE_MOVE
> > +      and DEQUEUE_NOCLOCK flags set, followed by which scheduler function to
> > +      change the tsk->se.cfs_rq and tsk->se.parent and then task is enqueued
> > +      on the runqueue with the same flags used in dequeue.
> > +
> > +pick_next_task
> > +    Called by __schedule() to pick the next best task to run.
> > +    Scheduling class structure has a pointer pointing to the next scheduling
> > +    class type and each scheduling class is linked using a singly linked list.
> > +    The __schedule() function iterates through the corresponding
> > +    functions of the scheduler classes in priority order to pick up the next
> > +    best task to run. Since tasks belonging to the idle class and fair class
> > +    are frequent, the scheduler optimizes the picking of next task to call
> > +    the pick_next_task_fair() if the previous task was of the similar
> > +    scheduling class.
> > +
> > +put_prev_task
> > +    Called by the scheduler when a running task is being taken off a CPU.
> > +    The behavior of this function depends on individual scheduling classes
> > +    and called in the following cases.
> > +
> > +    - When do_set_cpus_allowed() is called and if the task is currently running.
> > +    - When scheduler pick_next_task() is called, the put_prev_task() is
> > +      called with the previous task as function argument.
> > +    - When rt_mutex_setprio() is called and if the task is currently running.
> > +    - When user changes the nice value of the task and if the task is
> > +      currently running.
> > +    - When __sched_setscheduler() is called and if the task is currently
> > +      running.
> > +    - When moving tasks between scheduling groups through the sched_move_task()
> > +      and if the task is ćurrently running.
> > +
> > +    In CFS class this function is used put the currently running task back
>
>                                      used to put
>
> > +    in to the CFS RB tree. When a task is running it is dequeued from the tree
>
>        into                                                                  tree.
>
>
> > +    This is to prevent redundant enqueue's and dequeue's for updating its
> > +    vruntime. vruntime of tasks on the tree needs to be updated by update_curr()
> > +    to keep the tree in sync. In DL and RT classes additional tree is
>
> None of the current sched documentation uses "DL" for deadline.
> It is used in some of the source code.  Anyway, if you keep using it, you
> should tell what it means somewhere.
Fixed to SCHED_DEADLINE in v3
>
> > +    maintained for facilitating task migration between CPUs through push
> > +    operation between runqueues for load balancing. Task will be added to
> > +    this queue if it is present on the scheduling class rq and task has
> > +    affinity to more than one CPU.
> > +
> > +set_next_task
> > +    Pairs with the put_prev_task(), this function is called when the next
> > +    task is set to run on the CPU. This function is called in all the places
> > +    where put_prev_task is called to complete the 'change'. Change is defined
> > +    as the following sequence of calls::
> > +
> > +         - dequeue task
> > +         - put task
> > +         - change the property
> > +         - enqueue task
> > +         - set task as current task
> > +
> > +    It resets the run time statistics for the entity with
> > +    the runqueue clock.
> > +    In case of CFS scheduling class, it will set the pointer to the current
> > +    scheduling entity to the picked task and accounts bandwidth usage on
> > +    the cfs_rq. In addition it will also remove the current entity from the
> > +    CFS runqueue for vruntime update optimization opposite to what was done
> > +    in put_prev_task.
> > +    For the DL and RT classes it will
> > +
> > +    - dequeue the picked task from the tree of pushable tasks
> > +    - update the load average in case the previous task belonged to another
> > +      class
> > +    - queues the function to push tasks from current runqueue to other CPUs
> > +      which can preempt and start execution. Balance callback list is used.
> > +
> > +task_tick
> > +    Called from scheduler_tick(), hrtick() and sched_tick_remote() to update
> > +    the current task statistics and load averages. Also restarting the HR
> > +    tick timer is done if HR timers are enabled.
>
> Likewise, "HR" is not currently used in any scheduler documentation.
Fixed to high resoution timer in v3
> At a minimum it needs a brief explanation.
>
> > +    scheduler_tick() runs at 1/HZ and is called from the  timer interrupt
>
>                                             drop one space ^^
>
> > +    handler of the Kernel internal timers.
> > +    hrtick() is called from HR Timers to deliver an accurate preemption tick.
>
>                                                             drop ending period ^^
>
> > +    as the regular scheduler tick that runs at 1/HZ can be too coarse when
> > +    nice levels are used.
> > +    sched_tick_remote() Gets called by the offloaded residual 1Hz scheduler
> > +    tick. In order to reduce interruptions to bare metal tasks, it is possible
> > +    to outsource these scheduler ticks to the global workqueue so that a
> > +    housekeeping CPU handles those remotely
>
> end with '.' period.
>
> > +
> > +select_task_rq
> > +    Called by scheduler to get the CPU to assign a task to and migrating
> > +    tasks between CPUs. Flags describe the reason the function was called.
> > +
> > +    Called by try_to_wake_up() with SD_BALANCE_WAKE flag which wakes up a
> > +    sleeping task.
> > +    Called by wake_up_new_task() with SD_BALANCE_FORK flag which wakes up a
> > +    newly forked task.
> > +    Called by sched_exec() wth  SD_BALANCE_EXEC which is called from execv
>
>                               with SD_BALANCE_EXEC (one less space there)
>
> > +    syscall.
> > +    DL class decides the CPU on which the task should be woken up based on
> > +    the deadline. and RT class decides based on the RT priority. Fair
>
>        the deadline. RT class decides
>
> > +    scheduling class     balances load by selecting the idlest CPU in the
>
>        fewer spaces  ^^^^^^
fixed in v3.
>
> > +    idlest group, or under certain conditions an idle sibling CPU if the
> > +    domain has SD_WAKE_AFFINE set.
> > +
> > +balance
> > +    Called by pick_next_task() from scheduler to enable scheduling classes
> > +    to pull tasks from runqueues of other CPUs for balancing task execution
> > +    between the CPUs.
> > +
> > +task_fork
> > +    Called from sched_fork() of scheduler which assigns a task to a CPU.
> > +    Fair scheduling class updates runqueue clock, runtime statistics and
> > +    vruntime for the scheduling entity.
> > +
> > +yield_task
> > +    Called from SYSCALL sched_yield to yield the CPU to other tasks.
> > +    DL class forces the runtime of the task to zero using a special flag
> > +    and dequeues the task from its trees. RT class requeues the task entities
> > +    to the end of the run list. Fair scheduling class implements the buddy
> > +    mechanism.  This allows skipping onto the next highest priority se at
>
>                                                                        se??
>
> > +    every level in the CFS tree, unless doing so would introduce gross
> > +    unfairness in CPU time distribution.
> > +
> > +check_preempt_curr
> > +    Check whether the task that woke up should pre-empt the currently
>
>                                                   preempt
>
> > +    running task. Called by scheduler,
> > +    - when moving queued task to new runqueue
> > +    - ttwu()
> > +    - when waking up newly created task for the first time.
> > +
> > +    DL class compare the deadlines of the tasks and calls scheduler function
>
>                 compares
>
> > +    resched_curr() if the preemption is needed. In case the deadliines are
>
>                                                                deadlines
>
> > +    equal migratilbility of the tasks is used a criteria for preemption.
>
>              migratability
>
> > +    RT class behaves the same except it uses RT priority for comparison.
> > +    Fair class sets the buddy hints before calling resched_curr() to preemempt.
>
>                                                                         preempt.
>
> > +
> > +Scheduler sets the scheduler class for each task based on its priority.
> > +Tasks assigned with SCHED_NORMAL, SCHED_IDLE and SCHED_BATCH call
> > +fair_sched_class hooks and tasks assigned with SCHED_RR and
> > +SCHED_FIFO call rt_sched_class hooks. Tasks assigned with SCHED_DEADLINE
> > +policy calls dl_sched_class hooks.
> > diff --git a/Documentation/scheduler/scheduler-api.rst b/Documentation/scheduler/scheduler-api.rst
> > new file mode 100644
> > index 000000000000..068cdbdbdcc6
> > --- /dev/null
> > +++ b/Documentation/scheduler/scheduler-api.rst
> > @@ -0,0 +1,30 @@
> > +.. SPDX-License-Identifier: GPL-2.0+
> > +
> > +=============================
> > +Scheduler related functions
> > +=============================
> > +
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: __schedule
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: scheduler_tick
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: try_to_wake_up
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: do_task_dead
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: preempt_schedule_irq
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: prepare_task_switch
> > +
> > +.. kernel-doc:: kernel/sched/core.c
> > +     :functions: finish_task_switch
> > +
> > +.. kernel-doc:: kernel/sched/sched.h
> > +     :functions: rq
> > \ No newline at end of file
>
fixed in v3.
> Please fix that warning.
>
> Thanks. This looks helpful.
>
> --
> ~Randy
>

-John