linux-kernel - [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250220093257.9380-1-kprateek.nayak@amd.com>
Date: Thu, 20 Feb 2025 09:32:35 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Valentin Schneider <vschneid@...hat.com>, "Ben
 Segall" <bsegall@...gle.com>, Thomas Gleixner <tglx@...utronix.de>, "Andy
 Lutomirski" <luto@...nel.org>, <linux-kernel@...r.kernel.org>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, "Sebastian Andrzej
 Siewior" <bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>,
	<linux-rt-devel@...ts.linux.dev>, Tejun Heo <tj@...nel.org>, "Frederic
 Weisbecker" <frederic@...nel.org>, Barret Rhoden <brho@...gle.com>, "Petr
 Mladek" <pmladek@...e.com>, Josh Don <joshdon@...gle.com>, Qais Yousef
	<qyousef@...alina.io>, "Paul E. McKenney" <paulmck@...nel.org>, David Vernet
	<dvernet@...a.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R.
 Shenoy" <gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>
Subject: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode

Hello everyone,

tl;dr

This is a spiritual successor to Valentin's series to defer CFS throttle
to user entry [1] however this does things differently: Instead of
moving to a per-task throttling, this retains the throttling at cfs_rq
level while tracking the tasks preempted in kernel mode queued on the
cfs_rq.

On a throttled hierarchy, only the kernel mode preempted entities are
considered runnable and the pick_task_fair() will only select among these
entities in EEVDF fashion using a min-heap approach (Patch 6-8)

This is an early prototype with few issues (please see "Broken bits and
disclaimer" section) however I'm putting it out to get some early
feedback.

This series is based on

    git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit d34e798094ca ("sched/fair: Refactor can_migrate_task() to
elimate looping")

With that quick tl;dr sorted, let us jump right into it

Problem statement
=================

Following are Valentin's words taken from [1] with few changes to the
reference commits now that PREEMPT_RT has landed upstream:

CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
  the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.

For PREEMPT_RT, the upstream kernel added commit 49a17639508c ("softirq:
Use a dedicated thread for timer wakeups on PREEMPT_RT.") which helped
this scenario for non-rwlock locks by ensuring the throttled task would
get PI'd to FIFO1 (ktimers' default priority). Unfortunately, rwlocks
cannot sanely do PI as they allow multiple readers.

Throttle deferral was discussed at OSPM'24 [2] and at LPC'24 [3] with
recordings for both available on Youtube.

Proposed approach
=================

This approach builds on Ben Segall's proposal in [4] which marked the
task in schedule() when exiting to usermode by setting
"in_return_to_user" flag except this prototype takes it a step ahead and
marks a "kernel critical section" within the syscall boundary using a
per-task "kernel_cs_count".

The rationale behind this approach is that the task can only hold
kernel resources when running in kernel mode in preemptible context. In
this POC, the entire syscall boundary is marked as a kernel critical
section but in the future, the API can be used to mark fine grained
boundaries like between an up_read(), down_read() or up_write(),
down_write() pair.

Within a kernel critical section, throttling events are deferred until
the task's "kernel_cs_count" hits 0. Currently this count is an integer
to catch any cases where the count turns negative as a result of
oversights on my part but this could be changed to a preempt count like
mechanism to request a resched.

            cfs_rq throttled               picked again
		  v                              v

    ----|*********| (preempted by tick / wakeup) |***********| (full throttle)

        ^                                                    ^
critical section   cfs_rq is throttled partially     critical section
      entry           since the task is still              exit
                  runnable as it was preempted in
                      kernel critical section

The EEVDF infrastructure is extended to tracks the avg_vruntime and the
avg_load of only those entities preempted in kernel mode. When a cfs_rq
is throttled, it uses these metrics to select among the kernel mode
preempted tasks and running them till they exit to user mode.
pick_eevdf() is made aware that it is operating on a throttled hierarchy
to only select among these tasks that were preempted in kernel mode (and
the sched entities of cfs_rq that lead to them). When a throttled
entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen but
the hierarchy remains accessible for picking until that point.

          root
        /     \
       A       B * (throttled)
      ...    / | \
            0  1* 2*

    (*) Preempted in kernel mode

  o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
  o avg_kcs_load = load(1) + load(2)

  o throttled_vruntime_eligible:

      entity preempted in kernel mode &&
      entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime

  o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
    that propagates the smallest vruntime of the all the entities in
    the subtree that are preempted in kernel mode. If they were
    executing in usermode when preempted, this will be set to LLONG_MAX.

    This is used to construct a min-heap and select through the
    entities. Consider rbtree of B to look like:

         1*
       /   \
      2*    0

      min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX:
      min_kcs_vruntime = min(se->min_kcs_vruntime,
                             __node_2_se(rb_left)->min_kcs_vruntime,
                             __node_2_se(rb_right)->min_kcs_vruntime);

   pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
   tree to first check the left subtree for eligibility, then the node
   itself, and then the right subtree.

With proactive tracking, throttle deferral can retain throttling at per
cfs_rq granularity instead of moving to a per-task model.

On the throttling side, a third throttle state is introduced
CFS_THROTTLED_PARTIAL which indicates that the cfs_rq has run out of its
bandwidth but contains runnable (kernel mode preempted) entities. A
partially throttled cfs_rq is added to the throttled list so it can be
unthrottled in a timely manner during distribution making all the tasks
queued on it runnable again. Throttle status can be promoted or demoted
depending on whether a kernel mode preempted tasks exits to user mode,
blocks, or wakes up on the hierarchy.

The most tricky part of the series is handling of the current task which
is kept out of kernel mode status tracking since a running task can make
multiple syscall and propagating those signals up the cgroup hierarchy
can be expensive. Instead the status of the current task is only looked
at during schedule() to make throttling decisions. This leads to some
interesting considerations in some paths (see curr_h_is_throttled() in
Patch 8 for EEVDF handling)

Testing
=======

The series was tested for correctness using a single hackbench instance
in a bandwidth controlled cgroup to observe the following set of events:

  o Throttle into partial state followed by a full throttle

   hackbench-1602    [044] dN.2.    55.684108: throttle_cfs_rq: cfs_rq throttled: (0) -> (1)
   hackbench-1602    [044] dN.2.    55.684111: pick_task_fair: Task picked on throttled hierarchy: hackbench(1602)
   hackbench-1602    [044] d....    55.684117: sched_notify_critical_section_exit: Pick on throttled requesting resched
   hackbench-1602    [044] dN.2.    55.684119: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
   ...
      <idle>-0       [044] dNh2.    55.689677: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (0)

  o Full throttle demoted to partial and then reupgraded

      <idle>-0       [006] dN.2.    55.592145: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (1)
      <idle>-0       [006] dN.2.    55.592147: pick_task_fair: Task picked on throttled hierarchy: hackbench(1584)
   ...
   hackbench-1584    [006] d....    55.592154: sched_notify_critical_section_exit: Pick on throttled requesting resched
   ...
   hackbench-1584    [006] dN.2.    55.592157: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
 
  [ Note: (1) corresponds to CFS_THROTTLE_PARTIAL (throttled but with
    kernel mode preempted entities that are still runnable) and (2)
    corresponds to CFS_THROTTLED (full throttle with no runnable task) ]

The stability testing was done by running the following script:

    #!/bin/bash
    
    mkdir /sys/fs/cgroup/CG0
    echo "+cpu" > /sys/fs/cgroup/CG0/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/cgroup.subtree_control
    mkdir /sys/fs/cgroup/CG0/CG00
    mkdir /sys/fs/cgroup/CG0/CG01
    echo "+cpu" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
    mkdir /sys/fs/cgroup/CG0/CG00/CG000
    mkdir /sys/fs/cgroup/CG0/CG00/CG001
    mkdir /sys/fs/cgroup/CG0/CG01/CG000
    mkdir /sys/fs/cgroup/CG0/CG01/CG001
    echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
    echo "3200000 100000" > /sys/fs/cgroup/CG0/cpu.max
    echo "1600000 100000" > /sys/fs/cgroup/CG0/CG00/cpu.max
    echo "1600000 100000" > /sys/fs/cgroup/CG0/CG01/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG000/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG001/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG000/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG001/cpu.max
    
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    
    time wait;

---

The above script was stress tested with 1,2,4, and 8 groups in a 64vCPU
VM, and with upto 64 groups on a 3rd Generation EPYC system (2 x
64C/128T) without experiencing any obvious issues.

Since I had run into various issues initially with this prototype, I've
retained a debug patch towards the end that dumps the rbtree state when
pick_eevdf() fails to select a task. Since it is a debug patch, all the
checkpatch warnings and errors have been ignored. I apologize in advance
for any inconvenience there.

Broken bits and disclaimer
==========================

In ascending order of priority:

- Few SCHED_WARN_ON() added assume only the syscall boundary is a
  critical section. These can go if this infrastructure is used for more
  fine-grained deferral.

- !CONFIG_CFS_BANDWIDTH and !CONFIG_SCHED_AUTOGROUP has only been build
  tested. Most (almost all) of the testing was done with
  CONFIG_CFS_BANDWIDTH enabled on a x86 system and a VM.

- The current PoC only support achitectures that select GENERIC_ENTRY.
  Other architectures need to use the sched_notify_critical_section*()
  API to mark the kernel critical section boundaries.

- Some abstractions are not very clean. For example, generic layer uses
  se_in_kernel() often passing &p->se as the argument. Instead, the
  task_struct can be sent directly to a wrapper around se_in_kernel().

- There is currently no way to opt out of throttle deferral however, it
  is possible to introduce a sysctl hook to enable / disable the feature
  on demand by only allowing propagation of "kernel_cs_count" down the
  cgroup hierarchy when the feature is enabled and a static key to guard
  key decision points that can lead to partial throttle.A simpler option
  is a boot time switch.

- On a partially throttled hierarchy, only kernel mode preempted tasks
  and the entities that lead to them can make progress. This can
  theoretically lead them to accumulate unbounded amount of -ve lag
  which can lead to long wait times for these entities when the
  hierarchy is unthrottled.

- PELT requires taking some look at since the number of runnable
  entities on a throttled hierarchy is not the same as "h_nr_runnable".
  Audit all references to "cfs_rq->h_nr_runnable" and "se->runnable_sum"
  and address those paths accordingly.

- Partially throttled hierarchies do not set the hierarchical indicators
  namely "cfs_rq->throttled_count" anymore. Audit all uses of
  throttled_hierarchy() to see if they need adjusting. At least two
  cases, one in throttled_lb_pair(), and the other in
  class->task_is_throttled() callback used by core scheduling requires
  auditing.

Clarifications required
=======================

- A kernel thread on a throttled hierarchy will always partially
  unthrottle it and make itself runnable. Is this possible? (at least
  one case with osnoise was found) Is this desired?

- A new task always starts off in kernel mode and can unthrottle the
  hierarchy to make itself runnable. Is this a desired behavior?

- There are a few "XXX" and "TODO" along that way that I'll be
  revisiting. Any comments there is greatly appreciated.

- Patch 21 seems to solve a heap integrity issue that I have
  encountered. Late into stress testing I occasionally came across a
  scenario where the rbroot does not have the correct min_kcs_vruntime
  set. An example dump is as follows:

    (Sorry in advance for the size of this dump)

    CPU43: ----- current task ----
    CPU43: pid(0) comm(swapper/43) task_cpu(43) task_on_rq_queued(1) task_on_rq_migrating(0) normal_policy(1) idle_policy(0)
    CPU43: ----- current task done ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(1) h_nr_queued(24) h_nr_runnable(24) nr_queued(24) gse->kernel_cs_count(1)
    CPU43: cfs_rq EEVDF: avg_vruntime(3043328) avg_load(24576) avg_kcs_vruntime(125952) avg_kcs_load(1024)
    CPU43: ----- cfs_rq done ----
    CPU43: ----- rbtree traversal: root ----
    CPU43: se: load(1024) vruntime(6057209863) entity_key(12716) deadline(6059998121) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057198486) entity_key(1339) deadline(6059991924) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057128111) entity_key(-69036) deadline(6059919846) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057202668) entity_key(5521) deadline(6059894986) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057124320) entity_key(-72827) deadline(6059915644) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057196261) entity_key(-886) deadline(6059984139) min_vruntime(6057176590) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057176590) entity_key(-20557) deadline(6059967663) min_vruntime(6057176590) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057198406) entity_key(1259) deadline(6059990321) min_vruntime(6057198406) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057206114) entity_key(8967) deadline(6059995624) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057202212) entity_key(5065) deadline(6059994879) min_vruntime(6057197431) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057201766) entity_key(4619) deadline(6059993029) min_vruntime(6057197431) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057197431) entity_key(284) deadline(6059992361) min_vruntime(6057197431) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057201845) entity_key(4698) deadline(6059994381) min_vruntime(6057201845) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057201956) entity_key(4809) deadline(6059995233) min_vruntime(6057201956) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057203193) entity_key(6046) deadline(6059996320) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057199892) entity_key(2745) deadline(6059995624) min_vruntime(6057199892) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057205914) entity_key(8767) deadline(6059996907) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057197270) entity_key(123) deadline(6059997270) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(1) min_kcs_vruntime(6057197270) pick_entity(1)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057210834) entity_key(13687) deadline(6060002729) min_vruntime(6057209600) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057209600) entity_key(12453) deadline(6060000733) min_vruntime(6057209600) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057215347) entity_key(18200) deadline(6060005989) min_vruntime(6057215103) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057215103) entity_key(17956) deadline(6060004643) min_vruntime(6057215103) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057215276) entity_key(18129) deadline(6060007381) min_vruntime(6057215276) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057216042) entity_key(18895) deadline(6060008258) min_vruntime(6057216042) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- rbtree done ----
    CPU43: ----- parent cfs_rq ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
    CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(312) avg_kcs_vruntime(0) avg_kcs_load(312)
    CPU43: ----- cfs_rq done ----
    CPU43: ----- parent cfs_rq ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
    CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(206) avg_kcs_vruntime(0) avg_kcs_load(206)
    CPU43: ----- cfs_rq done ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(0) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(-1)
    CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(156) avg_kcs_vruntime(0) avg_kcs_load(156)
    CPU43: ----- cfs_rq done ----

  The vruntime of entity corresponding to "pick_entity(1)" should have
  been propagated to the root as min_kcs_vruntime but that does not seem
  to happen. This happens unpredictably and the amount of logs are
  overwhelming to make any sense of which series of rbtree operations
  lead up to it but grouping all the min_* members and setting it as
  RBAUGMENT as implemented in Patch 21 seem to make the issue go away.

  If folks think this is a genuine bugfix and not me masking a bigger
  issue, I can send this fix separately for min_vruntime and min_slice
  propagation and rebase my series on top of it.

Credits
=======

This series is based on the former approaches from:

Valentin Schneider <vschneid@...hat.com>
Ben Segall <bsegall@...gle.com>

All the credit for this idea really is due to them - while the mistakes
in implementing it are likely mine :)

Thanks to Swapnil, Dhananjay, Ravi, and Gautham for helping devise the
test strategy, discussions around the intricacies of CPU controllers,
and their help at tackling multiple subtle bugs I encountered along the
way.

References
==========

[1]: https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
[2]: https://youtu.be/YAVxQKQABLw
[3]: https://youtu.be/_N-nXJHiDNo
[4]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com

---
K Prateek Nayak (22):
  kernel/entry/common: Move syscall_enter_from_user_mode_work() out of
    header
  sched/fair: Convert "se->runnable_weight" to unsigned int and pack the
    struct
  [PoC] kernel/entry/common: Mark syscall as a kernel critical section
  [PoC] kernel/sched: Inititalize "kernel_cs_count" for new tasks
  sched/fair: Track EEVDF stats for entities preempted in kernel mode
  sched/fair: Propagate the min_vruntime of kernel mode preempted entity
  sched/fair: Propagate preempted entity information up cgroup hierarchy
  sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled
    hierarchy
  sched/fair: Introduce cfs_rq throttled states in preparation for
    partial throttling
  sched/fair: Prepare throttle_cfs_rq() to allow partial throttling
  sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status
  sched/fair: Prepare bandwidth distribution to unthrottle partial
    throttles right away
  sched/fair: Correct the throttle status supplied to pick_eevdf()
  sched/fair: Mark a task if it was picked on a partially throttled
    hierarchy
  sched/fair: Call resched_curr() from sched_notify_syscall_exit()
  sched/fair: Prepare enqueue to partially unthrottle cfs_rq
  sched/fair: Clear pick on throttled indicator when task leave fair
    class
  sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled
    hierarchy
  sched/fair: Ignore in-kernel indicators for running task outside of
    schedule()
  sched/fair: Implement determine_throttle_state() for partial throttle
  [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together
    for propagation
  [DEBUG] sched/fair: Debug pick_eevdf() returning NULL!

 include/linux/entry-common.h |  10 +-
 include/linux/sched.h        |  45 +-
 init/init_task.c             |   5 +-
 kernel/entry/common.c        |  17 +
 kernel/entry/common.h        |   4 +
 kernel/sched/core.c          |   8 +-
 kernel/sched/debug.c         |   2 +-
 kernel/sched/fair.c          | 935 ++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h         |   8 +-
 9 files changed, 950 insertions(+), 84 deletions(-)


base-commit: d34e798094ca7be935b629a42f8b237d4d5b7f1d
-- 
2.43.0