Message-ID: <b2386349-2879-4dce-afb0-f6486c24c117@amd.com>
Date: Thu, 20 Feb 2025 16:48:58 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Valentin Schneider
<vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>, Thomas Gleixner
<tglx@...utronix.de>, Andy Lutomirski <luto@...nel.org>,
<linux-kernel@...r.kernel.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
"Sebastian Andrzej Siewior" <bigeasy@...utronix.de>, Clark Williams
<clrkwllms@...nel.org>, <linux-rt-devel@...ts.linux.dev>, Tejun Heo
<tj@...nel.org>, Frederic Weisbecker <frederic@...nel.org>, Barret Rhoden
<brho@...gle.com>, Petr Mladek <pmladek@...e.com>, Josh Don
<joshdon@...gle.com>, Qais Yousef <qyousef@...alina.io>, "Paul E. McKenney"
<paulmck@...nel.org>, David Vernet <dvernet@...a.com>, "Gautham R. Shenoy"
<gautham.shenoy@....com>, Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
user mode
Hello Peter,
On 2/20/2025 4:25 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
>> Proposed approach
>> =================
>>
>> This approach builds on Ben Segall's proposal in [4], which marked the
>> task in schedule() when exiting to usermode by setting an
>> "in_return_to_user" flag, except this prototype takes it a step further
>> and marks a "kernel critical section" within the syscall boundary using
>> a per-task "kernel_cs_count".
>>
>> The rationale behind this approach is that a task can only hold
>> kernel resources when running in kernel mode in a preemptible context.
>> In this POC, the entire syscall boundary is marked as a kernel critical
>> section, but in the future, the API can be used to mark fine-grained
>> boundaries, such as between a down_read(), up_read() or a down_write(),
>> up_write() pair.
>>
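
To put the above in code: conceptually, the syscall boundary just bumps
the per-task count on entry and drops it on exit. A simplified sketch
follows; the helper names and the deferred-throttle check here are
illustrative, not necessarily what the patches use:

    static inline void kernel_cs_enter(struct task_struct *p)
    {
            p->kernel_cs_count++;
    }

    static inline void kernel_cs_exit(struct task_struct *p)
    {
            p->kernel_cs_count--;
            /* The count is signed precisely to catch unbalanced pairs. */
            WARN_ON_ONCE(p->kernel_cs_count < 0);

            /*
             * A throttle deferred inside the critical section takes
             * effect here. task_has_deferred_throttle() is a made-up
             * placeholder for however the deferred event is recorded.
             */
            if (!p->kernel_cs_count && task_has_deferred_throttle(p))
                    set_tsk_need_resched(p);
    }
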
>> Within a kernel critical section, throttling events are deferred until
>> the task's "kernel_cs_count" hits 0. Currently this count is an integer
>> to catch any cases where the count turns negative as a result of
>> oversights on my part, but this could be changed to a preempt-count-like
>> mechanism to request a resched.
>>
>>        cfs_rq throttled                       picked again
>>               v                                    v
>>
>> ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
>>
>>     ^                                                    ^
>> critical section   cfs_rq is throttled partially    critical section
>> entry              since the task is still          exit
>>                    runnable as it was preempted
>>                    in kernel critical section
>>
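
To clarify the "preempted in kernel critical section" state in the
diagram: an entity counts as "in kernel" if it is a task that was
preempted with a non-zero count. Conceptually (illustrative sketch
reusing the existing fair.c helpers; group entities are covered via the
min_kcs_vruntime propagation described below):

    /* Was this entity's task preempted inside a kernel critical section? */
    static inline bool se_in_kernel(struct sched_entity *se)
    {
            return entity_is_task(se) && task_of(se)->kernel_cs_count > 0;
    }
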
>> The EEVDF infrastructure is extended to track the avg_vruntime and the
>> avg_load of only those entities preempted in kernel mode. When a cfs_rq
>> is throttled, it uses these metrics to select among the kernel-mode
>> preempted tasks and run them until they exit to user mode.
>> pick_eevdf() is made aware that it is operating on a throttled hierarchy
>> to only select among those tasks that were preempted in kernel mode (and
>> the sched entities of the cfs_rqs that lead to them). When a throttled
>> entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen, but
>> the hierarchy remains accessible for picking until that point.
>>
>>                  root
>>                 /    \
>>                A      B * (throttled)
>>               ...    / | \
>>                     0  1* 2*
>>
>> (*) Preempted in kernel mode
>>
>> o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
>> o avg_kcs_load = load(1) + load(2)
>>
>> o throttled_vruntime_eligible:
>>
>>      entity preempted in kernel mode &&
>>      entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
>>
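
In code, the accounting and the eligibility test above could mirror the
existing avg_vruntime_add() / vruntime_eligible() helpers. A sketch
using the field names from this description; the actual patches may
differ in detail:

    static void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq,
                                     struct sched_entity *se)
    {
            unsigned long weight = scale_load_down(se->load.weight);

            cfs_rq->avg_kcs_vruntime += entity_key(cfs_rq, se) * weight;
            cfs_rq->avg_kcs_load += weight;
    }

    /* avg_kcs_vruntime_sub() is the mirror image on dequeue. */

    static bool throttled_vruntime_eligible(struct cfs_rq *cfs_rq,
                                            struct sched_entity *se)
    {
            if (!se_in_kernel(se))
                    return false;

            /* Same scaled-key comparison as vruntime_eligible(). */
            return entity_key(cfs_rq, se) * (s64)cfs_rq->avg_kcs_load <=
                   cfs_rq->avg_kcs_vruntime;
    }
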
>> o rbtree is augmented with a "min_kcs_vruntime" field in the sched
>>    entity that propagates the smallest vruntime of all the entities in
>>    the subtree that are preempted in kernel mode. If an entity was
>>    executing in usermode when preempted, its contribution is LLONG_MAX.
>>
>>    This is used to construct a min-heap and select among the
>>    entities. Consider the rbtree of B to look like:
>>
>>           1*
>>          /  \
>>         2*   0
>>
>>    min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;
>>    min_kcs_vruntime = min(se->min_kcs_vruntime,
>>                           __node_2_se(rb_left)->min_kcs_vruntime,
>>                           __node_2_se(rb_right)->min_kcs_vruntime);
>>
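
This follows the same pattern as the existing min_vruntime rbtree
augment in fair.c. A sketch of the compute side, to be wired up with
RB_DECLARE_CALLBACKS() like the current augments (again illustrative):

    static inline void __min_kcs_vruntime_update(struct sched_entity *se,
                                                 struct rb_node *node)
    {
            if (node) {
                    struct sched_entity *rse = __node_2_se(node);

                    if (rse->min_kcs_vruntime < se->min_kcs_vruntime)
                            se->min_kcs_vruntime = rse->min_kcs_vruntime;
            }
    }

    static inline bool min_kcs_vruntime_update(struct sched_entity *se, bool exit)
    {
            s64 old_min = se->min_kcs_vruntime;
            struct rb_node *node = &se->run_node;

            /* LLONG_MAX for user-mode entities so they never win min(). */
            se->min_kcs_vruntime = se_in_kernel(se) ?
                                   (s64)se->vruntime : LLONG_MAX;
            __min_kcs_vruntime_update(se, node->rb_right);
            __min_kcs_vruntime_update(se, node->rb_left);

            return se->min_kcs_vruntime == old_min;
    }
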
>> pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
>> tree to first check the left subtree for eligibility, then the node
>> itself, and then the right subtree.
>>
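
The walk itself is then a standard heap search on the deadline-sorted
tree, with the kernel-mode eligibility filter plugged in. Roughly, and
omitting the curr handling that the real __pick_eevdf() has (sketch
only):

    /* Does this subtree contain an eligible kernel-mode entity? */
    static bool kcs_subtree_eligible(struct cfs_rq *cfs_rq,
                                     struct sched_entity *se)
    {
            s64 key;

            if (se->min_kcs_vruntime == LLONG_MAX)
                    return false;   /* no kernel-mode entity below */

            key = se->min_kcs_vruntime - (s64)cfs_rq->min_vruntime;
            return key * (s64)cfs_rq->avg_kcs_load <=
                   cfs_rq->avg_kcs_vruntime;
    }

    static struct sched_entity *pick_kcs_eevdf(struct cfs_rq *cfs_rq)
    {
            struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
            struct sched_entity *best = NULL;

            while (node) {
                    struct sched_entity *se = __node_2_se(node);

                    /* Left subtree holds earlier virtual deadlines. */
                    if (node->rb_left &&
                        kcs_subtree_eligible(cfs_rq, __node_2_se(node->rb_left))) {
                            node = node->rb_left;
                            continue;
                    }

                    if (throttled_vruntime_eligible(cfs_rq, se)) {
                            best = se;
                            break;
                    }

                    node = node->rb_right;
            }

            return best;
    }
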
>
> *groan*... why not actually dequeue the tasks and only retain those with
> non-zero cs_count? That avoids having to duplicate everything, no?
The rationale there was that, with a growing number of tasks on the
cfs_rq, the throttle path has to perform a lot of dequeues, and the
unthrottle at distribution has to enqueue all the dequeued threads back.

This is one way to keep all the tasks queued but allow the pick to only
select among those that were preempted in kernel mode.

Since per-task throttling needs to tag, dequeue, and re-enqueue each
task, I'm putting this out as an alternate approach that does not
increase the complexity of the tg_tree walks, which Ben had noted on
Valentin's series [1]. Instead, we retain the per-cfs_rq throttling
at the cost of some stats tracking at the enqueue and dequeue
boundaries.
If you have strong feelings against any specific part, or the entirety
of this approach, please do let me know, and I'll do my best to see if
a tweaked approach or an alternate implementation can scale well with
growing thread counts (or at least try to defend the bits in question
if they still hold merit).
Any and all feedback is appreciated :)
[1] https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
--
Thanks and Regards,
Prateek