lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250220105558.GJ34567@noisy.programming.kicks-ass.net>
Date: Thu, 20 Feb 2025 11:55:58 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Andy Lutomirski <luto@...nel.org>, linux-kernel@...r.kernel.org,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Clark Williams <clrkwllms@...nel.org>,
	linux-rt-devel@...ts.linux.dev, Tejun Heo <tj@...nel.org>,
	Frederic Weisbecker <frederic@...nel.org>,
	Barret Rhoden <brho@...gle.com>, Petr Mladek <pmladek@...e.com>,
	Josh Don <joshdon@...gle.com>, Qais Yousef <qyousef@...alina.io>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	David Vernet <dvernet@...a.com>,
	"Gautham R. Shenoy" <gautham.shenoy@....com>,
	Swapnil Sapkal <swapnil.sapkal@....com>
Subject: Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to
 user mode

On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
> Proposed approach
> =================
> 
> This approach builds on Ben Segall's proposal in [4] which marked the
> task in schedule() when exiting to usermode by setting
> "in_return_to_user" flag except this prototype takes it a step ahead and
> marks a "kernel critical section" within the syscall boundary using a
> per-task "kernel_cs_count".
> 
> The rationale behind this approach is that the task can only hold
> kernel resources when running in kernel mode in preemptible context. In
> this POC, the entire syscall boundary is marked as a kernel critical
> section but in the future, the API can be used to mark fine grained
> boundaries like between an up_read(), down_read() or up_write(),
> down_write() pair.
> 
> Within a kernel critical section, throttling events are deferred until
> the task's "kernel_cs_count" hits 0. Currently this count is an integer
> to catch any cases where the count turns negative as a result of
> oversights on my part but this could be changed to a preempt count like
> mechanism to request a resched.
> 
>             cfs_rq throttled               picked again
> 		  v                              v
> 
>     ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
> 
>         ^                                                    ^
> critical section   cfs_rq is throttled partially     critical section
>       entry           since the task is still              exit
>                   runnable as it was preempted in
>                       kernel critical section
> 
> The EEVDF infrastructure is extended to tracks the avg_vruntime and the
> avg_load of only those entities preempted in kernel mode. When a cfs_rq
> is throttled, it uses these metrics to select among the kernel mode
> preempted tasks and running them till they exit to user mode.
> pick_eevdf() is made aware that it is operating on a throttled hierarchy
> to only select among these tasks that were preempted in kernel mode (and
> the sched entities of cfs_rq that lead to them). When a throttled
> entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen but
> the hierarchy remains accessible for picking until that point.
> 
>           root
>         /     \
>        A       B * (throttled)
>       ...    / | \
>             0  1* 2*
> 
>     (*) Preempted in kernel mode
> 
>   o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
>   o avg_kcs_load = load(1) + load(2)
> 
>   o throttled_vruntime_eligible:
> 
>       entity preempted in kernel mode &&
>       entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
> 
>   o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
>     that propagates the smallest vruntime of the all the entities in
>     the subtree that are preempted in kernel mode. If they were
>     executing in usermode when preempted, this will be set to LLONG_MAX.
> 
>     This is used to construct a min-heap and select through the
>     entities. Consider rbtree of B to look like:
> 
>          1*
>        /   \
>       2*    0
> 
>       min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX:
>       min_kcs_vruntime = min(se->min_kcs_vruntime,
>                              __node_2_se(rb_left)->min_kcs_vruntime,
>                              __node_2_se(rb_right)->min_kcs_vruntime);
> 
>    pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
>    tree to first check the left subtree for eligibility, then the node
>    itself, and then the right subtree.
> 

*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ