Message-ID: <20250331064204.GB1571554@bytedance>
Date: Mon, 31 Mar 2025 14:42:04 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
linux-kernel@...r.kernel.org, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
throttle
Hi Chengming,
On Fri, Mar 14, 2025 at 07:07:10PM +0800, Chengming Zhou wrote:
> On 2025/3/14 17:42, Aaron Lu wrote:
> > On Fri, Mar 14, 2025 at 04:39:41PM +0800, Chengming Zhou wrote:
> > > On 2025/3/13 15:21, Aaron Lu wrote:
> > > > From: Valentin Schneider <vschneid@...hat.com>
> > > >
> > > > Once a cfs_rq gets throttled, for all tasks belonging to this cfs_rq,
> > > > add a task work to them so that when those tasks return to user, the
> > > > actual throttle/dequeue can happen.
> > > >
> > > > Note that since the throttle/dequeue now always happens on a task basis
> > > > when the task returns to user, check_cfs_rq_runtime() no longer needs to
> > > > return a value for pick_task_fair() to act on, so check_cfs_rq_runtime()
> > > > is changed to not return a value.
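To make the mechanism concrete, a rough sketch of the idea (the helper and
field names below are illustrative only, not necessarily what the patch uses;
task_work_add()/TWA_RESUME are the regular task-work API):

	/* On throttle: don't dequeue now; queue a piece of work on each
	 * task so the real dequeue happens when that task returns to user. */
	static void throttle_one_task(struct task_struct *p)
	{
		/* p->sched_throttle_work: a hypothetical callback_head */
		task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
	}

	/* Runs in task context on return to user: the actual
	 * throttle/dequeue of this one task happens here. */
	static void throttle_task_work(struct callback_head *work)
	{
		/* dequeue the task from its (throttled) cfs_rq */
	}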
> > >
> > > Previously with the per-cfs_rq throttling, we use update_curr() -> put() path
> > > to throttle the cfs_rq and dequeue it from the cfs_rq tree.
> > >
> > > Now with your per-task throttling, maybe things can become simpler: we
> > > can just call throttle_cfs_rq() (on the cfs_rq subtree) at curr
> > > accounting time to mark them throttled.
> >
> > Do I understand correctly that now in throttle_cfs_rq(), we just mark
> > this hierarchy as throttled, but do not add any throttle work to the
> > tasks in this hierarchy, and instead leave adding the throttle work to
> > pick time?
>
> Right, we can move throttle_cfs_rq() forward to the curr accounting time,
> which just marks the hierarchy throttled.
While preparing the next version, I found that if I move
throttle_cfs_rq() to accounting time, e.g. into __account_cfs_rq_runtime(),
then the following can happen on the unthrottle path:
unthrottle_cfs_rq() -> enqueue_task_fair() -> update_curr() ->
account_cfs_rq_runtime() -> throttle_cfs_rq()...
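Annotated, the problematic chain looks like this (sketch only, not exact
code):

	unthrottle_cfs_rq(cfs_rq)
	  enqueue_task_fair(rq, p, flags)     /* putting tasks back */
	    update_curr(cfs_rq)               /* cfs_rq->curr is non-NULL:
	                                         a task woke up here while
	                                         the cfs_rq was throttled */
	      account_cfs_rq_runtime(cfs_rq, delta_exec)
	        throttle_cfs_rq(cfs_rq)       /* re-throttles the cfs_rq
	                                         we are unthrottling */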
Initially I was confused about why update_curr() can see a non-NULL curr
while this cfs_rq is being unthrottled, but then I realized that in this
task-based throttling model, it is possible for some task to wake up in
the throttled cfs_rq and set cfs_rq->curr before the cfs_rq gets
unthrottled.
So I suppose I'll keep the existing way of marking a cfs_rq as throttled,
i.e. calling check_cfs_rq_runtime() in the following two places:
- in pick_task_fair(), so that the to-be-picked cfs_rq can be marked for
throttle;
- in put_prev_entity(), for the prev runnable task's cfs_rq.
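In sketch form (simplified from the mainline structure, not the exact patch
code), the two call sites would look roughly like:

	static struct task_struct *pick_task_fair(struct rq *rq)
	{
		...
		/* mark the to-be-picked cfs_rq (and subtree) throttled */
		check_cfs_rq_runtime(cfs_rq);
		...
	}

	static void put_prev_entity(struct cfs_rq *cfs_rq,
				    struct sched_entity *prev)
	{
		if (prev->on_rq) {
			update_curr(cfs_rq);
			/* mark prev's cfs_rq throttled if out of runtime */
			check_cfs_rq_runtime(cfs_rq);
		}
		...
	}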
> And move setup_task_work() afterward to pick task time, which makes that task
> dequeue itself when it returns to user (ret2user).
No problem for this part as far as my test goes :-)
Thanks,
Aaron
> >
> > > Then if we pick a task from a throttled cfs_rq subtree, we can set up the
> > > task work for it, so we don't bother with the delayed_dequeue task case
> > > that Prateek mentioned.
> >
> > If we add a check point at pick time, maybe we can also avoid the check
> > at enqueue time. One thing I'm wondering about: a task may be picked
> > multiple times with only a single enqueue, so if we do the check at pick
> > time, could the overhead be larger?
>
> As Prateek already mentioned, this check cost is negligible.
>
> >
> > > WDYT?
> >
> > Thanks for your suggestion. I'll try this approach and see how it turns
> > out.