Message-ID: <08d75be1-e8e4-434e-a9d8-6a4503043872@linux.dev>
Date: Mon, 31 Mar 2025 17:14:50 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Aaron Lu <ziqianlu@...edance.com>
Cc: Valentin Schneider <vschneid@...hat.com>, Ben Segall
<bsegall@...gle.com>, K Prateek Nayak <kprateek.nayak@....com>,
Peter Zijlstra <peterz@...radead.org>, Josh Don <joshdon@...gle.com>,
Ingo Molnar <mingo@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
throttle
On 2025/3/31 14:42, Aaron Lu wrote:
> Hi Chengming,
>
> On Fri, Mar 14, 2025 at 07:07:10PM +0800, Chengming Zhou wrote:
>> On 2025/3/14 17:42, Aaron Lu wrote:
>>> On Fri, Mar 14, 2025 at 04:39:41PM +0800, Chengming Zhou wrote:
>>>> On 2025/3/13 15:21, Aaron Lu wrote:
>>>>> From: Valentin Schneider <vschneid@...hat.com>
>>>>>
>>>>> Once a cfs_rq gets throttled, for all tasks belonging to this cfs_rq,
>>>>> add a task work to them so that when those tasks return to user, the
>>>>> actual throttle/dequeue can happen.
>>>>>
>>>>> Note that since the throttle/dequeue now always happens on a per-task
>>>>> basis when the task returns to user, it is no longer necessary for
>>>>> check_cfs_rq_runtime() to return a value for pick_task_fair() to act
>>>>> on, so check_cfs_rq_runtime() is changed to not return a value.
>>>>
>>>> Previously, with the per-cfs_rq throttling, we used the update_curr()
>>>> -> put() path to throttle the cfs_rq and dequeue it from the cfs_rq
>>>> tree.
>>>>
>>>> Now with your per-task throttling, maybe things can become simpler:
>>>> we can just throttle_cfs_rq() (the cfs_rq subtree) at curr accounting
>>>> time to mark them throttled.
>>>
>>> Do I understand correctly that now in throttle_cfs_rq(), we just mark
>>> this hierarchy as throttled, but do not add any throttle work to the
>>> tasks in this hierarchy, and leave adding the throttle work to pick
>>> time?
>>
>> Right, we can move throttle_cfs_rq() forward to curr accounting time,
>> which just marks them throttled.
>
> While preparing the next version, I found that if I move
> throttle_cfs_rq() to accounting time, like in __account_cfs_rq_runtime(),
> then on the unthrottle path the following can happen:
> unthrottle_cfs_rq() -> enqueue_task_fair() -> update_curr() ->
> account_cfs_rq_runtime() -> throttle_cfs_rq()...
Ah, right, then it's best to leave throttle_cfs_rq() where it is.
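(Side note for the archive: below is a heavily abridged sketch of that
hypothetical accounting-time variant, just to show where it would go
wrong. This is not the posted patch; only __account_cfs_rq_runtime(),
assign_cfs_rq_runtime() and throttle_cfs_rq() are existing symbols, the
body is simplified:)

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
        cfs_rq->runtime_remaining -= delta_exec;

        if (likely(cfs_rq->runtime_remaining > 0))
                return;

        /*
         * Hypothetical: throttle right here instead of only marking at
         * pick/put time. Problem: update_curr() also runs from
         * unthrottle_cfs_rq() -> enqueue_task_fair(), and cfs_rq->curr
         * can already be set by a task that woke up while throttled, so
         * this would throttle the cfs_rq in the middle of its own
         * unthrottle.
         */
        if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
                throttle_cfs_rq(cfs_rq);
}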
>
> Initially I was confused why update_curr() could notice a non-NULL curr
> while this cfs_rq is being unthrottled, but then I realized that in this
> task based throttling model, it is possible some task woke up in that
> throttled cfs_rq and got cfs_rq->curr set, and only then does the cfs_rq
> get unthrottled.
>
> So I suppose I'll keep the existing way of marking a cfs_rq as
> throttled by calling check_cfs_rq_runtime() in the following two places:
> - in pick_task_fair(), so that the to-be-picked cfs_rq can be marked for
> throttle;
> - in put_prev_entity(), for the prev runnable task's cfs_rq.
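(For reference, a rough sketch of the pick_task_fair() side of that plan,
heavily simplified and not the actual series, assuming
check_cfs_rq_runtime() has been made void as described in the changelog:)

        /* inner loop of pick_task_fair(), one cfs_rq level at a time */
        do {
                /* might not have done put_prev_entity() yet */
                if (cfs_rq->curr && cfs_rq->curr->on_rq)
                        update_curr(cfs_rq);

                /*
                 * Only mark this level as throttled here; the picked
                 * task is dequeued later, via task_work, when it
                 * returns to user.
                 */
                check_cfs_rq_runtime(cfs_rq);

                se = pick_next_entity(rq, cfs_rq);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);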
>
>> And move setup_task_work() afterward to the pick task time, which makes
>> that task dequeue at ret2user.
>
> No problem for this part as far as my test goes :-)
Good to hear.
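
(And for completeness, roughly what the pick-time setup could look like
using the existing task_work API; the sched_throttle_work field and
throttle_cfs_rq_work() callback are only illustrative names here, not
taken from the series:)

static void task_throttle_setup_work(struct task_struct *p)
{
        /*
         * Queue a task_work so the real dequeue happens when the task
         * returns to user space; p->sched_throttle_work would be a
         * callback_head embedded in task_struct.
         */
        init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
        task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}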
Thanks!
>
> Thanks,
> Aaron
>
>>>
>>>> Then if we pick a task from a throttled cfs_rq subtree, we can set up
>>>> task work for it, so we don't bother with the delayed_dequeue task case
>>>> that Prateek mentioned.
>>>
>>> If we add a check at pick time, maybe we can also avoid the check at
>>> enqueue time. One thing I'm thinking is that a task may be picked
>>> multiple times with only a single enqueue, so if we do the check at
>>> pick, the overhead can be larger?
>>
>> As Prateek already mentioned, this check cost is negligible.
>>
>>>
>>>> WDYT?
>>>
>>> Thanks for your suggestion. I'll try this approach and see how it
>>> turns out.