Message-ID: <737edc45-5f56-4df5-b328-168980d744db@amd.com>
Date: Fri, 14 Mar 2025 14:30:52 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Aaron Lu <ziqianlu@...edance.com>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>, Josh Don <joshdon@...gle.com>, Ingo
Molnar <mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
<linux-kernel@...r.kernel.org>, Juri Lelli <juri.lelli@...hat.com>, Dietmar
Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Mel Gorman <mgorman@...e.de>, Chengming Zhou <chengming.zhou@...ux.dev>,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: Re: [RFC PATCH 2/7] sched/fair: Handle throttle path for task based
throttle
Hello Aaron,
On 3/14/2025 2:18 PM, Aaron Lu wrote:
>>>  static int tg_throttle_down(struct task_group *tg, void *data)
>>>  {
>>>  	struct rq *rq = data;
>>>  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> +	struct task_struct *p;
>>> +	struct rb_node *node;
>>> +
>>> +	cfs_rq->throttle_count++;
>>> +	if (cfs_rq->throttle_count > 1)
>>> +		return 0;
>>>
>>>  	/* group is entering throttled state, stop time */
>>> -	if (!cfs_rq->throttle_count) {
>>> -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>> -		list_del_leaf_cfs_rq(cfs_rq);
>>> +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>>
>> One concern here is that the PELT clock is seemingly frozen despite the
>> hierarchy being runnable. I've still not tracked down whether it'll
>> cause any problems once unthrottled and all throttled time is negated
>> from the pelt clock, but are there any concerns here?
>
> I chose to do it this way because:
> 1 I expect most of the time, if a task has to continue to run after its
> cfs_rq gets throttled, the time is relatively small so should not cause
> much impact. But I agree there can be times a task runs relatively long;
> 2 I think the original intent to freeze cfs_rq's pelt clock on throttle
> is so that on unthrottle, it can restore its load (without its load being
> decayed etc.). If I chose to not freeze its pelt clock on throttle
> because some task is still running in kernel mode, since some of this
> cfs_rq's tasks are throttled, its load can become smaller and this can
> impact its load on unthrottle.
>
> I think neither approach is perfect, so I chose the simple one for now
> :) Not sure if my thinking is correct though.
>
>> Maybe this can be done at dequeue when cfs_rq->nr_queued on a
>> throttled_hierarchy() reaches 0.
>
> Yes, this looks more consistent, maybe I should change to this approach.
I agree the time might be small in most cases but some syscalls with
enough contention in the system can take a while to exit to user mode.
I'm not sure what the correct approach is here either - should a
subtree's PELT be frozen when the last task dequeues or should we
freeze it for the whole hierarchy once the throttled cfs_rq dequeues?
I'll wait for other folks to chime in since they know these bits
better than me.
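
To make the difference between the two freeze points concrete, here is a
toy userspace model, not kernel code. Everything in it (toy_cfs_rq,
toy_throttle(), toy_dequeue(), toy_unthrottle(), the freeze_policy enum)
is a hypothetical name made up for illustration, and it models a single
cfs_rq rather than a hierarchy. It only counts how long the clock would
stay frozen under each policy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names are kernel API. */
enum freeze_policy {
	FREEZE_AT_THROTTLE,	/* freeze as soon as the group is throttled */
	FREEZE_AT_LAST_DEQUEUE,	/* freeze once nr_queued drops to 0 */
};

struct toy_cfs_rq {
	unsigned int nr_queued;	/* tasks still queued on the group */
	bool throttled;
	bool pelt_frozen;
	uint64_t frozen_since;	/* clock value when the freeze began */
	uint64_t frozen_total;	/* accumulated frozen time */
};

static void toy_freeze(struct toy_cfs_rq *cfs_rq, uint64_t now)
{
	if (cfs_rq->pelt_frozen)
		return;
	cfs_rq->pelt_frozen = true;
	cfs_rq->frozen_since = now;
}

static void toy_throttle(struct toy_cfs_rq *cfs_rq,
			 enum freeze_policy policy, uint64_t now)
{
	cfs_rq->throttled = true;
	/* With nothing queued, both policies freeze right away. */
	if (policy == FREEZE_AT_THROTTLE || cfs_rq->nr_queued == 0)
		toy_freeze(cfs_rq, now);
}

static void toy_dequeue(struct toy_cfs_rq *cfs_rq,
			enum freeze_policy policy, uint64_t now)
{
	assert(cfs_rq->nr_queued > 0);
	cfs_rq->nr_queued--;
	if (cfs_rq->throttled && policy == FREEZE_AT_LAST_DEQUEUE &&
	    cfs_rq->nr_queued == 0)
		toy_freeze(cfs_rq, now);
}

static void toy_unthrottle(struct toy_cfs_rq *cfs_rq, uint64_t now)
{
	cfs_rq->throttled = false;
	if (cfs_rq->pelt_frozen) {
		cfs_rq->frozen_total += now - cfs_rq->frozen_since;
		cfs_rq->pelt_frozen = false;
	}
}

/*
 * Scenario: throttled at t=100 with one task still in the kernel, the
 * task is dequeued at t=130, the group is unthrottled at t=200.
 */
static uint64_t toy_frozen_time(enum freeze_policy policy)
{
	struct toy_cfs_rq cfs_rq = { .nr_queued = 1 };

	toy_throttle(&cfs_rq, policy, 100);
	toy_dequeue(&cfs_rq, policy, 130);
	toy_unthrottle(&cfs_rq, 200);
	return cfs_rq.frozen_total;
}
```

In this scenario toy_frozen_time(FREEZE_AT_THROTTLE) covers the whole
throttled window, while toy_frozen_time(FREEZE_AT_LAST_DEQUEUE) excludes
the 30 time units during which the task was still running in the kernel.
The longer a syscall takes to exit, the larger that gap becomes.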
>
>>> +	list_del_leaf_cfs_rq(cfs_rq);
>>>
>>> -		SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>> -		if (cfs_rq->nr_queued)
>>> -			cfs_rq->throttled_clock_self = rq_clock(rq);
>>> +	SCHED_WARN_ON(cfs_rq->throttled_clock_self);
>>> +	if (cfs_rq->nr_queued)
>>> +		cfs_rq->throttled_clock_self = rq_clock(rq);
>>> +
>>> +	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
>>> +	/*
>>> +	 * rq_lock is held, current is (obviously) executing this in kernelspace.
>>> +	 *
>>> +	 * All other tasks enqueued on this rq have their saved PC at the
>>> +	 * context switch, so they will go through the kernel before returning
>>> +	 * to userspace. Thus, there are no tasks-in-userspace to handle, just
>>> +	 * install the task_work on all of them.
>>> +	 */
>>> +	node = rb_first(&cfs_rq->tasks_timeline.rb_root);
>>> +	while (node) {
>>> +		struct sched_entity *se = __node_2_se(node);
>>> +
>>> +		if (!entity_is_task(se))
>>> +			goto next;
>>> +
>>> +		p = task_of(se);
>>> +		task_throttle_setup_work(p);
>>> +next:
>>> +		node = rb_next(node);
>>> +	}
>>> +
>>> +	/* curr is not in the timeline tree */
>>> +	if (cfs_rq->curr && entity_is_task(cfs_rq->curr)) {
>>
>> I believe we can reach here from pick_next_task_fair() ->
>> check_cfs_rq_runtime() -> throttle_cfs_rq() in which case cfs_rq->curr
>> will still be set despite the task being blocked since put_prev_entity()
>> has not been called yet.
>>
>> I believe there should be a check for task_on_rq_queued() here for the
>> current task.
>
> Ah right, I'll see how to fix this.
It may not be necessary with the recent suggestion from Chengming where
you can just add the task work if the task was picked on a throttled
hierarchy.
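
For reference, the check suggested in the quoted review could look
roughly like the toy userspace sketch below. It is not kernel code:
toy_task, toy_cfs_rq and toy_setup_work_for_curr() are hypothetical
names, the on_rq_queued flag stands in for task_on_rq_queued(p), and
setting throttle_work stands in for whatever task_throttle_setup_work(p)
does in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names are kernel API. */
struct toy_task {
	bool on_rq_queued;	/* stand-in for task_on_rq_queued(p) */
	bool throttle_work;	/* set where the patch would install task work */
};

struct toy_cfs_rq {
	struct toy_task *curr;	/* may still point at a task that blocked */
};

static void toy_setup_work_for_curr(struct toy_cfs_rq *cfs_rq)
{
	struct toy_task *p;

	assert(cfs_rq != NULL);
	p = cfs_rq->curr;
	if (!p)
		return;
	/*
	 * curr can still be set for a task that has already blocked if
	 * put_prev_entity() has not run yet, so skip tasks that are no
	 * longer queued.
	 */
	if (!p->on_rq_queued)
		return;
	p->throttle_work = true;
}

/* Helper for experimenting: does curr get the work installed? */
static bool toy_curr_gets_work(bool queued)
{
	struct toy_task task = { .on_rq_queued = queued };
	struct toy_cfs_rq cfs_rq = { .curr = &task };

	toy_setup_work_for_curr(&cfs_rq);
	return task.throttle_work;
}
```

Here toy_curr_gets_work(true) reports that the work was installed and
toy_curr_gets_work(false) reports that the blocked task was skipped.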
--
Thanks and Regards,
Prateek
>
> Thanks,
> Aaron
>
>>> +		p = task_of(cfs_rq->curr);
>>> +		task_throttle_setup_work(p);
>>>  	}
>>> -	cfs_rq->throttle_count++;
>>>
>>>  	return 0;
>>>  }
>>>
>>
>> [..snip..]
>>
>> --
>> Thanks and Regards,
>> Prateek
>>