Message-ID: <a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com>
Date: Thu, 5 Sep 2024 16:07:01 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Hongyan Xia <hongyan.xia2@....com>, Luis Machado <luis.machado@....com>,
Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com, wuyun.abel@...edance.com, youssefesmat@...omium.org,
tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 10/24] sched/uclamp: Handle delayed dequeue
On 05/09/2024 15:33, Vincent Guittot wrote:
> On Thu, 5 Sept 2024 at 15:02, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>
>> On 29/08/2024 17:42, Hongyan Xia wrote:
>>> On 22/08/2024 15:58, Vincent Guittot wrote:
>>>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>>>> <vincent.guittot@...aro.org> wrote:
>>>>>
>>>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@....com> wrote:
>>>>>>
>>>>>> Vincent,
>>>>>>
>>>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@....com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@....com>
[...]
>>> After staring at the code even more, I think the situation is worse.
>>>
>>> First, uclamp probably also belongs with the stats (h_nr_running,
>>> util_est, runnable_avg, load_avg) that do not follow delayed dequeue
>>> and therefore need the same special handling. The current way of
>>> handling uclamp in core.c misses the frequency update, as I commented
>>> before.
>>>
>>> Second, there is also an out-of-sync issue in update_load_avg(). We only
>>> update the task-level se on delayed dequeue and requeue, but we return
>>> early and the upper levels are skipped entirely, as if the delayed
>>> task were still on the rq. This de-sync is wrong.
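As an aside, here is a minimal toy model of the de-sync described
above (plain C with invented names, not the actual fair.c code): the
delayed path updates the task-level signal and returns before touching
the parent level, so the parent no longer matches the sum of its
children.

#include <stdio.h>

/* Toy model: one group level whose signal should equal the sum of its
 * children's signals. All names here are made up for the sketch. */
struct toy_se { int signal; };
struct toy_cfs_rq { int signal; };

static void toy_update(struct toy_cfs_rq *parent, struct toy_se *se,
		       int new_signal, int delayed)
{
	int delta = new_signal - se->signal;

	se->signal = new_signal;	/* task-level se gets updated ... */
	if (delayed)
		return;			/* ... but the upper level is skipped */

	parent->signal += delta;	/* normal path keeps the sum intact */
}

int main(void)
{
	struct toy_cfs_rq cfs_rq = { .signal = 10 };
	struct toy_se se = { .signal = 10 };

	/* The delayed task's signal decays from 10 to 4; the parent
	 * never hears about it. */
	toy_update(&cfs_rq, &se, 4, 1);
	printf("cfs_rq: %d, se sum: %d\n", cfs_rq.signal, se.signal);
	return 0;	/* prints "cfs_rq: 10, se sum: 4": out of sync */
}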
>>
>> I had a look at the util_est issue.
>>
>> This keeps rq->cfs.avg.util_avg sane for me with
>> SCHED_FEAT(DELAY_DEQUEUE, true):
>>
>> -->8--
>>
>> From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
>> From: Dietmar Eggemann <dietmar.eggemann@....com>
>> Date: Thu, 5 Sep 2024 00:05:23 +0200
>> Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>>
>> Remove delayed tasks from util_est even though they are runnable.
>
> Unfortunately, this is not only about util_est.
>
> cfs_rq's runnable_avg is also wrong, because we normally have:
>
>   cfs_rq's runnable_avg == \Sum se's runnable_avg
>
> but cfs_rq's runnable_avg is driven by cfs_rq's h_nr_running, and
> delayed entities are still accounted in h_nr_running.
Yes, I agree.
se's runnable_avg should be fine already, since se_runnable() does:

    if (se->sched_delayed)
        return false;

But then, like you said, __update_load_avg_cfs_rq() needs a correct
cfs_rq->h_nr_running.
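A tiny standalone model of why the invariant above breaks (plain C,
hypothetical names): the cfs_rq-level input is derived from
h_nr_running, which still counts delayed entities, while the
per-entity se_runnable() filter excludes them.

#include <stdio.h>

struct toy_se { int sched_delayed; };

/* Mirrors the per-entity filter quoted above. */
static int toy_se_runnable(const struct toy_se *se)
{
	return !se->sched_delayed;
}

int main(void)
{
	struct toy_se tasks[] = { {0}, {0}, {1} };	/* one delayed task */
	int h_nr_running = 3;	/* the delayed entity is still counted here */
	int sum = 0;

	for (int i = 0; i < 3; i++)
		sum += toy_se_runnable(&tasks[i]);

	/* cfs_rq's runnable input (3) != \Sum se inputs (2). */
	printf("cfs_rq input: %d, sum of se inputs: %d\n", h_nr_running, sum);
	return 0;
}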
And I guess we need something like:

    se_on_rq()
        if (se->sched_delayed)
            return false;

for __update_load_avg_se():

    - if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
    + if (___update_load_sum(now, &se->avg, se_on_rq(se), se_runnable(se),
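Spelled out, that helper could look like the sketch below. Name, return
type, and placement are assumptions on my side, modelled on the shape
se_runnable() has after the delayed-dequeue series:

/*
 * Sketch only: treat a delayed-dequeue entity as off the rq for PELT's
 * "load" input, so its load_sum decays like that of a dequeued task,
 * mirroring what se_runnable() already does for runnable_sum.
 */
static inline long se_on_rq(struct sched_entity *se)
{
	if (se->sched_delayed)
		return 0;

	return !!se->on_rq;
}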
My hope was that we could fix util_est independently, since it drives
CPU frequency, whereas PELT's load_avg and runnable_avg are "only" used
for load balancing. But I agree, they have to be fixed as well.
> That also means that cfs_rq's h_nr_running is not accurate anymore,
> because it includes delayed-dequeue entities.
+1
> and cfs_rq's load_avg is kept artificially high, which biases
> load_balance and cgroup shares.
+1
>> Exclude delayed tasks which are (a) migrating between rq's or (b) in a
>> SAVE/RESTORE dequeue/enqueue.
I just realized that this fixes the unbalanced util_est_dequeue()/
util_est_enqueue() calls, so we see neither the underflow Hongyan
described nor a massive rq->cfs util_est due to missing
util_est_dequeue() calls. But delayed tasks are still part of rq->cfs
util_est rather than excluded from it. Let me fix that.
>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@....com>
>> ---
>> kernel/sched/fair.c | 16 +++++++++-------
>> 1 file changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1e693ca8ebd6..5c32cc26d6c2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> int rq_h_nr_running = rq->cfs.h_nr_running;
>> u64 slice = 0;
>>
>> - if (flags & ENQUEUE_DELAYED) {
>> - requeue_delayed_entity(se);
>> - return;
>> - }
>> -
>> /*
>> * The code below (indirectly) updates schedutil which looks at
>> * the cfs_rq utilization to select a frequency.
>> * Let's add the task's estimated utilization to the cfs_rq's
>> * estimated utilization, before we update schedutil.
>> */
>> - util_est_enqueue(&rq->cfs, p);
>> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
>> + util_est_enqueue(&rq->cfs, p);
>> +
>> + if (flags & ENQUEUE_DELAYED) {
>> + requeue_delayed_entity(se);
>> + return;
>> + }
>>
>> /*
>> * If in_iowait is set, the code below may not trigger any cpufreq
>> @@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>> */
>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> {
>> - util_est_dequeue(&rq->cfs, p);
>> + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>> + util_est_dequeue(&rq->cfs, p);
>>
>> if (dequeue_entities(rq, &p->se, flags) < 0) {
>> if (!rq->cfs.h_nr_running)
>> --
>> 2.34.1
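As a readability aside: the two guards added above share one shape, and
could be factored into a helper along these lines (hypothetical, not
part of the patch), taking ENQUEUE_RESTORE or DEQUEUE_SAVE as the
respective save/restore flag:

/*
 * Hypothetical helper, not in the patch above: skip the util_est
 * enqueue/dequeue accounting for a delayed task that is migrating
 * between rqs or sitting inside a SAVE/RESTORE dequeue/enqueue pair.
 */
static inline bool util_est_skip(struct task_struct *p, int flags,
				 int save_restore_flag)
{
	return p->se.sched_delayed &&
	       (task_on_rq_migrating(p) || (flags & save_restore_flag));
}

The enqueue path would then use !util_est_skip(p, flags, ENQUEUE_RESTORE)
and the dequeue path !util_est_skip(p, flags, DEQUEUE_SAVE).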