linux-kernel - Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c49ef5fe-a909-43f1-b02f-a765ab9cedbf@arm.com>
Date: Thu, 5 Sep 2024 15:02:44 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Hongyan Xia <hongyan.xia2@....com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Luis Machado <luis.machado@....com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
 juri.lelli@...hat.com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
 kprateek.nayak@....com, wuyun.abel@...edance.com, youssefesmat@...omium.org,
 tglx@...utronix.de, efault@....de
Subject: Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

On 29/08/2024 17:42, Hongyan Xia wrote:
> On 22/08/2024 15:58, Vincent Guittot wrote:
>> On Thu, 22 Aug 2024 at 14:10, Vincent Guittot
>> <vincent.guittot@...aro.org> wrote:
>>>
>>> On Thu, 22 Aug 2024 at 14:08, Luis Machado <luis.machado@....com> wrote:
>>>>
>>>> Vincent,
>>>>
>>>> On 8/22/24 11:28, Luis Machado wrote:
>>>>> On 8/22/24 10:53, Vincent Guittot wrote:
>>>>>> On Thu, 22 Aug 2024 at 11:22, Luis Machado <luis.machado@....com>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 8/22/24 09:19, Vincent Guittot wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Wed, 21 Aug 2024 at 15:34, Hongyan Xia <hongyan.xia2@....com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> Sorry for bombarding this thread in the last couple of days.
>>>>>>>>> I'm seeing
>>>>>>>>> several issues in the latest tip/sched/core after these patches
>>>>>>>>> landed.
>>>>>>>>>
>>>>>>>>> What I'm now seeing seems to be an unbalanced call of util_est.
>>>>>>>>> First, I applied
>>>>>>>>
>>>>>>>> I also see a remaining util_est for idle rq because of an unbalance
>>>>>>>> call of util_est_enqueue|dequeue
>>>>>>>>
>>>>>>>
>>>>>>> I can confirm issues with the utilization values and frequencies
>>>>>>> being driven
>>>>>>> seemingly incorrectly, in particular for little cores.
>>>>>>>
>>>>>>> What I'm seeing with the stock series is high utilization values
>>>>>>> for some tasks
>>>>>>> and little cores having their frequencies maxed out for extended
>>>>>>> periods of
>>>>>>> time. Sometimes for 5+ or 10+ seconds, which is excessive as the
>>>>>>> cores are mostly
>>>>>>> idle. But whenever certain tasks get scheduled there, they have a
>>>>>>> very high util
>>>>>>> level and so the frequency is kept at max.
>>>>>>>
>>>>>>> As a consequence this drives up power usage.
>>>>>>>
>>>>>>> I gave Hongyan's draft fix a try and observed a much more
>>>>>>> reasonable behavior for
>>>>>>> the util numbers and frequencies being used for the little cores.
>>>>>>> With his fix,
>>>>>>> I can also see lower energy use for my specific benchmark.
>>>>>>
>>>>>> The main problem is that the util_est of a delayed dequeued task
>>>>>> remains on the rq and keeps the rq utilization high and as a result
>>>>>> the frequency higher than needed.
>>>>>>
>>>>>> The below seems to works for me and keep sync the enqueue/dequeue of
>>>>>> uti_test with the enqueue/dequeue of the task as if de dequeue was
>>>>>> not
>>>>>> delayed
>>>>>>
>>>>>> Another interest is that we will not try to migrate a delayed dequeue
>>>>>> sleeping task that doesn't actually impact the current load of the
>>>>>> cpu
>>>>>> and as a result will not help in the load balance. I haven't yet
>>>>>> fully
>>>>>> checked what would happen with hotplug
>>>>>
>>>>> Thanks. Those are good points. Let me go and try your patch.
>>>>
>>>> I gave your fix a try, but it seems to make things worse. It is
>>>> comparable
>>>> to the behavior we had before Peter added the uclamp imbalance fix,
>>>> so I
>>>> believe there is something incorrect there.
>>>
>>> we need to filter case where task are enqueued/dequeued several
>>> consecutive times. That's what I'm look now
>>
>> I just realize before that It's not only util_est but the h_nr_running
>> that keeps delayed tasks as well so all stats of the rq are biased:
>> h_nr_running, util_est, runnable avg and load_avg.
> 
> After staring at the code even more, I think the situation is worse.
> 
> First thing is that uclamp might also want to be part of these stats
> (h_nr_running, util_est, runnable_avg, load_avg) that do not follow
> delayed dequeue which needs to be specially handled in the same way. The
> current way of handling uclamp in core.c misses the frequency update,
> like I commented before.
> 
> Second, there is also an out-of-sync issue in update_load_avg(). We only
> update the task-level se in delayed dequeue and requeue, but we return
> early and the upper levels are completely skipped, as if the delayed
> task is still on rq. This de-sync is wrong.

I had a look at the util_est issue.

This keeps rq->cfs.avg.util_avg sane for me with
SCHED_FEAT(DELAY_DEQUEUE, true):

-->8--

>From 0d7e8d057f49a47e0f3f484ac7d41e047dccec38 Mon Sep 17 00:00:00 2001
From: Dietmar Eggemann <dietmar.eggemann@....com>
Date: Thu, 5 Sep 2024 00:05:23 +0200
Subject: [PATCH] kernel/sched: Fix util_est accounting for DELAY_DEQUEUE

Remove delayed tasks from util_est even they are runnable.

Exclude delayed task which are (a) migrating between rq's or (b) in a
SAVE/RESTORE dequeue/enqueue.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@....com>
---
 kernel/sched/fair.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1e693ca8ebd6..5c32cc26d6c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6948,18 +6948,19 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int rq_h_nr_running = rq->cfs.h_nr_running;
 	u64 slice = 0;
 
-	if (flags & ENQUEUE_DELAYED) {
-		requeue_delayed_entity(se);
-		return;
-	}
-
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
 	 * Let's add the task's estimated utilization to the cfs_rq's
 	 * estimated utilization, before we update schedutil.
 	 */
-	util_est_enqueue(&rq->cfs, p);
+	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE))))
+		util_est_enqueue(&rq->cfs, p);
+
+	if (flags & ENQUEUE_DELAYED) {
+		requeue_delayed_entity(se);
+		return;
+	}
 
 	/*
 	 * If in_iowait is set, the code below may not trigger any cpufreq
@@ -7177,7 +7178,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
-	util_est_dequeue(&rq->cfs, p);
+	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
+		util_est_dequeue(&rq->cfs, p);
 
 	if (dequeue_entities(rq, &p->se, flags) < 0) {
 		if (!rq->cfs.h_nr_running)
-- 
2.34.1