linux-kernel - Re: [RFC PATCH] sched: fair: reset task_group.load


Message-ID: <d8507f86-2458-4b01-a774-5102473e657e@oracle.com>
Date: Fri, 15 Dec 2023 19:59:52 +1000
From: Imran Khan <imran.f.khan@...cle.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched: fair: reset task_group.load_avg when there are
 no running tasks.

Hello Vincent,
Thanks a lot for having a look and getting back.

On 15/12/2023 7:11 pm, Vincent Guittot wrote:
> On Fri, 15 Dec 2023 at 06:27, Imran Khan <imran.f.khan@...cle.com> wrote:
>>
>> It has been found that sometimes a task_group has some residual
>> load_avg even though the load average at each of its owned queues,
>> i.e. task_group.cfs_rq[cpu].avg.load_avg and
>> task_group.cfs_rq[cpu].tg_load_avg_contrib, has been 0 for a long time.
>> Under this scenario if another task starts running in this task_group,
>> it does not get proper time share on CPU since pre-existing
>> load average of task group inversely impacts the new task's CPU share
>> on each CPU.
>>
>> This change looks for the condition when a task_group has no running
>> tasks and sets the task_group's load average to 0 in such cases, so
>> that tasks that run in future under this task_group get the CPU time
>> in accordance with the current load.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@...cle.com>
>> ---
>>
> 
> [...]
> 
>>
>> 4. Now move systemd-udevd to one of these test groups, say test_group_1, and
>> perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the
>> host side.
> 
> Could it be the root cause of your problem ?
> 
> The cfs_rq->tg_load_avg_contrib of the 120 CPUs that have been plugged
> then unplugged,  have not been correctly removed from tg->load_avg. If
> the cfs_rq->tg_load_avg_contrib of the 4 remaining CPUs is 0 then
> tg->load_avg should be 0 too.
> 
Agreed, and this was my understanding as well. The issue only happens
with a large number of CPUs. For example, if I go from 4 to 8 and back
to 4, the issue does not happen, and even if it does, the residual load
avg is very small.
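
To make the accounting we are discussing concrete, here is a rough
userspace sketch. It is only a toy model, not kernel code: every toy_*
name is made up for illustration, the per-CPU load of 1024 is an
arbitrary value, and the share formula is a simplified approximation of
what the scheduler does (clamping and the runnable-weight details are
left out). It assumes 120 hot-added CPUs each contribute some load and
are then offlined, either with or without their contribution being
subtracted from the group sum.

```c
#include <assert.h>

#define TOY_NR_CPUS   124  /* size after scale up, as in the test above */
#define TOY_BASE_CPUS 4    /* CPUs that stay online throughout */

/* Toy stand-ins for the kernel structures; not the real definitions. */
struct toy_cfs_rq {
	long load_avg;            /* models cfs_rq->avg.load_avg */
	long tg_load_avg_contrib; /* what this CPU last added to the group sum */
};

struct toy_task_group {
	long load_avg;            /* models tg->load_avg */
	struct toy_cfs_rq cfs_rq[TOY_NR_CPUS];
};

/* Fold the change in one CPU's load into the group-wide sum. */
static void toy_update_tg_load_avg(struct toy_task_group *tg, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	struct toy_cfs_rq *rq = &tg->cfs_rq[cpu];
	long delta = rq->load_avg - rq->tg_load_avg_contrib;

	tg->load_avg += delta;
	rq->tg_load_avg_contrib = rq->load_avg;
}

/*
 * Offline one CPU. With remove_contrib set, its contribution is taken
 * out of the group sum. Without it, the cfs_rq is simply never updated
 * again, so its last contribution stays in tg->load_avg forever.
 */
static void toy_offline_cpu(struct toy_task_group *tg, int cpu,
			    int remove_contrib)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	struct toy_cfs_rq *rq = &tg->cfs_rq[cpu];

	if (remove_contrib) {
		tg->load_avg -= rq->tg_load_avg_contrib;
		rq->tg_load_avg_contrib = 0;
	}
	rq->load_avg = 0;
}

/* Scale up to 124 CPUs, load the new ones, scale back down to 4. */
static long toy_hotplug_cycle(int remove_contrib)
{
	struct toy_task_group tg = { 0 };
	int cpu;

	for (cpu = TOY_BASE_CPUS; cpu < TOY_NR_CPUS; cpu++) {
		tg.cfs_rq[cpu].load_avg = 1024; /* arbitrary per-CPU load */
		toy_update_tg_load_avg(&tg, cpu);
	}
	for (cpu = TOY_BASE_CPUS; cpu < TOY_NR_CPUS; cpu++)
		toy_offline_cpu(&tg, cpu, remove_contrib);

	return tg.load_avg; /* residual seen by the 4 remaining CPUs */
}

/*
 * Simplified per-CPU group weight: the group's shares scaled by this
 * CPU's load relative to the whole group's load (residual included).
 */
static long toy_group_share(long tg_shares, long cpu_load, long residual)
{
	long tg_weight = residual + cpu_load;

	return tg_weight ? tg_shares * cpu_load / tg_weight : tg_shares;
}
```

In this model toy_hotplug_cycle(1) leaves no residual, while
toy_hotplug_cycle(0) leaves 120 * 1024 behind even though every
remaining CPU contributes 0. Feeding that residual into
toy_group_share() shrinks the weight of a newly started task's group
entity on its CPU by roughly two orders of magnitude, which matches the
kind of starvation described in the patch description.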

> Could you track that the cfs_rq->tg_load_avg_contrib is correctly
> removed from tg->load_avg when you unplug the CPUs ? I can easily
> imagine that the rate limit can skip some update of tg->load_avg
> while offlining the cpu
> 

I will try to trace it, but just so you know, this issue is happening on
other kernel versions (which don't have the rate limit feature) as well.
I started with v4.14.x but have tested and found it on v5.4.x and v5.15.x
as well.

Thanks,
Imran