Message-ID: <5259772c-527b-4ab2-9606-2d1f0d93862a@redhat.com>
Date: Fri, 8 Nov 2024 22:30:41 -0500
From: Waiman Long <llong@...hat.com>
To: Juri Lelli <juri.lelli@...hat.com>,
Joel Fernandes <joel@...lfernandes.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Suleiman Souhlal <suleiman@...gle.com>, Aashish Sharma <shraash@...gle.com>,
Shin Kawamura <kawasin@...gle.com>,
Vineeth Remanan Pillai <vineeth@...byteword.org>
Subject: Re: [PATCH] dl_server: Reset DL server params when rd changes

On 11/7/24 11:40 PM, Waiman Long wrote:
> On 11/6/24 1:05 PM, Waiman Long wrote:
>> On 11/6/24 11:08 AM, Juri Lelli wrote:
>>> On 04/11/24 17:41, Joel Fernandes wrote:
>>>> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote:
>>> ...
>>>
>>>>> I added a printk in __dl_server_attach_root which is called after the
>>>>> dynamic rd is built to transfer bandwidth to it.
>>>>>
>>>>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
>>>>> server interface"), do you have this change in your backport?
>>>> You nailed it! Our 5.15 backport appears to be slightly older and is
>>>> missing this from topology.c as you mentioned. Thanks for clarifying!
>>>>
>>>>
>>>>         /*
>>>>          * Because the rq is not a task, dl_add_task_root_domain() did
>>>>          * not move the fair server bw to the rd if it already started.
>>>>          * Add it now.
>>>>          */
>>>>         if (rq->fair_server.dl_server)
>>>>                 __dl_server_attach_root(&rq->fair_server, rq);
>>>>
>>>>>> So if rd changes during boot initialization, the correct dl_bw has
>>>>>> to be updated AFAICS. Also if cpusets are used, the rd for a CPU
>>>>>> may change.
>>>>> cpuset changes are something that I still need to double check.
>>>>> Will do.
>>>>>
>>>> Sounds good, that would be good to verify.
>>> So, I played a little bit with it and came up with a simple set of ops
>>> that points out an issue (default Fedora Server install):
>>>
>>> echo Y >/sys/kernel/debug/sched/verbose
>>>
>>> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
>>>
>>> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus
>>> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive
>>> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition
>>>
>>> The domains are rebuilt correctly, but we end up with a null total_bw.
>>>
>>> The conditional call above correctly takes care of adding back the
>>> per-rq dl_server bandwidth when we pass from the single domain to the
>>> 2 exclusive ones, but I noticed that we go through
>>> partition_sched_domains_locked() twice for a single write of 'root',
>>> and the second pass, since it's not actually destroying/rebuilding
>>> anything, resets total_bw w/o adding the dl_server contribution back.
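>>>
>>> If I read the code right, the clearing happens in the "this domain
>>> survives" path of partition_sched_domains_locked(); roughly (sketch
>>> from a v6.12-ish tree, details may differ):
>>>
>>>         /* Destroy deleted domains: */
>>>         for (i = 0; i < ndoms_cur; i++) {
>>>                 for (j = 0; j < n && !new_topology; j++) {
>>>                         if (cpumask_equal(doms_cur[i], doms_new[j]) &&
>>>                             dattrs_equal(dattr_cur, i, dattr_new, j)) {
>>>                                 struct root_domain *rd;
>>>
>>>                                 /*
>>>                                  * This domain won't be destroyed and as
>>>                                  * such its dl_bw->total_bw needs to be
>>>                                  * cleared.
>>>                                  */
>>>                                 rd = cpu_rq(cpumask_any(doms_cur[i]))->rd;
>>>                                 dl_clear_root_domain(rd);
>>>                                 goto match1;
>>>                         }
>>>                 }
>>>                 /* No match - a current sched domain not in new doms_new[] */
>>>                 detach_destroy_domains(doms_cur[i]);
>>> match1:
>>>                 ;
>>>         }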
>>>
>>> Now, I'm not completely sure why we need to go through
>>> partition_sched_domains_locked() twice, as we have (it also looked
>>> like a pattern from other call paths):
>>>
>>> update_prstate()
>>>   -> update_cpumasks_hier()
>>>        -> rebuild_sched_domains_locked()  <- right at the end
>>>   -> update_partition_sd_lb()
>>>        -> rebuild_sched_domains_locked()  <- right after the above call
>>>
>>> Removing the first call does indeed fix the issue and the domains look
>>> OK, but I'm pretty sure I'm missing all sorts of details and corner
>>> cases.
>>>
>>> Waiman (now Cc-ed), maybe you can help us understand why the two
>>> back-to-back calls are needed?
>>
>> Thanks for letting me know about this case.
>>
>> I am aware that rebuild_sched_domains_locked() can be called more than
>> once per operation. I have addressed the hotplug case, but it can
>> happen in some other corner cases as well. The problem with multiple
>> rebuild_sched_domains_locked() calls is that the intermediate ones may
>> run while the internal state is not yet consistent. I am going to work
>> on a fix for this issue by making sure that
>> rebuild_sched_domains_locked() is called only once.
>
> I am working on a set of cpuset patches to eliminate redundant
> rebuild_sched_domains_locked() calls. However, my cpuset test script
> fails after the change due to the presence of test cases where the
> only CPU in a 1-cpu partition is being offlined. So I sent out a
> sched/deadline patch [1] to work around this particular corner case.
>
> [1]
> https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u
>
> Apparently, the null total_bw bug caused by multiple
> rebuild_sched_domains_locked() calls masks this problem.
>
> Anyway, I should be able to post the cpuset patch series next week
> after further testing. Please review my sched/deadline patch to see if
> you are OK with this minor change.
I now have a patchset to enforce that rebuild_sched_domains_locked() is
called at most once per cpuset operation.
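
The general shape is a deferred-rebuild flag that is checked once at the
end of the operation. A minimal sketch (the names below are illustrative,
not necessarily what the actual patchset uses):

        /*
         * Hypothetical sketch of the deferred-rebuild pattern; not the
         * actual patch.
         */
        static bool force_sd_rebuild;

        /* Call sites that used to invoke rebuild_sched_domains_locked()
         * directly just set the flag instead.
         */
        static void cpuset_force_rebuild(void)
        {
                force_sd_rebuild = true;
        }

        /* Invoked once at the end of the cpuset operation, with
         * cpuset_mutex held.
         */
        static void cpuset_finish_rebuild(void)
        {
                if (force_sd_rebuild) {
                        force_sd_rebuild = false;
                        rebuild_sched_domains_locked();
                }
        }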
By adding some debug code to further study the null total_bw issue when
cpuset.cpus.partition is being changed, I found that eliminating the
redundant rebuild_sched_domains_locked() calls reduced the chance of
hitting a null total_bw but did not eliminate it. Running my cpuset test
script, I hit 250 cases of null total_bw with the v6.12-rc6 kernel. With
my new cpuset patchset applied, that drops to 120 cases.
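
The debug code boils down to a warning when a root domain comes out of a
rebuild with zero total_bw, along these lines (illustrative only; the
actual debug code may differ):

        /* Warn when a rebuilt root domain ends up with no DL bandwidth
         * accounted (illustrative debug sketch).
         */
        static void check_total_bw(struct root_domain *rd)
        {
                unsigned long flags;

                raw_spin_lock_irqsave(&rd->dl_bw.lock, flags);
                WARN_ONCE(!rd->dl_bw.total_bw, "null total_bw on rd %*pbl\n",
                          cpumask_pr_args(rd->span));
                raw_spin_unlock_irqrestore(&rd->dl_bw.lock, flags);
        }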
I will keep looking for the exact condition that triggers the null
total_bw.
Cheers,
Longman