Message-ID: <3833509d-e0c1-4fc4-874f-5a10fe3f1633@redhat.com>
Date: Thu, 7 Nov 2024 23:40:29 -0500
From: Waiman Long <llong@...hat.com>
To: Juri Lelli <juri.lelli@...hat.com>,
Joel Fernandes <joel@...lfernandes.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Suleiman Souhlal <suleiman@...gle.com>, Aashish Sharma <shraash@...gle.com>,
Shin Kawamura <kawasin@...gle.com>,
Vineeth Remanan Pillai <vineeth@...byteword.org>
Subject: Re: [PATCH] dl_server: Reset DL server params when rd changes
On 11/6/24 1:05 PM, Waiman Long wrote:
> On 11/6/24 11:08 AM, Juri Lelli wrote:
>> On 04/11/24 17:41, Joel Fernandes wrote:
>>> On Mon, Nov 04, 2024 at 11:54:36AM +0100, Juri Lelli wrote:
>> ...
>>
>>>> I added a printk in __dl_server_attach_root which is called after the
>>>> dynamic rd is built to transfer bandwidth to it.
>>>>
>>>> __dl_server_attach_root came with d741f297bceaf ("sched/fair: Fair
>>>> server interface"), do you have this change in your backport?
>>> You nailed it! Our 5.15 backport appears to be slightly older and is missing
>>> this from topology.c as you mentioned. Thanks for clarifying!
>>>
>>>
>>>         /*
>>>          * Because the rq is not a task, dl_add_task_root_domain() did not
>>>          * move the fair server bw to the rd if it already started.
>>>          * Add it now.
>>>          */
>>>         if (rq->fair_server.dl_server)
>>>                 __dl_server_attach_root(&rq->fair_server, rq);
>>>
>>>>> So if rd changes during boot initialization, the correct dl_bw has to be
>>>>> updated AFAICS. Also if cpusets are used, the rd for a CPU may change.
>>>> cpusets changes are something that I still need to double check. Will
>>>> do.
>>>>
>>> Sounds good, that would be good to verify.
>> So, I played a little bit with it and came up with a simple set of ops
>> that point out an issue (default fedora server install):
>>
>> echo Y >/sys/kernel/debug/sched/verbose
>>
>> echo +cpuset >/sys/fs/cgroup/cgroup.subtree_control
>>
>> echo 0-7 > /sys/fs/cgroup/user.slice/cpuset.cpus
>> echo 6-7 > /sys/fs/cgroup/user.slice/cpuset.cpus.exclusive
>> echo root >/sys/fs/cgroup/user.slice/cpuset.cpus.partition
>>
>> The domains are rebuilt correctly, but we end up with a null total_bw.
>>
>> The conditional call above takes care correctly of adding back dl_server
>> per-rq bandwidth when we pass from the single domain to the 2 exclusive
>> ones, but I noticed that we go through partition_sched_domains_locked()
>> twice for a single write of 'root' and the second one, since it's not
>> actually destroying/rebuilding anything, is resetting total_bw w/o
>> adding the dl_server contribution back.
>>
>> Now, not completely sure why we need to go through
>> partition_sched_domains_locked() twice, as we have (it also looked like
>> a pattern from other call paths)
>>
>> update_prstate()
>>   -> update_cpumasks_hier()
>>      -> rebuild_sched_domains_locked() <- right at the end
>>   -> update_partition_sd_lb()
>>      -> rebuild_sched_domains_locked() <- right after the above call
>>
>> Removing the first call does indeed fix the issue and domains look OK,
>> but I'm pretty sure I'm missing all sorts of details and corner cases.
>>
>> Waiman (now Cc-ed), maybe you can help here understanding why the two
>> back to back calls are needed?
>
> Thanks for letting me know about this case.
>
> I am aware that rebuild_sched_domains_locked() can be called more than
> once. I have addressed the hotplug case, but it can happen in some
> other corner cases as well. The problem with multiple
> rebuild_sched_domains_locked() calls is the fact that intermediate
> ones may be called where the internal states may not be consistent. I
> am going to work on a fix to this issue by making sure that
> rebuild_sched_domains_locked() will only be called once.

I am working on a set of cpuset patches to eliminate redundant
rebuild_sched_domains_locked() calls. However, my cpuset test script
fails after the change because some test cases offline the only CPU in
a 1-CPU partition. So I sent out a sched/deadline patch [1] to work
around this particular corner case.

[1] https://lore.kernel.org/lkml/20241108042924.520458-1-longman@redhat.com/T/#u

Apparently, the null total_bw bug caused by multiple
rebuild_sched_domains_locked() calls masks this problem.
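
To illustrate the masking effect, here is a toy model in plain C. It is
not kernel code: every toy_* name, the bandwidth value and the simplified
overflow check are assumptions made up for this sketch. It only mirrors
the behaviour described in this thread, where a rebuild pass that does not
actually rebuild anything clears total_bw without adding the dl_server
contribution back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical types and helpers, not kernel code. */
struct toy_root_domain {
	uint64_t total_bw;	/* sum of accounted deadline bandwidth */
};

/* Hypothetical per-CPU fair-server bandwidth, in arbitrary units. */
#define TOY_SERVER_BW 50u

/*
 * One rebuild pass in this toy model: total_bw is always cleared, but the
 * per-CPU server bandwidth is added back only when the domains are
 * actually destroyed and rebuilt.
 */
void toy_rebuild(struct toy_root_domain *rd, int nr_cpus, bool domains_changed)
{
	rd->total_bw = 0;
	if (domains_changed) {
		for (int cpu = 0; cpu < nr_cpus; cpu++)
			rd->total_bw += TOY_SERVER_BW;
	}
}

/*
 * Simplified, assumed stand-in for an admission check on CPU offline:
 * report overflow when the accounted bandwidth exceeds what the
 * remaining capacity can hold.
 */
bool toy_offline_overflows(const struct toy_root_domain *rd,
			   uint64_t remaining_capacity)
{
	return rd->total_bw > remaining_capacity;
}

/* Walks through the two back-to-back passes for a 1-CPU partition. */
void toy_demo(void)
{
	struct toy_root_domain rd = { 0 };

	toy_rebuild(&rd, 1, true);		/* first pass: domains rebuilt */
	assert(toy_offline_overflows(&rd, 0));	/* server bw blocks the offline */

	toy_rebuild(&rd, 1, false);		/* second pass: nothing rebuilt */
	assert(!toy_offline_overflows(&rd, 0));	/* total_bw is 0, check passes */
}
```

In this sketch, offlining the only CPU leaves zero remaining capacity, so
the check fails whenever the server bandwidth is accounted and passes only
because the second pass zeroed total_bw.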

Anyway, I should be able to post the cpuset patch series next week after
further testing. Please review my sched/deadline patch to see if you are
OK with this minor change.

Cheers,
Longman