Message-ID: <cfcea236-5b4c-4037-a6f5-267c4c04ad3c@nvidia.com>
Date: Mon, 3 Feb 2025 11:01:51 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Juri Lelli <juri.lelli@...hat.com>
Cc: Thierry Reding <treding@...dia.com>, Waiman Long <longman@...hat.com>,
 Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
 Michal Koutny <mkoutny@...e.com>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Phil Auld <pauld@...hat.com>, Qais Yousef <qyousef@...alina.io>,
 Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
 "Joel Fernandes (Google)" <joel@...lfernandes.org>,
 Suleiman Souhlal <suleiman@...gle.com>, Aashish Sharma <shraash@...gle.com>,
 Shin Kawamura <kawasin@...gle.com>,
 Vineeth Remanan Pillai <vineeth@...byteword.org>,
 linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
 "linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH v2 3/2] sched/deadline: Check bandwidth overflow earlier
 for hotplug

Hi Juri,

On 16/01/2025 15:55, Juri Lelli wrote:
> On 16/01/25 13:14, Jon Hunter wrote:
>>
>> On 15/01/2025 16:10, Juri Lelli wrote:
>>> On 14/01/25 15:02, Juri Lelli wrote:
>>>> On 14/01/25 13:52, Jon Hunter wrote:
>>>>>
>>>>> On 13/01/2025 09:32, Juri Lelli wrote:
>>>>>> On 10/01/25 18:40, Jon Hunter wrote:
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>> With the above I see the following ...
>>>>>>>
>>>>>>> [   53.919672] dl_bw_manage: cpu=5 cap=3072 fair_server_bw=52428 total_bw=209712 dl_bw_cpus=4
>>>>>>> [   53.930608] dl_bw_manage: cpu=4 cap=2048 fair_server_bw=52428 total_bw=157284 dl_bw_cpus=3
>>>>>>> [   53.941601] dl_bw_manage: cpu=3 cap=1024 fair_server_bw=52428 total_bw=104856 dl_bw_cpus=2
>>>>>>
>>>>>> So far so good.
>>>>>>
>>>>>>> [   53.952186] dl_bw_manage: cpu=2 cap=1024 fair_server_bw=52428 total_bw=576708 dl_bw_cpus=2
>>>>>>
>>>>>> But, this above doesn't sound right.
>>>>>>
>>>>>>> [   53.962938] dl_bw_manage: cpu=1 cap=0 fair_server_bw=52428 total_bw=576708 dl_bw_cpus=1
>>>>>>> [   53.971068] Error taking CPU1 down: -16
>>>>>>> [   53.974912] Non-boot CPUs are not disabled
>>>>>>
>>>>>> What is the topology of your board?
>>>>>>
>>>>>> Are you using any cpuset configuration for partitioning CPUs?
>>>>>
>>>>>
>>>>> I just noticed that by default we do boot this board with 'isolcpus=1-2'. I
>>>>> see that this is a deprecated cmdline argument now and I must admit I don't
>>>>> know the history of this for this specific board. It is quite old now.
>>>>>
>>>>> Thierry, I am curious if you have this set for Tegra186 or not? Looks like
>>>>> our BSP (r35 based) sets this by default.
>>>>>
>>>>> I did try removing this and that does appear to fix it.
>>>>
>>>> OK, good.
>>>>
>>>>> Juri, let me know your thoughts.
>>>>
>>>> Thanks for the additional info. I guess I could now try to repro using
>>>> isolcpus at boot on systems I have access to (to possibly understand
>>>> what the underlying problem is).
>>>
>>> I think the problem lies in the def_root_domain accounting of dl_servers
>>> (which isolated CPUs remain attached to).
>>>
>>> Came up with the following, of which I'm not yet fully convinced, but
>>> could you please try it out on top of the debug patch and see how it
>>> does with the original failing setup using isolcpus?
>>
>>
>> Thanks I added the change, but suspend is still failing with this ...
> 
> Thanks!
> 
>> [  210.595431] dl_bw_manage: cpu=5 cap=3072 fair_server_bw=52428 total_bw=209712 dl_bw_cpus=4
>> [  210.606269] dl_bw_manage: cpu=4 cap=2048 fair_server_bw=52428 total_bw=157284 dl_bw_cpus=3
>> [  210.617281] dl_bw_manage: cpu=3 cap=1024 fair_server_bw=52428 total_bw=104856 dl_bw_cpus=2
>> [  210.627205] dl_bw_manage: cpu=2 cap=1024 fair_server_bw=52428 total_bw=262140 dl_bw_cpus=2
>> [  210.637752] dl_bw_manage: cpu=1 cap=0 fair_server_bw=52428 total_bw=262140 dl_bw_cpus=1
>                                                                            ^
> Different than before but still not what I expected. Looks like there
> are conditions/paths I currently cannot replicate on my setup, so more
> thinking is needed. Unfortunately I will be out traveling next week, so
> this might require a bit of time.


I see that this is now in mainline and our board is still failing to
suspend. Let me know if there is anything else you need me to test.
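
For reference, a minimal sketch that checks the quoted dl_bw_manage output
from the second run. It assumes that total_bw should equal
fair_server_bw * dl_bw_cpus, which is only inferred from the first three
log lines and is not taken from the kernel code. The helper names
(parse_line, check) are hypothetical.

```python
import re

# Log lines copied from the quoted output above (run with the extra change).
LOG = """\
[  210.595431] dl_bw_manage: cpu=5 cap=3072 fair_server_bw=52428 total_bw=209712 dl_bw_cpus=4
[  210.606269] dl_bw_manage: cpu=4 cap=2048 fair_server_bw=52428 total_bw=157284 dl_bw_cpus=3
[  210.617281] dl_bw_manage: cpu=3 cap=1024 fair_server_bw=52428 total_bw=104856 dl_bw_cpus=2
[  210.627205] dl_bw_manage: cpu=2 cap=1024 fair_server_bw=52428 total_bw=262140 dl_bw_cpus=2
[  210.637752] dl_bw_manage: cpu=1 cap=0 fair_server_bw=52428 total_bw=262140 dl_bw_cpus=1
"""

FIELD_RE = re.compile(r"(\w+)=(\d+)")


def parse_line(line):
    """Return the key=value fields of one dl_bw_manage line as integers."""
    return {key: int(value) for key, value in FIELD_RE.findall(line)}


def check(log):
    """Return (cpu, total_bw, expected, servers) for each dl_bw_manage line.

    'expected' is fair_server_bw * dl_bw_cpus (an assumption, see above);
    'servers' is how many fair servers the reported total_bw corresponds to.
    """
    rows = []
    for line in log.splitlines():
        if "dl_bw_manage:" not in line:
            continue
        f = parse_line(line)
        expected = f["fair_server_bw"] * f["dl_bw_cpus"]
        servers = f["total_bw"] // f["fair_server_bw"]
        rows.append((f["cpu"], f["total_bw"], expected, servers))
    return rows


if __name__ == "__main__":
    for cpu, total, expected, servers in check(LOG):
        flag = "ok" if total == expected else "MISMATCH"
        print(f"cpu={cpu} total_bw={total} expected={expected} "
              f"servers={servers} {flag}")
```

Under that assumption the first three lines match, while the cpu=2 and
cpu=1 lines report a total_bw worth five fair servers with only two and
one CPUs counted in dl_bw_cpus.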

Thanks
Jon

-- 
nvpublic