[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <18249a21-7aec-4a66-bc5a-3aa077c2b190@gmail.com>
Date: Wed, 10 Jan 2024 08:09:18 +0100
From: Heiner Kallweit <hkallweit1@...il.com>
To: Pavan Chebbi <pavan.chebbi@...adcom.com>
Cc: Andrea Fois <andrea.fois@...ntsense.it>, Michael Chan
<mchan@...adcom.com>, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, George Shuklin <george.shuklin@...il.com>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] tg3: add new module param to force device power down on
reboot
On 10.01.2024 05:12, Pavan Chebbi wrote:
> On Wed, Jan 10, 2024 at 2:01 AM Heiner Kallweit <hkallweit1@...il.com> wrote:
>>
>> On 09.01.2024 20:45, Andrea Fois wrote:
>>> The bug #1917471 was fixed in commit 2ca1c94ce0b6 ("tg3: Disable tg3
>>> device on system reboot to avoid triggering AER") but was reintroduced
>>> by commit 9fc3bc764334 ("tg3: power down device only on
>>> SYSTEM_POWER_OFF").
>>>
>>> The problem described in #1917471 is still consistently replicable on
>>> reboots on Dell Servers (i.e. R750xs with BCM5720 LOM), causing NMIs
>>> (i.e. NMI received for unknown reason 38 on cpu 0) after 9fc3bc764334
>>> was committed.
>>>
>>> The problem is detected also by the Lifecycle controller and logged as
>>> a PCI Bus Error for the device.
>>>
>>> As the problems addressed by 2ca1c94ce0b6 and by 9fc3bc764334 requires
>>> opposite strategies, a new module param "force_pwr_down_on_reboot"
>>> <bool> is introduced to fix both scenarios:
>>>
>> Adding module parameters is discouraged. What I see could try:
> Ack.
>>
>> - limit 9fc3bc764334 to the specific machine type mentioned in the
>> commit message (based DMI info)
>> - 2ca1c94ce0b6 performs two actions: power down tg3 and disable device
>> Based on the commit description disabling the device might be sufficient.
>
> I think the second suggestion could be a better solution. Helps to
> solve the issue 9fc3bc764334 is trying to fix.
> But I am not sure how easy it is to test. As I recall, Goerge was
> unable to reach out to the author of 2ca1c94ce0b6 when he wanted to
> test his patch for regression.
> We did discuss the risk of this regression.
> https://patchwork.kernel.org/project/netdevbpf/patch/20231101130418.44164-1-george.shuklin@gmail.com/
> Unfortunately, looks like it has come true :(
If the culprit is an unexpected MSI interrupt, then you may also address
this directly by disabling interrupts in the device. tg3_stop() may be
a candidate here.
Powered by blists - more mailing lists