Date:   Wed, 11 Jan 2023 23:57:40 +0100
From:   Heiner Kallweit <hkallweit1@...il.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Jakub Kicinski <kuba@...nel.org>,
        David Miller <davem@...emloft.net>,
        Realtek linux nic maintainers <nic_swsd@...ltek.com>,
        Eric Dumazet <edumazet@...gle.com>,
        Paolo Abeni <pabeni@...hat.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Stephen Hemminger <stephen@...workplumber.org>
Subject: Re: [PATCH net-next resubmit v2] r8169: disable ASPM in case of tx
 timeout

On 11.01.2023 23:38, Alexander Duyck wrote:
> On Wed, Jan 11, 2023 at 12:17 PM Heiner Kallweit <hkallweit1@...il.com> wrote:
>>
>> On 11.01.2023 17:16, Alexander H Duyck wrote:
>>> On Tue, 2023-01-10 at 23:03 +0100, Heiner Kallweit wrote:
>>>> There are still single reports of systems where ASPM incompatibilities
>>>> cause tx timeouts. It's not clear whom to blame, so let's disable
>>>> ASPM in case of a tx timeout.
>>>>
>>>> v2:
>>>> - add one-time warning for informing the user
>>>>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@...il.com>
>>>
>>> From past experience I have seen ASPM issues cause the device to
>>> disappear from the bus after failing to come out of L1. If that occurs
>>> this won't be able to recover after the timeout without resetting the
>>> bus itself. As such it may be necessary to disable the link states
>>> prior to using the device rather than waiting until after the error.
>>> That can be addressed in a follow-on patch if this doesn't resolve the
>>> issue.
>>>
>>
>> Interesting, reports about disappearing devices I haven't seen yet.
>> Symptoms I've seen differ, based on combination of more or less faulty
>> NIC chipset version, BIOS bugs, PCIe mainboard chipset.
>> Typically users experienced missed rx packets, tx timeouts or NIC lockups.
>> Disabling ASPM resulted in complaints of notebook users about reduced
>> system runtime on battery.
>> Meanwhile we found a good balance and reports about ASPM issues
>> became quite rare.
>> Just L1.2 still causes issues under load even with newer chipset versions,
>> therefore L1.2 is disabled per default.
> 
> Does your driver do any checking for MMIO failures on reads? Basically
> when the device disappears it should start returning ~0 on mmio reads.

Not yet, good idea.
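
A minimal sketch of the check described above, as self-contained
userspace C rather than driver code. The names fake_nic,
fake_nic_read32 and fake_nic_read_failed are hypothetical and do not
exist in r8169; the sketch also assumes that all ones is never a valid
value for the register being checked, which would have to be verified
per register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a NIC's register window (not r8169 code). */
struct fake_nic {
    uint32_t regs[4];   /* simulated register contents */
    bool link_dead;     /* simulates a device that stopped responding */
};

/* Simulated 32-bit MMIO read: an unresponsive PCIe device reads as all ones. */
static uint32_t fake_nic_read32(const struct fake_nic *nic, size_t idx)
{
    assert(nic != NULL && idx < 4);
    return nic->link_dead ? UINT32_MAX : nic->regs[idx];
}

/*
 * Returns true if the value looks like a failed MMIO access.
 * Assumption: all ones is never a valid value for the register checked.
 */
static bool fake_nic_read_failed(uint32_t val)
{
    return val == UINT32_MAX;
}
```

In a tx timeout handler, such a check could decide whether changing the
ASPM settings has any chance of helping, or whether the device is
already unreachable and a bus reset would be needed.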

> The device itself doesn't disappear, but it doesn't respond to
> requests anymore so it might be the "NIC lockups" case you mentioned.
> The Intel parts would disappear as they would trigger their "surprise
> removal" logic which would detach the netdevice. I have seen that
> issue on some platforms. It is kind of interesting when you can
> actually watch it happen as the issue was essentially a marginal PCIe
> connection so it would start out at x4, then renegotiate down with
> each ASPM L1 link bounce, and eventually it would end up at x1 before
> just dropping off the bus.
> 
> I agree pro-actively disabling ASPM is bad for power savings. So if
> this approach can resolve it then I am more than willing to give it a
> try. My main concern is if MMIO is already borked, updating the ASPM
> settings may not be enough to bring it back and it may require a
> secondary bus reset.

I'll think about how to further improve recovery in case of ASPM issues.
