Message-ID: <20392974-d830-1ef5-9dd2-b83b82cfa263@gmail.com>
Date: Wed, 11 Jan 2023 23:57:40 +0100
From: Heiner Kallweit <hkallweit1@...il.com>
To: Alexander Duyck <alexander.duyck@...il.com>
Cc: Jakub Kicinski <kuba@...nel.org>,
David Miller <davem@...emloft.net>,
Realtek linux nic maintainers <nic_swsd@...ltek.com>,
Eric Dumazet <edumazet@...gle.com>,
Paolo Abeni <pabeni@...hat.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Stephen Hemminger <stephen@...workplumber.org>
Subject: Re: [PATCH net-next resubmit v2] r8169: disable ASPM in case of tx
timeout
On 11.01.2023 23:38, Alexander Duyck wrote:
> On Wed, Jan 11, 2023 at 12:17 PM Heiner Kallweit <hkallweit1@...il.com> wrote:
>>
>> On 11.01.2023 17:16, Alexander H Duyck wrote:
>>> On Tue, 2023-01-10 at 23:03 +0100, Heiner Kallweit wrote:
>>>> There are still single reports of systems where ASPM incompatibilities
>>>> cause tx timeouts. It's not clear whom to blame, so let's disable
>>>> ASPM in case of a tx timeout.
>>>>
>>>> v2:
>>>> - add one-time warning for informing the user
>>>>
>>>> Signed-off-by: Heiner Kallweit <hkallweit1@...il.com>
>>>
>>> From past experience I have seen ASPM issues cause the device to
>>> disappear from the bus after failing to come out of L1. If that occurs
>>> this won't be able to recover after the timeout without resetting the
>>> bus itself. As such it may be necessary to disable the link states
>>> prior to using the device rather than waiting until after the error.
>>> That can be addressed in a follow-on patch if this doesn't resolve the
>>> issue.
>>>
>>
>> Interesting, I haven't seen reports about disappearing devices yet.
>> The symptoms I've seen differ, depending on the combination of a more or
>> less faulty NIC chipset version, BIOS bugs, and the PCIe mainboard chipset.
>> Typically users experienced missed rx packets, tx timeouts, or NIC lockups.
>> Disabling ASPM resulted in complaints from notebook users about reduced
>> system runtime on battery.
>> Meanwhile we found a good balance, and reports about ASPM issues
>> became quite rare.
>> Only L1.2 still causes issues under load, even with newer chipset versions,
>> therefore L1.2 is disabled by default.
>
> Does your driver do any checking for MMIO failures on reads? Basically
> when the device disappears it should start returning ~0 on MMIO reads.
Not yet, good idea.
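
A minimal sketch of what such a check could look like, written as
standalone C so it can be tried outside the kernel. Everything here is
hypothetical and not taken from r8169: the fake_nic structure, the
fake_read32() accessor (standing in for a real MMIO read), and the idea
of confirming with a second "sanity" register that should never
legitimately read as all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessor type; stands in for a real 32-bit MMIO read. */
typedef uint32_t (*mmio_read32_fn)(void *ctx, uint32_t index);

/* Hypothetical fake device, used only to exercise the check. */
struct fake_nic {
	bool responding;	/* false: device no longer answers on the bus */
	uint32_t regs[4];
};

static uint32_t fake_read32(void *ctx, uint32_t index)
{
	const struct fake_nic *nic = ctx;

	/* Reads from a device that does not respond return all ones (~0). */
	if (!nic->responding)
		return UINT32_MAX;
	return nic->regs[index % 4];
}

/*
 * Return true if the device appears to have stopped responding.
 * A single all-ones value may be legitimate register content, so the
 * result is confirmed by reading a second register that is assumed to
 * never hold all ones while the device is alive.
 */
static bool mmio_device_gone(mmio_read32_fn rd, void *ctx,
			     uint32_t reg, uint32_t sanity_reg)
{
	if (rd(ctx, reg) != UINT32_MAX)
		return false;
	return rd(ctx, sanity_reg) == UINT32_MAX;
}
```

In a driver the check would presumably sit in the tx timeout path,
before any attempt to reprogram the chip.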
> The device itself doesn't disappear, but it doesn't respond to
> requests anymore so it might be the "NIC lockups" case you mentioned.
> The Intel parts would disappear as they would trigger their "surprise
> removal" logic which would detach the netdevice. I have seen that
> issue on some platforms. It is kind of interesting when you can
> actually watch it happen as the issue was essentially a marginal PCIe
> connection so it would start out at x4, then renegotiate down with
> each ASPM L1 link bounce, and eventually it would end up at x1 before
> just dropping off the bus.
>
> I agree proactively disabling ASPM is bad for power savings. So if
> this approach can resolve it then I am more than willing to give it a
> try. My main concern is if MMIO is already borked, updating the ASPM
> settings may not be enough to bring it back and it may require a
> secondary bus reset.
I'll think about how to further improve recovery in case of ASPM issues.
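
As a starting point, the decision discussed above could be sketched
roughly like this. It is only an illustration in standalone C, not driver
code: the nic_state structure, the recovery_action values and
on_tx_timeout() are made-up names, and whether a bus reset is really the
right escalation when MMIO is dead is an assumption based on the concern
raised above.

```c
#include <stdbool.h>

/* Hypothetical recovery steps, ordered from least to most invasive. */
enum recovery_action {
	RECOVERY_CHIP_RESET,		/* ASPM already off: just reset the NIC */
	RECOVERY_DISABLE_ASPM_AND_RESET,/* turn ASPM off, then reset the NIC */
	RECOVERY_BUS_RESET,		/* MMIO dead: device-level reset won't help */
};

/* Hypothetical per-device state tracked by the sketch. */
struct nic_state {
	bool aspm_enabled;
	bool warned;	/* set once the user has been told ASPM was disabled */
};

/*
 * Pick a recovery step after a tx timeout.
 * mmio_dead is the result of an "all ones" check on MMIO reads.
 */
static enum recovery_action on_tx_timeout(struct nic_state *st, bool mmio_dead)
{
	/* If the device no longer answers, changing ASPM settings is moot. */
	if (mmio_dead)
		return RECOVERY_BUS_RESET;

	if (st->aspm_enabled) {
		st->aspm_enabled = false;
		/* One-time notice; a driver would log a warning here. */
		if (!st->warned)
			st->warned = true;
		return RECOVERY_DISABLE_ASPM_AND_RESET;
	}

	return RECOVERY_CHIP_RESET;
}
```

The first timeout with a responsive device disables ASPM and sets the
warning flag; later timeouts fall through to a plain chip reset.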