Message-ID: <370dee6c-919a-2f98-1404-a3feda14d1ba@candelatech.com>
Date: Tue, 30 Aug 2022 15:16:24 -0700
From: Ben Greear <greearb@...delatech.com>
To: Pali Rohár <pali@...nel.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, bjorn@...gaas.com,
LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org,
Stefan Roese <sr@...x.de>, Bjorn Helgaas <bhelgaas@...gle.com>,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Bharat Kumar Gogada <bharat.kumar.gogada@...inx.com>,
Michal Simek <michal.simek@...inx.com>,
Yao Hongbo <yaohongbo@...ux.alibaba.com>,
Naveen Naidu <naveennaidu479@...il.com>,
Sasha Levin <sashal@...nel.org>
Subject: Re: [PATCH 5.4 182/389] PCI/portdrv: Don't disable AER reporting in
get_port_device_capability()

On 8/30/22 2:55 PM, Pali Rohár wrote:
> On Tuesday 30 August 2022 14:28:14 Ben Greear wrote:
>> On 8/30/22 1:58 PM, Pali Rohár wrote:
>>> On Tuesday 30 August 2022 13:47:48 Ben Greear wrote:
>>>> On 8/23/22 11:41 PM, Greg Kroah-Hartman wrote:
>>>>> On Tue, Aug 23, 2022 at 07:20:14AM -0500, Bjorn Helgaas wrote:
>>>>>> On Tue, Aug 23, 2022, 6:35 AM Greg Kroah-Hartman <gregkh@...uxfoundation.org>
>>>>>> wrote:
>>>>>>
>>>>>>> From: Stefan Roese <sr@...x.de>
>>>>>>>
>>>>>>> [ Upstream commit 8795e182b02dc87e343c79e73af6b8b7f9c5e635 ]
>>>>>>>
>>>>>>
>>>>>> There's an open regression related to this commit:
>>>>>>
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>>>>
>>>>> This is already in the following released stable kernels:
>>>>> 5.10.137 5.15.61 5.18.18 5.19.2
>>>>>
>>>>> I'll go drop it from the 4.19 and 5.4 queues, but when this gets
>>>>> resolved in Linus's tree, make sure there's a cc: stable on the fix so
>>>>> that we know to backport it to the above branches as well. Or at the
>>>>> least, a "Fixes:" tag.
>>>>
>>>> This is still in 5.19.5. We saw some funny iwlwifi crashes in 5.19.3+
>>>> that we did not see in 5.19.0+. I just bisected the scary looking AER errors to this
>>>> patch, though I do not know for certain if it causes the iwlwifi related crashes yet.
>>>>
>>>> In general, from reading the commit msg, this patch doesn't seem to be a great candidate
>>>> for stable in general. Does it fix some important problem?
>>>>
>>>> In case it helps, here is example of what I see in dmesg. The kernel crashes in iwlwifi
>>>> had to do with rx messages from the firmware, and some warnings lead me to believe that
>>>> pci messages were slow coming back and/or maybe duplicated. So maybe this AER patch changes
>>>> timing or otherwise screws up the PCI adapter boards we use...
>>>
>>> From that log I have feeling that issue is in that intel wifi card and
>>> it was there also before that commit. Card is crashing (or something
>>> other happens on PCIe bus) and because kernel had disabled Error
>>> Reporting for this card, nobody spotted any issue. And that commit just
>>> opened eye to kernel to see those errors.
>>>
>>> I think this issue should be reported to intel wifi card developers,
>>> maybe they comment it, why card is reporting errors.
>>
>> My main concern is not that AER messages started showing up, but that there
>> started being kernel NPE and WARNINGS showing up sometime after 5.19.0.
>>
>> Possibly this AER thing is mis-direction and the real bug is elsewhere,
>> but since the bugzilla also indicated (different) driver crashes, then
>> I am suspicious this changes things more significantly, at least in a subset
>> of hardware out there.
>
> Yea, of course, this is something needed to investigate.
>
> Anyway, do you see driver crashes? Or just these AER errors? And are
> your PCIe cards working, or after seeing these messages in dmesg they
> stopped working? It is needed to know if you are just spammed by tons of
> lines in dmesg and otherwise everything works. Or if after AER errors
> your PCIe devices stop working and rebooting system is required.

We did see a higher frequency of weird crashes (accessing a null-ish pointer)
after upgrading to 5.19.3. I am now building a 5.19.5 kernel with that AER
patch reverted. We will test to see if that solves the crashes.

>> Also, any idea what this error in my logs is actually indicating?
>
> Your PCIe controller received non-fatal, but uncorrected error. There is
> also indication of Unsupported Request Completion Status. Unsupported
> Request is generated by PCIe device when controller / host / kernel try
> to do something which is not supported by device; pretty generic error.
> PCIe base spec describe lot of scenarios when card should return this
> error. Maybe some more detailed information are in TLP Header hexdump,
> but I cannot decode it now.
>
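
For reference, the "Unsupported Request" indication described above
corresponds to one bit in the AER Uncorrectable Error Status register.
Below is a minimal Python sketch of decoding that register. The bit
positions mirror the PCI_ERR_UNC_* masks in the kernel's
include/uapi/linux/pci_regs.h. The table name, the function name, and the
sample value are made up for illustration and are not taken from the log
in this thread, and the error names are descriptive rather than the exact
strings the kernel prints.

```python
# Bit positions in the AER Uncorrectable Error Status register, mirroring the
# PCI_ERR_UNC_* masks in include/uapi/linux/pci_regs.h.
# The names are descriptive labels, not the strings printed in dmesg.
UNCOR_STATUS_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_uncor_status(status: int) -> list[str]:
    """Return the names of the error bits set in a 32-bit status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            # Bits missing from the table are reported by number only.
            names.append(UNCOR_STATUS_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


if __name__ == "__main__":
    # Hypothetical status value with only bit 20 set (assumption, not from the log).
    print(decode_uncor_status(0x00100000))
```

Whether such an error is treated as fatal or non-fatal is decided by a
separate severity register, so this decode only says which condition was
flagged, not how serious it is.
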
> Basically it is PCIe card driver who could know how fatal it is that
> issue and how to recover from it. But as you can see intel wifi driver
> does not implement that callback.

Odds of me getting a good answer on that are pretty small.

Thanks,
Ben

--
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc http://www.candelatech.com