Date:   Tue, 30 Aug 2022 15:16:24 -0700
From:   Ben Greear <greearb@...delatech.com>
To:     Pali Rohár <pali@...nel.org>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>, bjorn@...gaas.com,
        LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org,
        Stefan Roese <sr@...x.de>, Bjorn Helgaas <bhelgaas@...gle.com>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Bharat Kumar Gogada <bharat.kumar.gogada@...inx.com>,
        Michal Simek <michal.simek@...inx.com>,
        Yao Hongbo <yaohongbo@...ux.alibaba.com>,
        Naveen Naidu <naveennaidu479@...il.com>,
        Sasha Levin <sashal@...nel.org>
Subject: Re: [PATCH 5.4 182/389] PCI/portdrv: Dont disable AER reporting in
 get_port_device_capability()

On 8/30/22 2:55 PM, Pali Rohár wrote:
> On Tuesday 30 August 2022 14:28:14 Ben Greear wrote:
>> On 8/30/22 1:58 PM, Pali Rohár wrote:
>>> On Tuesday 30 August 2022 13:47:48 Ben Greear wrote:
>>>> On 8/23/22 11:41 PM, Greg Kroah-Hartman wrote:
>>>>> On Tue, Aug 23, 2022 at 07:20:14AM -0500, Bjorn Helgaas wrote:
>>>>>> On Tue, Aug 23, 2022, 6:35 AM Greg Kroah-Hartman <gregkh@...uxfoundation.org>
>>>>>> wrote:
>>>>>>
>>>>>>> From: Stefan Roese <sr@...x.de>
>>>>>>>
>>>>>>> [ Upstream commit 8795e182b02dc87e343c79e73af6b8b7f9c5e635 ]
>>>>>>>
>>>>>>
>>>>>> There's an open regression related to this commit:
>>>>>>
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>>>>
>>>>> This is already in the following released stable kernels:
>>>>> 	5.10.137 5.15.61 5.18.18 5.19.2
>>>>>
>>>>> I'll go drop it from the 4.19 and 5.4 queues, but when this gets
>>>>> resolved in Linus's tree, make sure there's a cc: stable on the fix so
>>>>> that we know to backport it to the above branches as well.  Or at the
>>>>> least, a "Fixes:" tag.
>>>>
>>>> This is still in 5.19.5.  We saw some funny iwlwifi crashes in 5.19.3+
>>>> that we did not see in 5.19.0+.  I just bisected the scary looking AER errors to this
>>>> patch, though I do not know for certain if it causes the iwlwifi related crashes yet.
>>>>
>>>> In general, from reading the commit msg, this patch doesn't seem to be a great candidate
>>>> for stable in general.  Does it fix some important problem?
>>>>
>>>> In case it helps, here is example of what I see in dmesg.  The kernel crashes in iwlwifi
>>>> had to do with rx messages from the firmware, and some warnings lead me to believe that
>>>> pci messages were slow coming back and/or maybe duplicated.  So maybe this AER patch changes
>>>> timing or otherwise screws up the PCI adapter boards we use...
>>>
>>> From that log I have a feeling that the issue is in that intel wifi card and
>>> that it was there before that commit too. The card is crashing (or something
>>> else is happening on the PCIe bus), and because the kernel had disabled Error
>>> Reporting for this card, nobody spotted any issue. That commit just
>>> opened the kernel's eyes to those errors.
>>>
>>> I think this issue should be reported to the intel wifi card developers;
>>> maybe they can comment on why the card is reporting errors.
>>
>> My main concern is not that AER messages started showing up, but that there
>> started being kernel NPE and WARNINGS showing up sometime after 5.19.0.
>>
>> Possibly this AER thing is mis-direction and the real bug is elsewhere,
>> but since the bugzilla also indicated (different) driver crashes, then
>> I am suspicious this changes things more significantly, at least in a subset
>> of hardware out there.
> 
> Yes, of course, this needs to be investigated.
> 
> Anyway, do you see driver crashes, or just these AER errors? And are
> your PCIe cards working, or did they stop working after these messages
> appeared in dmesg? We need to know whether you are just spammed by tons of
> lines in dmesg while everything otherwise works, or whether your PCIe
> devices stop working after the AER errors and a system reboot is required.

We did see a higher frequency of weird crashes (accessing null-ish pointers) after upgrading to 5.19.3.
I am now building a 5.19.5 kernel with that AER patch reverted.  We will
test to see if that solves the crashes.

>> Also, any idea what this error in my logs is actually indicating?
> 
> Your PCIe controller received a non-fatal but uncorrected error. There is
> also an indication of Unsupported Request Completion Status. Unsupported
> Request is generated by a PCIe device when the controller / host / kernel tries
> to do something that the device does not support; it is a pretty generic error.
> The PCIe base spec describes a lot of scenarios in which a card should return this
> error. Maybe more detailed information is in the TLP Header hexdump,
> but I cannot decode it now.
> 
> Basically, it is the PCIe card's driver that could know how fatal the
> issue is and how to recover from it. But as you can see, the intel wifi driver
> does not implement that callback.
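
For readers following the explanation quoted above, here is a minimal
sketch of how the status word on an AER "error status/mask=" line maps to
error names such as Unsupported Request. The bit positions are the ones
defined for the AER Uncorrectable Error Status register (the PCI_ERR_UNC_*
masks in the kernel's pci_regs.h). The function name and the example value
are hypothetical and are not taken from the log discussed in this thread.

```python
# Bit positions in the AER Uncorrectable Error Status register.
# These match the PCI_ERR_UNC_* masks in the Linux kernel's pci_regs.h.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_uncorrectable(status):
    """Return the names of all known error bits set in `status`.

    `decode_uncorrectable` is a hypothetical helper for illustration only.
    """
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    # Hypothetical status word with only bit 20 set.
    example_status = 0x00100000
    for name in decode_uncorrectable(example_status):
        print(name)
```

This only names the error class. Working out which request the device
rejected would still require decoding the TLP Header dump, which is not
attempted here.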

Odds of me getting a good answer on that are pretty small.

Thanks,
Ben

-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
