Date: Tue, 30 Aug 2022 15:16:24 -0700
From: Ben Greear <greearb@...delatech.com>
To: Pali Rohár <pali@...nel.org>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>, bjorn@...gaas.com,
 LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org,
 Stefan Roese <sr@...x.de>, Bjorn Helgaas <bhelgaas@...gle.com>,
 "Rafael J. Wysocki" <rjw@...ysocki.net>,
 Bharat Kumar Gogada <bharat.kumar.gogada@...inx.com>,
 Michal Simek <michal.simek@...inx.com>,
 Yao Hongbo <yaohongbo@...ux.alibaba.com>,
 Naveen Naidu <naveennaidu479@...il.com>, Sasha Levin <sashal@...nel.org>
Subject: Re: [PATCH 5.4 182/389] PCI/portdrv: Dont disable AER reporting in
 get_port_device_capability()

On 8/30/22 2:55 PM, Pali Rohár wrote:
> On Tuesday 30 August 2022 14:28:14 Ben Greear wrote:
>> On 8/30/22 1:58 PM, Pali Rohár wrote:
>>> On Tuesday 30 August 2022 13:47:48 Ben Greear wrote:
>>>> On 8/23/22 11:41 PM, Greg Kroah-Hartman wrote:
>>>>> On Tue, Aug 23, 2022 at 07:20:14AM -0500, Bjorn Helgaas wrote:
>>>>>> On Tue, Aug 23, 2022, 6:35 AM Greg Kroah-Hartman <gregkh@...uxfoundation.org>
>>>>>> wrote:
>>>>>>
>>>>>>> From: Stefan Roese <sr@...x.de>
>>>>>>>
>>>>>>> [ Upstream commit 8795e182b02dc87e343c79e73af6b8b7f9c5e635 ]
>>>>>>>
>>>>>>
>>>>>> There's an open regression related to this commit:
>>>>>>
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>>>>
>>>>> This is already in the following released stable kernels:
>>>>> 5.10.137 5.15.61 5.18.18 5.19.2
>>>>>
>>>>> I'll go drop it from the 4.19 and 5.4 queues, but when this gets
>>>>> resolved in Linus's tree, make sure there's a cc: stable on the fix so
>>>>> that we know to backport it to the above branches as well. Or at the
>>>>> least, a "Fixes:" tag.
>>>>
>>>> This is still in 5.19.5. We saw some funny iwlwifi crashes in 5.19.3+
>>>> that we did not see in 5.19.0+. I just bisected the scary looking AER errors to this
>>>> patch, though I do not know for certain if it causes the iwlwifi related crashes yet.
>>>>
>>>> In general, from reading the commit msg, this patch doesn't seem to be a great candidate
>>>> for stable in general. Does it fix some important problem?
>>>>
>>>> In case it helps, here is example of what I see in dmesg. The kernel crashes in iwlwifi
>>>> had to do with rx messages from the firmware, and some warnings lead me to believe that
>>>> pci messages were slow coming back and/or maybe duplicated. So maybe this AER patch changes
>>>> timing or otherwise screws up the PCI adapter boards we use...
>>>
>>> From that log I have feeling that issue is in that intel wifi card and
>>> it was there also before that commit. Card is crashing (or something
>>> other happens on PCIe bus) and because kernel had disabled Error
>>> Reporting for this card, nobody spotted any issue. And that commit just
>>> opened eye to kernel to see those errors.
>>>
>>> I think this issue should be reported to intel wifi card developers,
>>> maybe they comment it, why card is reporting errors.
>>
>> My main concern is not that AER messages started showing up, but that there
>> started being kernel NPE and WARNINGS showing up sometime after 5.19.0.
>>
>> Possibly this AER thing is mis-direction and the real bug is elsewhere,
>> but since the bugzilla also indicated (different) driver crashes, then
>> I am suspicious this changes things more significantly, at least in a subset
>> of hardware out there.
>
> Yea, of course, this is something needed to investigate.
>
> Anyway, do you see driver crashes? Or just these AER errors? And are
> your PCIe cards working, or after seeing these messages in dmesg they
> stopped working? It is needed to know if you are just spammed by tons of
> lines in dmesg and otherwise everything works. Or if after AER errors
> your PCIe devices stop working and rebooting system is required.

We did see higher frequency of weird crashes (accessing null-ish pointer)
after upgrading to 5.19.3, I am building kernel now with 5.19.5 and that
AER patch reverted. We will test to see if that solves the crashes.

>> Also, any idea what this error in my logs is actually indicating?
>
> Your PCIe controller received non-fatal, but uncorrected error. There is
> also indication of Unsupported Request Completion Status. Unsupported
> Request is generated by PCIe device when controller / host / kernel try
> to do something which is not supported by device; pretty generic error.
> PCIe base spec describe lot of scenarios when card should return this
> error. Maybe some more detailed information are in TLP Header hexdump,
> but I cannot decode it now.
>
> Basically it is PCIe card driver who could know how fatal it is that
> issue and how to recover from it. But as you can see intel wifi driver
> does not implement that callback.

Odds of me getting a good answer on that are pretty small.

Thanks,
Ben

-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
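
Editorial note: Pali's reading of the log above (a non-fatal uncorrected error with Unsupported Request set) amounts to decoding the AER Uncorrectable Error Status register. The sketch below shows roughly how such a status word maps to error names. The bit positions follow the PCI_ERR_UNC_* definitions in the Linux header include/uapi/linux/pci_regs.h. The function name and the sample status values are hypothetical, since the actual dmesg output is not reproduced in this message.

```python
from typing import List

# Bit masks of the PCIe AER Uncorrectable Error Status register,
# following the PCI_ERR_UNC_* constants in Linux's pci_regs.h.
UNCORRECTABLE_BITS = {
    0x00000010: "Data Link Protocol Error",
    0x00000020: "Surprise Down",
    0x00001000: "Poisoned TLP",
    0x00002000: "Flow Control Protocol Error",
    0x00004000: "Completion Timeout",
    0x00008000: "Completer Abort",
    0x00010000: "Unexpected Completion",
    0x00020000: "Receiver Overflow",
    0x00040000: "Malformed TLP",
    0x00080000: "ECRC Error",
    0x00100000: "Unsupported Request",
    0x00200000: "ACS Violation",
}


def decode_uncorrectable_status(status: int) -> List[str]:
    """Return the names of all error bits set in `status` (hypothetical helper)."""
    return [name for mask, name in UNCORRECTABLE_BITS.items() if status & mask]


if __name__ == "__main__":
    # Hypothetical status word with only the Unsupported Request bit set;
    # not taken from the log discussed in this thread.
    sample_status = 0x00100000
    for name in decode_uncorrectable_status(sample_status):
        print(name)
```

Whether a given bit is treated as fatal or non-fatal depends on the separate Uncorrectable Error Severity register, which this sketch does not model.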