Message-ID: <20230103191418.GA1011392@bhelgaas>
Date: Tue, 3 Jan 2023 13:14:18 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Rajat Khandelwal <rajat.khandelwal@...ux.intel.com>
Cc: ruscur@...sell.cc, oohall@...il.com, bhelgaas@...gle.com,
linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, rajat.khandelwal@...el.com,
Paul Menzel <pmenzel@...gen.mpg.de>,
"Neftin, Sasha" <sasha.neftin@...el.com>,
Leon Romanovsky <leon@...nel.org>,
Frederick Zhang <frederick888@...ndere.moe>
Subject: Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
[+cc Paul, Sasha, Leon, Frederick]
(Please cc folks who have commented on previous versions of your
patch.)
On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during thunderbolt PCIe
> tunneling.
>
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.
Before we rate-limit these messages for everybody, I want a better
understanding of why we see so many errors in the first place.
> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0: [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0: [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0: [12] Timeout
Please remove the timestamps; they don't contribute to understanding
the problem.
> This gets repeated continuously, thus inundating the buffer.
Did you verify that we actually clear the Correctable Error Status
register?
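As a rough aid for reading the status/mask values in the log above,
here is a minimal Python sketch that decodes a Correctable Error
Status value and models the register's write-1-to-clear behavior.
The names CORRECTABLE_BITS, decode_correctable, and after_rw1c_write
are hypothetical, made up for illustration. The bit positions follow
the PCIe AER capability layout; "Rollover" and "Timeout" match the
log, and the other short labels are assumed to match what the kernel
prints. It only interprets numbers and does not touch hardware.

```python
# Bit positions in the AER Correctable Error Status/Mask registers.
# Labels for bits 8 and 12 come from the quoted log; the rest are assumed.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(status, mask):
    """Split the set status bits into (unmasked, masked) name lists."""
    unmasked, masked = [], []
    for bit, name in sorted(CORRECTABLE_BITS.items()):
        if status & (1 << bit):
            (masked if mask & (1 << bit) else unmasked).append(name)
    return unmasked, masked


def after_rw1c_write(status, written):
    """Model a write-1-to-clear register: each 1 written clears that bit."""
    return status & ~written & 0xFFFFFFFF


if __name__ == "__main__":
    # Values taken from "error status/mask=00001100/00002000" in the log.
    print(decode_correctable(0x00001100, 0x00002000))
    # Writing the status value back should leave nothing set.
    print(hex(after_rw1c_write(0x00001100, 0x00001100)))
```

If the same status bits show up again immediately after such a
write-back, either the device is logging new errors or the clear is
not taking effect, and those two cases call for different fixes.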
https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
similar issue. In that report, Frederick sees the errors when resuming
from sleep. Is there a specific event that triggers the correctable
errors you see?
Bjorn