linux-kernel - Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20230103191418.GA1011392@bhelgaas>
Date:   Tue, 3 Jan 2023 13:14:18 -0600
From:   Bjorn Helgaas <helgaas@...nel.org>
To:     Rajat Khandelwal <rajat.khandelwal@...ux.intel.com>
Cc:     ruscur@...sell.cc, oohall@...il.com, bhelgaas@...gle.com,
        linuxppc-dev@...ts.ozlabs.org, linux-pci@...r.kernel.org,
        linux-kernel@...r.kernel.org, rajat.khandelwal@...el.com,
        Paul Menzel <pmenzel@...gen.mpg.de>,
        "Neftin, Sasha" <sasha.neftin@...el.com>,
        Leon Romanovsky <leon@...nel.org>,
        Frederick Zhang <frederick888@...ndere.moe>
Subject: Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable
 errors

[+cc Paul, Sasha, Leon, Frederick]

(Please cc folks who have commented on previous versions of your
patch.)

On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during thunderbolt PCIe
> tunneling.
> 
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.

I want a better understanding of why we have so many errors before
rate-limiting everybody.

> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0:    [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0:    [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0:    [12] Timeout

Please remove the timestamps; they don't contribute to understanding
the problem.

> This gets repeated continuously, thus inundating the buffer.

Did you verify that we actually clear the Correctable Error Status
register?

https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
similar issue.  The issue Frederick is seeing happens when resuming
from sleep.  Is there some event that triggers the correctable errors
you see?

Bjorn