[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250523162110.GA1564297@bhelgaas>
Date: Fri, 23 May 2025 11:21:10 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: linux-pci@...r.kernel.org
Cc: Jon Pan-Doh <pandoh@...gle.com>,
Karolina Stolarek <karolina.stolarek@...cle.com>,
Weinan Liu <wnliu@...gle.com>,
Martin Petersen <martin.petersen@...cle.com>,
Ben Fuller <ben.fuller@...cle.com>,
Drew Walton <drewwalton@...rosoft.com>,
Anil Agrawal <anilagrawal@...a.com>,
Tony Luck <tony.luck@...el.com>,
Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@...ux.intel.com>,
Lukas Wunner <lukas@...ner.de>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Sargun Dhillon <sargun@...a.com>,
"Paul E . McKenney" <paulmck@...nel.org>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>,
Kai-Heng Feng <kaihengf@...dia.com>,
Keith Busch <kbusch@...nel.org>, Robert Richter <rrichter@....com>,
Terry Bowman <terry.bowman@....com>,
Shiju Jose <shiju.jose@...wei.com>,
Dave Jiang <dave.jiang@...el.com>, linux-kernel@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: [PATCH v8 00/20] Rate limit AER logs
On Thu, May 22, 2025 at 06:21:06PM -0500, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@...gle.com>
>
> This work is mostly due to Jon Pan-Doh and Karolina Stolarek. I rebased
> this to v6.15-rc1, factored out some of the trace and statistics updates,
> and added some minor cleanups.
>
> I pushed this to pci/aer at
> https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/log/?h=aer
> (head a524e63307cf ("PCI/AER: Add sysfs attributes for log ratelimits"))
> and appended the interdiff from v7 to v8 below.
>
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits for
> more robust error logging. Allow userspace to configure ratelimits via
> sysfs knobs.
>
> Motivation
> ==========
>
> Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
> of devices), affects repairabilitiy flows for fleet operators.
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services will allow for more
> predictable repair flows and decrease machine downtime.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4], [5]).
> The most recent attempt[5] has many similarities with the proposed
> approach.
>
>
> v8:
> - Rename sysfs ratelimit burst files:
> ratelimit_burst_cor_log -> correctable_ratelimit_burst (Sathy)
> ratelimit_burst_uncor_log -> nonfatal_ratelimit_burst
> - Split sysfs ratelimit_log_enable for correctable and nonfatal and make it
> an interval instead of a toggle:
> ratelimit_log_enable -> correctable_ratelimit_interval_ms
> -> nonfatal_ratelimit_interval_ms
> - Rework aer_get_device_error_info() and aer_print_error() to take an index
> instead of pci_dev pointer
> - Move trace_aer_event() out of pci_dev_aer_stats_incr() (Jonathan)
> - Move AER_FATAL checking to aer_ratelimit() to avoid calling
> __ratelimit(nonfatal_ratelimit) when we know we don't want to ratelimit
> fatal errors (Jonathan)
> - Move all Error Source ID string building into aer_print_source() instead
> of putting part in caller (Jonathan)
> - Rename struct aer_err_info.ratelimit -> ratelimit_print[] (Jonathan)
> - Pass printk level into pcie_print_tlp_log() (Jonathan)
> - Rework Error Source ratelimiting vs detail ratelimiting (Jonathan)
> v7: https://lore.kernel.org/r/20250520215047.1350603-1-helgaas@kernel.org
> - Update sysfs doc target kernel version & date (Ilpo)
> - Fix sysfs doc "AER ratelimiting" typo (Ilpo)
> - Ratelimit Correctable and Non-Fatal but not Fatal errors (Sathy)
> - Rename "struct aer_report" to "aer_info" (Sathy)
> - Expand comments about combining ratelimit for multiple devices (Ilpo)
> - Rework Error Source logging ratelimiting (Sathy)
> - Factor out aer_isr_one_error_type() to reduce code duplication
> - Log DPC errors, which are all Fatal, at KERN_ERR (Sathy)
> - Improve dpc_process_error() structure (Ilpo)
> v6: https://lore.kernel.org/r/20250519213603.1257897-1-helgaas@kernel.org
> - Rebase to v6.15-rc1
> - Initialize struct aer_err_info completely before using it
> - Log DPC Error Source ID only when it's valid
> - Consolidate AER Error Source ID logging to one place
> - Tidy Error Source ID bus/dev/fn decoding using macros
> - Rename aer_print_port_info() to aer_print_source()
> - Consolidate trace events and statistic updates to one non-ratelimited place
> - Save log level in struct aer_err_info instead of passing as parameter
> v5: https://lore.kernel.org/r/20250321015806.954866-1-pandoh@google.com
> - Handle multi-error AER by evaluating ratelimits once and storing result
> - Reword/rename commit messages/functions/variable
> v4: https://lore.kernel.org/r/20250320082057.622983-1-pandoh@google.com
> - Fix bug where trace not emitted with malformed aer_err_info
> - Extend ratelimit to malformed aer_err_info
> - Update commit messages with patch motivation
> - Squash AER sysfs filename change (Patch 8)
> v3: https://lore.kernel.org/r/20250319084050.366718-1-pandoh@google.com
> - Ratelimit aer_print_port_info() (drop Patch 1)
> - Add ratelimit enable toggle
> - Move trace outside of ratelimit
> - Split log level (Patch 2) into two
> - More descriptive documentation/sysfs naming
> v2: https://lore.kernel.org/r/20250214023543.992372-1-pandoh@google.com
> - Rebased on top of pci/aer (6.14.rc-1)
> - Split series into log and IRQ ratelimits (defer patch 5)
> - Dropped patch 8 (Move AER sysfs)
> - Added log level cleanup patch[7] from Karolina's series
> - Fixed bug where dpc errors didn't increment counters
> - "X callbacks suppressed" message on ratelimit release -> immediately
> - Separate documentation into own patch
> v1: https://lore.kernel.org/r/20250115074301.3514927-1-pandoh@google.com
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6]
> https://lore.kernel.org/linux-pci/8bcb8c9a7b38ce3bdaca5a64fe76f08b0b337511.1742202797.git.k
> arolina.stolarek@...cle.com/
> [7]
> https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.k
> arolina.stolarek@...cle.com/
>
>
>
> Bjorn Helgaas (13):
> PCI/DPC: Initialize aer_err_info before using it
> PCI/DPC: Log Error Source ID only when valid
> PCI/AER: Factor COR/UNCOR error handling out from aer_isr_one_error()
> PCI/AER: Consolidate Error Source ID logging in
> aer_isr_one_error_type()
> PCI/AER: Extract bus/dev/fn in aer_print_port_info() with
> PCI_BUS_NUM(), etc
> PCI/AER: Move aer_print_source() earlier in file
> PCI/AER: Initialize aer_err_info before using it
> PCI/AER: Simplify pci_print_aer()
> PCI/AER: Update statistics before ratelimiting
> PCI/AER: Trace error event before ratelimiting
> PCI/ERR: Add printk level to pcie_print_tlp_log()
> PCI/AER: Convert aer_get_device_error_info(), aer_print_error() to
> index
> PCI/AER: Simplify add_error_device()
>
> Jon Pan-Doh (4):
> PCI/AER: Rename aer_print_port_info() to aer_print_source()
> PCI/AER: Ratelimit correctable and non-fatal error logging
> PCI/AER: Add ratelimits to PCI AER Documentation
> PCI/AER: Add sysfs attributes for log ratelimits
>
> Karolina Stolarek (3):
> PCI/AER: Check log level once and remember it
> PCI/AER: Reduce pci_print_aer() correctable error level to
> KERN_WARNING
> PCI/AER: Rename struct aer_stats to aer_info
>
> ...es-aer_stats => sysfs-bus-pci-devices-aer} | 44 ++
> Documentation/PCI/pcieaer-howto.rst | 17 +-
> drivers/pci/pci-sysfs.c | 1 +
> drivers/pci/pci.h | 13 +-
> drivers/pci/pcie/aer.c | 441 +++++++++++++-----
> drivers/pci/pcie/dpc.c | 73 +--
> drivers/pci/pcie/tlp.c | 6 +-
> include/linux/pci.h | 2 +-
> 8 files changed, 430 insertions(+), 167 deletions(-)
> rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (72%)
I applied this series to pci/aer for v6.16 with the updates below
suggested by Sathy and Ilpo.
My heartfelt thanks to the authors, Jon and Karolina, for seeing the
need for this and putting in the effort, and to all the reviewers who
put so much time and care into reading and polishing it on such a
tight schedule.
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 6c331695af58..70ac66188367 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -786,17 +786,14 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
{
- struct ratelimit_state *ratelimit;
-
- if (severity == AER_FATAL)
- return 1; /* AER_FATAL not ratelimited */
-
- if (severity == AER_CORRECTABLE)
- ratelimit = &dev->aer_info->correctable_ratelimit;
- else
- ratelimit = &dev->aer_info->nonfatal_ratelimit;
-
- return __ratelimit(ratelimit);
+ switch (severity) {
+ case AER_NONFATAL:
+ return __ratelimit(&dev->aer_info->nonfatal_ratelimit);
+ case AER_CORRECTABLE:
+ return __ratelimit(&dev->aer_info->correctable_ratelimit);
+ default:
+ return 1; /* Don't ratelimit fatal errors */
+ }
}
static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
@@ -841,7 +838,7 @@ void aer_print_error(struct aer_err_info *info, int i)
int layer, agent, id;
const char *level = info->level;
- if (i >= AER_MAX_MULTI_ERR_DEVICES)
+ if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
return;
dev = info->dev[i];
Powered by blists - more mailing lists