[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <66b3d8ec30e74_2142c2945d@dwillia2-mobl3.amr.corp.intel.com.notmuch>
Date: Wed, 7 Aug 2024 13:28:28 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Bjorn Helgaas <helgaas@...nel.org>, "Fabio M. De Francesco"
<fabio.m.de.francesco@...ux.intel.com>
CC: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>, Oliver O'Halloran
<oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>,
<linux-kernel@...r.kernel.org>, <linux-acpi@...r.kernel.org>,
<linuxppc-dev@...ts.ozlabs.org>, <linux-pci@...r.kernel.org>, Dan Williams
<dan.j.williams@...el.com>
Subject: Re: [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section
[ add Boris ]
Bjorn Helgaas wrote:
> On Mon, May 27, 2024 at 04:43:41PM +0200, Fabio M. De Francesco wrote:
> > Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI
> > v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd(). Instead,
> > the similar ghes_do_proc() (GHES) prints to kernel log and calls
> > pci_print_aer() to report via the ftrace infrastructure.
> >
> > Add support to report the CPER PCIe Error section also via the ftrace
> > infrastructure by calling pci_print_aer() to make ELOG act consistently
> > with GHES.
> >
> > Cc: Dan Williams <dan.j.williams@...el.com>
> > Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@...ux.intel.com>
> > ---
> > drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
> > drivers/pci/pcie/aer.c | 2 +-
> > include/linux/aer.h | 13 +++++++++++--
> > 3 files changed, 42 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> > index e025ae390737..007ce96f8672 100644
> > --- a/drivers/acpi/acpi_extlog.c
> > +++ b/drivers/acpi/acpi_extlog.c
> > @@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx,
> > return 1;
> > }
> >
> > +static void extlog_print_pcie(struct cper_sec_pcie *pcie_err,
> > + int severity)
> > +{
> > + struct aer_capability_regs *aer;
> > + struct pci_dev *pdev;
> > + unsigned int devfn;
> > + unsigned int bus;
> > + int aer_severity;
> > + int domain;
> > +
> > + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> > + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
> > + aer_severity = cper_severity_to_aer(severity);
> > + aer = (struct aer_capability_regs *)pcie_err->aer_info;
> > + domain = pcie_err->device_id.segment;
> > + bus = pcie_err->device_id.bus;
> > + devfn = PCI_DEVFN(pcie_err->device_id.device,
> > + pcie_err->device_id.function);
> > + pdev = pci_get_domain_bus_and_slot(domain, bus, devfn);
> > + if (!pdev)
> > + return;
> > + pci_print_aer(pdev, aer_severity, aer);
> > + pci_dev_put(pdev);
> > + }
>
> I'm 100% in favor of making error reporting work and look the same
> across GHES and ELOG. But I do have to gripe a bit...
>
> It's already unfortunate that GHES and the native AER handling are
> separate paths that lead to the same place (__aer_print_error()).
>
> I'm sorry that we need to add a third path that again does
> fundamentally the same thing. The fact that they're separate means
> all the design, reviewing, testing, and maintenance effort is diluted,
> and error handling always gets too little love in the first place.
> I think this is a recipe for confusion.
>
> ghes_do_proc # GHES
> apei_estatus_for_each_section
> ...
> if (guid_equal(sec_type, &CPER_SEC_PCIE))
> ghes_handle_aer
> cper_severity_to_aer
> aer_recover_queue
> kfifo_in_spinlocked(&aer_recover_ring) # add to queue
> aer_recover_work_func # another thread
> kfifo_get(&aer_recover_ring) # remove from queue
> pci_print_aer
> __aer_print_error <---
>
> aer_irq # native AER
> kfifo_put(&aer_fifo) # add to queue
> aer_isr # another thread
> kfifo_get(&aer_fifo) # remove from queue
> ...
> aer_isr_one_error
> aer_process_err_devices
> aer_print_error
> __aer_print_error <---
>
> extlog_print # extlog (x86 only)
> apei_estatus_for_each_section
> ...
> if (guid_equal(sec_type, &CPER_SEC_PCIE))
> extlog_print_pcie
> cper_severity_to_aer
> pci_get_domain_bus_and_slot
> pci_print_aer
> __aer_print_error <---
>
> And we also have CXL paths that lead to __aer_print_error(), although
> it seems like they at least start in the native AER (and maybe GHES?)
> path and branch out somewhere. My head is spinning.
>
> Do I *object* to this patch? No, not really; it's a trivial change in
> drivers/pci/, and Rafael can add my
>
> Acked-by: Bjorn Helgaas <bhelgaas@...gle.com>
>
> as needed. But I am afraid we're making ourselves a maintenance
> headache.
To be honest, I am too. Upon discovering that extlog_print() behaves
differently than ghes_do_proc(), I had the snarky thought "great, can we
now just go ahead and deprecate the extlog path, it's just a source of
maintenance pain.".
So *if*we keep acpi_extlog it then I definitely think it should be
consistent with other CPER handlers (needs this patch). But, I am also
open to entertaining "deprecate it".
To me, the fact that ras_userspace_consumers() is only honored for
acpi_extlog is clear evidence that the kernel has already painted itself
into a confusing user ABI corner and maybe the proper path forward at
this point is to cut acpi_extlog loose.
Powered by blists - more mailing lists