[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8286502.jJDZkT8p0M@fdefranc-mobl3>
Date: Wed, 23 Oct 2024 15:35:18 +0200
From: "Fabio M. De Francesco" <fabio.m.de.francesco@...ux.intel.com>
To: linux-kernel@...r.kernel.org, Dan Williams <dan.j.williams@...el.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown <lenb@...nel.org>,
Mahesh J Salgaonkar <mahesh@...ux.ibm.com>,
Oliver O'Halloran <oohall@...il.com>, Bjorn Helgaas <bhelgaas@...gle.com>,
linux-acpi@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
linux-pci@...r.kernel.org, Dan Williams <dan.j.williams@...el.com>
Subject: Re: [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section
On Tuesday, August 6, 2024 9:56:24 PM GMT+2 Dan Williams wrote:
> Fabio M. De Francesco wrote:
> > Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI
> > v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd().
>
> I think the critical detail is is that print_extlog_rcd() is only
> triggered when ras_userspace_consumers() returns true. The observation
> is that ras_userspace_consumers() hides information from the trace path
> when the intended purpose of it was to hide duplicate emissions to the
> kernel log when userspace is watching the tracepoints.
>
> Setting aside whether ras_userspace_consumers() is still a good idea or
> not, it is obvious that this patch as is may surprise environments that
> start seeing kernel error logs where the kernel was silent before.
>
> I think the path of least surprise would be to make sure that
> pci_print_aer() optionally skips emitting to the kernel log when not
> needed wanted.
Sorry for replying so late...
I'm not entirely sure that users would not prefer to be surprised by
_finally_ seeing kernel error logs for failing PCIe components. I suspect
that users might have been confused by not seeing any output.
> So perhaps first do a lead-in patch to optionally quiet the print
> messages in pci_print_aer() and then pass in KERN_DEBUG from the
> extlog_print() path. Then we can decide later what to do about
> ras_userspace_consumers().
Anyway, I'll do it.
> > the similar ghes_do_proc() (GHES) prints to kernel log and calls
> > pci_print_aer() to report via the ftrace infrastructure.
> >
> > Add support to report the CPER PCIe Error section also via the ftrace
> > infrastructure by calling pci_print_aer() to make ELOG act consistently
> > with GHES.
>
> You might also want to explain a bit about the motivation for this which
> is that I/O Machine Check Arcitecture events may signal failing PCIe
> components or links. The AER event contains details on what was
> happening on the wire when the error was signaled.
Yes, I agree.
> >
> > Cc: Dan Williams <dan.j.williams@...el.com>
> > Signed-off-by: Fabio M. De Francesco
<fabio.m.de.francesco@...ux.intel.com>
> > ---
> > drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++
> > drivers/pci/pcie/aer.c | 2 +-
> > include/linux/aer.h | 13 +++++++++++--
> > 3 files changed, 42 insertions(+), 3 deletions(-)
> >
> > [...]
> >
> > + pci_print_aer(pdev, aer_severity, aer);
>
> ...per above this would become:
>
> pci_print_aer(KERN_DEBUG, pdev, aer_severity, aer);
>
> [..]
>
> Rest of the changes look good to me.
I need to be sure that I understood...
void pci_print_aer(char *level, struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer)
{
[...]
if (printk_get_level(level) <= console_loglevel) {
pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
status, mask);
__aer_print_error(dev, &info);
pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
aer_error_layer[layer], aer_agent_string[agent]);
if (aer_severity != AER_CORRECTABLE)
pci_err(dev, "aer_uncor_severity: 0x%08x\n",
aer->uncor_severity);
if (tlp_header_valid)
__print_tlp_header(dev, &aer->header_log);
}
[...]
}
It would require changing a couple of call sites, like in
aer_recover_work_func():
pci_print_aer(KERN_ERR, pdev, entry.severity, entry.regs);
Would you please confirm that the code shown above is what
you asked for?
Thanks,
Fabio
Powered by blists - more mailing lists