[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250214152840.00002172@huawei.com>
Date: Fri, 14 Feb 2025 15:28:40 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: "Bowman, Terry" <terry.bowman@....com>
CC: Dan Williams <dan.j.williams@...el.com>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
<nifan.cxl@...il.com>, <dave@...olabs.net>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
<bhelgaas@...gle.com>, <mahesh@...ux.ibm.com>, <ira.weiny@...el.com>,
<oohall@...il.com>, <Benjamin.Cheatham@....com>, <rrichter@....com>,
<nathan.fontenot@....com>, <Smita.KoralahalliChannabasappa@....com>,
<lukas@...ner.de>, <ming.li@...omail.com>,
<PradeepVineshReddy.Kodamati@....com>, <shiju.jose@...wei.com>
Subject: Re: [PATCH v7 10/17] cxl/pci: Add log message and add type check in
existing RAS handlers
On Wed, 12 Feb 2025 18:08:13 -0600
"Bowman, Terry" <terry.bowman@....com> wrote:
> On 2/12/2025 4:59 PM, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The CXL RAS handlers do not currently log if the RAS registers are
> >> unmapped. This is needed in order to help debug CXL error handling. Update
> >> the CXL driver to log a warning message if the RAS register block is
> >> unmapped.
> >>
> >> Also, add type check before processing EP or RCH DP.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@....com>
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
> >> Reviewed-by: Ira Weiny <ira.weiny@...el.com>
> >> Reviewed-by: Gregory Price <gourry@...rry.net>
> >> ---
> >> drivers/cxl/core/pci.c | 20 ++++++++++++++------
> >> 1 file changed, 14 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> >> index 69bb030aa8e1..af809e7cbe3b 100644
> >> --- a/drivers/cxl/core/pci.c
> >> +++ b/drivers/cxl/core/pci.c
> >> @@ -658,15 +658,19 @@ static void __cxl_handle_cor_ras(struct device *dev,
> >> void __iomem *addr;
> >> u32 status;
> >>
> >> - if (!ras_base)
> >> + if (!ras_base) {
> >> + dev_warn_once(dev, "CXL RAS register block is not mapped");
> >> return;
> >> + }
> >>
> >> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> >> status = readl(addr);
> >> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> >> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> >> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> >> + return;
> >> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> >> +
> >> + if (is_cxl_memdev(dev))
> >> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> > I think trace_cxl_aer_correctable_error() should always fire and this
> > should somehow be unified with the CPER record trace-event for protocol
> > errors.
> >
> > The only usage of @memdev in this trace is retrieving the device serial
> > number. If the device is not a memdev then print zero for the serial
> > number, or something like that.
> >
> > In the end RAS daemon should only need to enable one trace event to get
> > protocol errors and header logs from ports or endpoints, either
> > natively, or via CPER.
> >
> That would be: we use 'struct *device' instead of 'struct *cxl_memdev'
> and pass serial# in as a parameter (0 in non-EP cases)?
For a USP may well have a serial number cap. Might be worth getting it?
slightly nasty thing here will be change of memdev=%s to devname=%s or
similar. Meh. It's a TP_printk() I'm not sure anyone will care.
Need to be careful with the tracepoint itself though and the rasdaemon
etc handling.
https://github.com/mchehab/rasdaemon/blob/master/ras-cxl-handler.c#L351
We may have to just add another field.
>
> >> - }
> >> }
> >>
> >> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> >> @@ -702,8 +706,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> >> u32 status;
> >> u32 fe;
> >>
> >> - if (!ras_base)
> >> + if (!ras_base) {
> >> + dev_warn_once(dev, "CXL RAS register block is not mapped");
> > Is this a "never can happen" print? It seems like an oversight in an
> > upper layer to get this far down error reporting without the registers
> > mapped.
> >
> > Like maybe this is a bug in a driver that should crash, or the driver
> > should not be registering broken error handlers?
> Correct. The error handler assignment and enablement is gated by RAS mapping
> in cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>
> Terry
>
>
>
>
>
Powered by blists - more mailing lists