[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d56fcbea-8405-4f61-9c32-63db88f1483c@amd.com>
Date: Wed, 12 Feb 2025 18:08:13 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Dan Williams <dan.j.williams@...el.com>, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
nifan.cxl@...il.com, dave@...olabs.net, jonathan.cameron@...wei.com,
dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
bhelgaas@...gle.com, mahesh@...ux.ibm.com, ira.weiny@...el.com,
oohall@...il.com, Benjamin.Cheatham@....com, rrichter@....com,
nathan.fontenot@....com, Smita.KoralahalliChannabasappa@....com,
lukas@...ner.de, ming.li@...omail.com, PradeepVineshReddy.Kodamati@....com
Subject: Re: [PATCH v7 10/17] cxl/pci: Add log message and add type check in
existing RAS handlers
On 2/12/2025 4:59 PM, Dan Williams wrote:
> Terry Bowman wrote:
>> The CXL RAS handlers do not currently log if the RAS registers are
>> unmapped. This is needed in order to help debug CXL error handling. Update
>> the CXL driver to log a warning message if the RAS register block is
>> unmapped.
>>
>> Also, add type check before processing EP or RCH DP.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>> Reviewed-by: Ira Weiny <ira.weiny@...el.com>
>> Reviewed-by: Gregory Price <gourry@...rry.net>
>> ---
>> drivers/cxl/core/pci.c | 20 ++++++++++++++------
>> 1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 69bb030aa8e1..af809e7cbe3b 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -658,15 +658,19 @@ static void __cxl_handle_cor_ras(struct device *dev,
>> void __iomem *addr;
>> u32 status;
>>
>> - if (!ras_base)
>> + if (!ras_base) {
>> + dev_warn_once(dev, "CXL RAS register block is not mapped");
>> return;
>> + }
>>
>> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>> status = readl(addr);
>> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> + return;
>> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> + if (is_cxl_memdev(dev))
>> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> I think trace_cxl_aer_correctable_error() should always fire and this
> should somehow be unified with the CPER record trace-event for protocol
> errors.
>
> The only usage of @memdev in this trace is retrieving the device serial
> number. If the device is not a memdev then print zero for the serial
> number, or something like that.
>
> In the end RAS daemon should only need to enable one trace event to get
> protocol errors and header logs from ports or endpoints, either
> natively, or via CPER.
>
That would be: we use 'struct *device' instead of 'struct *cxl_memdev'
and pass serial# in as a parameter (0 in non-EP cases)?
>> - }
>> }
>>
>> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -702,8 +706,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>> u32 status;
>> u32 fe;
>>
>> - if (!ras_base)
>> + if (!ras_base) {
>> + dev_warn_once(dev, "CXL RAS register block is not mapped");
> Is this a "never can happen" print? It seems like an oversight in an
> upper layer to get this far down error reporting without the registers
> mapped.
>
> Like maybe this is a bug in a driver that should crash, or the driver
> should not be registering broken error handlers?
Correct. The error handler assignment and enablement is gated by RAS mapping
in cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
Terry
Powered by blists - more mailing lists