linux-kernel - Re: [PATCH v7 10/17] cxl/pci: Add log message and add type check in existing RAS handlers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d56fcbea-8405-4f61-9c32-63db88f1483c@amd.com>
Date: Wed, 12 Feb 2025 18:08:13 -0600
From: "Bowman, Terry" <terry.bowman@....com>
To: Dan Williams <dan.j.williams@...el.com>, linux-cxl@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
 nifan.cxl@...il.com, dave@...olabs.net, jonathan.cameron@...wei.com,
 dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
 bhelgaas@...gle.com, mahesh@...ux.ibm.com, ira.weiny@...el.com,
 oohall@...il.com, Benjamin.Cheatham@....com, rrichter@....com,
 nathan.fontenot@....com, Smita.KoralahalliChannabasappa@....com,
 lukas@...ner.de, ming.li@...omail.com, PradeepVineshReddy.Kodamati@....com
Subject: Re: [PATCH v7 10/17] cxl/pci: Add log message and add type check in
 existing RAS handlers



On 2/12/2025 4:59 PM, Dan Williams wrote:
> Terry Bowman wrote:
>> The CXL RAS handlers do not currently log if the RAS registers are
>> unmapped. This is needed in order to help debug CXL error handling. Update
>> the CXL driver to log a warning message if the RAS register block is
>> unmapped.
>>
>> Also, add type check before processing EP or RCH DP.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@....com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@...wei.com>
>> Reviewed-by: Ira Weiny <ira.weiny@...el.com>
>> Reviewed-by: Gregory Price <gourry@...rry.net>
>> ---
>>  drivers/cxl/core/pci.c | 20 ++++++++++++++------
>>  1 file changed, 14 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 69bb030aa8e1..af809e7cbe3b 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -658,15 +658,19 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  	void __iomem *addr;
>>  	u32 status;
>>  
>> -	if (!ras_base)
>> +	if (!ras_base) {
>> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>>  		return;
>> +	}
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> +	if (is_cxl_memdev(dev))
>>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> I think trace_cxl_aer_correctable_error() should always fire and this
> should somehow be unified with the CPER record trace-event for protocol
> errors.
>
> The only usage of @memdev in this trace is retrieving the device serial
> number. If the device is not a memdev then print zero for the serial
> number, or something like that.
>
> In the end RAS daemon should only need to enable one trace event to get
> protocol errors and header logs from ports or endpoints, either
> natively, or via CPER.
>
That would be: we use 'struct *device' instead of 'struct *cxl_memdev'
and pass serial# in as a parameter (0 in non-EP cases)?

>> -	}
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -702,8 +706,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>>  	u32 status;
>>  	u32 fe;
>>  
>> -	if (!ras_base)
>> +	if (!ras_base) {
>> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
> Is this a "never can happen" print? It seems like an oversight in an
> upper layer to get this far down error reporting without the registers
> mapped.
>
> Like maybe this is a bug in a driver that should crash, or the driver
> should not be registering broken error handlers?
Correct. The error handler assignment and enablement is gated by RAS mapping
in cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().

Terry