lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250612172502.0000692e@huawei.com>
Date: Thu, 12 Jun 2025 17:25:02 +0100
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: "Bowman, Terry" <terry.bowman@....com>
CC: Shiju Jose <shiju.jose@...wei.com>, "PradeepVineshReddy.Kodamati@....com"
	<PradeepVineshReddy.Kodamati@....com>, "dave@...olabs.net"
	<dave@...olabs.net>, "dave.jiang@...el.com" <dave.jiang@...el.com>,
	"alison.schofield@...el.com" <alison.schofield@...el.com>,
	"vishal.l.verma@...el.com" <vishal.l.verma@...el.com>, "ira.weiny@...el.com"
	<ira.weiny@...el.com>, "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
	"bhelgaas@...gle.com" <bhelgaas@...gle.com>, "bp@...en8.de" <bp@...en8.de>,
	"ming.li@...omail.com" <ming.li@...omail.com>, "dan.carpenter@...aro.org"
	<dan.carpenter@...aro.org>, "Smita.KoralahalliChannabasappa@....com"
	<Smita.KoralahalliChannabasappa@....com>, "kobayashi.da-06@...itsu.com"
	<kobayashi.da-06@...itsu.com>, "yanfei.xu@...el.com" <yanfei.xu@...el.com>,
	"rrichter@....com" <rrichter@....com>, "peterz@...radead.org"
	<peterz@...radead.org>, "colyli@...e.de" <colyli@...e.de>,
	"uaisheng.ye@...el.com" <uaisheng.ye@...el.com>,
	"fabio.m.de.francesco@...ux.intel.com"
	<fabio.m.de.francesco@...ux.intel.com>, "ilpo.jarvinen@...ux.intel.com"
	<ilpo.jarvinen@...ux.intel.com>, "yazen.ghannam@....com"
	<yazen.ghannam@....com>, "linux-cxl@...r.kernel.org"
	<linux-cxl@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "linux-pci@...r.kernel.org"
	<linux-pci@...r.kernel.org>, Mauro Carvalho Chehab
	<mchehab+huawei@...nel.org>
Subject: Re: [PATCH v9 10/16] cxl/pci: Unify CXL trace logging for CXL
 Endpoints and CXL Ports

On Fri, 6 Jun 2025 10:24:25 -0500
"Bowman, Terry" <terry.bowman@....com> wrote:

> On 6/6/2025 9:41 AM, Bowman, Terry wrote:
> >
> > On 6/6/2025 4:08 AM, Shiju Jose wrote:  
> >>> -----Original Message-----
> >>> From: Terry Bowman <terry.bowman@....com>
> >>> Sent: 03 June 2025 18:23
> >>> To: PradeepVineshReddy.Kodamati@....com; dave@...olabs.net; Jonathan
> >>> Cameron <jonathan.cameron@...wei.com>; dave.jiang@...el.com;
> >>> alison.schofield@...el.com; vishal.l.verma@...el.com; ira.weiny@...el.com;
> >>> dan.j.williams@...el.com; bhelgaas@...gle.com; bp@...en8.de;
> >>> ming.li@...omail.com; Shiju Jose <shiju.jose@...wei.com>;
> >>> dan.carpenter@...aro.org; Smita.KoralahalliChannabasappa@....com;
> >>> kobayashi.da-06@...itsu.com; terry.bowman@....com; yanfei.xu@...el.com;
> >>> rrichter@....com; peterz@...radead.org; colyli@...e.de;
> >>> uaisheng.ye@...el.com; fabio.m.de.francesco@...ux.intel.com;
> >>> ilpo.jarvinen@...ux.intel.com; yazen.ghannam@....com; linux-
> >>> cxl@...r.kernel.org; linux-kernel@...r.kernel.org; linux-pci@...r.kernel.org
> >>> Subject: [PATCH v9 10/16] cxl/pci: Unify CXL trace logging for CXL Endpoints and
> >>> CXL Ports
> >>>
> >>> CXL currently has separate trace routines for CXL Port errors and CXL Endpoint
> >>> errors. This is inconvenient for the user because they must enable
> >>> 2 sets of trace routines. Make updates to the trace logging such that a single
> >>> trace routine logs both CXL Endpoint and CXL Port protocol errors.
> >>>
> >>> Rename the 'host' field from the CXL Endpoint trace to 'parent' in the unified
> >>> trace routines. 'host' does not correctly apply to CXL Port devices. Parent is more
> >>> general and applies to CXL Port devices and CXL Endpoints.
> >>>
> >>> Add serial number parameter to the trace logging. This is used for EPs and 0 is
> >>> provided for CXL port devices without a serial number.
> >>>
> >>> Below is output of correctable and uncorrectable protocol error logging.
> >>> CXL Root Port and CXL Endpoint examples are included below.
> >>>
> >>> Root Port:
> >>> cxl_aer_correctable_error: device=0000:0c:00.0 parent=pci0000:0c serial: 0
> >>> status='CRC Threshold Hit'
> >>> cxl_aer_uncorrectable_error: device=0000:0c:00.0 parent=pci0000:0c serial: 0
> >>> status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity
> >>> Error'
> >>>
> >>> Endpoint:
> >>> cxl_aer_correctable_error: device=mem3 parent=0000:0f:00.0 serial=0
> >>> status='CRC Threshold Hit'
> >>> cxl_aer_uncorrectable_error: device=mem3 parent=0000:0f:00.0 serial: 0
> >>> status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity
> >>> Error'
> >>>
> >>> Signed-off-by: Terry Bowman <terry.bowman@....com>
> >>> ---
> >>> drivers/cxl/core/pci.c   | 18 +++++----
> >>> drivers/cxl/core/ras.c   | 14 ++++---
> >>> drivers/cxl/core/trace.h | 84 +++++++++-------------------------------
> >>> 3 files changed, 37 insertions(+), 79 deletions(-)
> >>>
> >>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index
> >>> 186a5a20b951..0f4c07fd64a5 100644
> >>> --- a/drivers/cxl/core/pci.c
> >>> +++ b/drivers/cxl/core/pci.c
> >>> @@ -664,7 +664,7 @@ void read_cdat_data(struct cxl_port *port)  }
> >>> EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
> >>>  
> >> [...]  
> >>> static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data
> >>> *data) diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h index
> >>> 25ebfbc1616c..8c91b0f3d165 100644
> >>> --- a/drivers/cxl/core/trace.h
> >>> +++ b/drivers/cxl/core/trace.h
> >>> @@ -48,49 +48,22 @@
> >>> 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
> >>> )
> >>>
> >>> -TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> >>> -	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> >>> -	TP_ARGS(dev, status, fe, hl),
> >>> -	TP_STRUCT__entry(
> >>> -		__string(device, dev_name(dev))
> >>> -		__string(host, dev_name(dev->parent))
> >>> -		__field(u32, status)
> >>> -		__field(u32, first_error)
> >>> -		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> >>> -	),
> >>> -	TP_fast_assign(
> >>> -		__assign_str(device);
> >>> -		__assign_str(host);
> >>> -		__entry->status = status;
> >>> -		__entry->first_error = fe;
> >>> -		/*
> >>> -		 * Embed the 512B headerlog data for user app retrieval and
> >>> -		 * parsing, but no need to print this in the trace buffer.
> >>> -		 */
> >>> -		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> >>> -	),
> >>> -	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
> >>> -		  __get_str(device), __get_str(host),
> >>> -		  show_uc_errs(__entry->status),
> >>> -		  show_uc_errs(__entry->first_error)
> >>> -	)
> >>> -);
> >>> -
> >>> TRACE_EVENT(cxl_aer_uncorrectable_error,
> >>> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32
> >>> *hl),
> >>> -	TP_ARGS(cxlmd, status, fe, hl),
> >>> +	TP_PROTO(struct device *dev, u64 serial, u32 status, u32 fe,
> >>> +		 u32 *hl),
> >>> +	TP_ARGS(dev, serial, status, fe, hl),
> >>> 	TP_STRUCT__entry(
> >>> -		__string(memdev, dev_name(&cxlmd->dev))
> >>> -		__string(host, dev_name(cxlmd->dev.parent))
> >>> +		__string(name, dev_name(dev))
> >>> +		__string(parent, dev_name(dev->parent))  
> >> Hi Terry,
> >>
> >> As we pointed out in v8, renaming the fields "memdev" to "name" and "host" to "parent"
> >> causes issues and failures in userspace rasdaemon  while parsing the trace event data.
> >> Additionally, we can't rename these fields in rasdaemon  due to backward compatibility.  
> > Yes, I remember but didn't understand why other SW couldn't be updated to handle. I will
> > change as you request but many people will be confused why a port device's name is labeled
> > as a memdev. memdev is only correct for EPs and does not correctly reflect *any* of the
> > other CXL device types (RP, USP, DSP).

RAS trace points are ABI so you can introduce whatever you like that is new, but you can't
remove or rename existing elements. If userspace breaks (which Shiju is confirming it will)
and anyone reports the regression, then you get grumpy Linus.

Test this stuff against upstream rasdaemon. It's easy and means things
like this don't sneak in - various cloud vendors have stated it's what their
stacks are based on, so they will see this in production if they get a mismatch.

https://github.com/mchehab/rasdaemon/blob/master/ras-cxl-handler.c#L354

It think this will error out if an AER UE does not contain the memdev field.

So your options are a new tracepoint, or leave those fields in place and just
let them contain whatever is most useful. If you go with new tracepoint you
still should generate the old one as well if it is enabled.  Note that
even adding fields can be a bit painful as it requires some DB restructuring
on an upgrade of rasdaemon (or dumping existing logs and starting from scratch).

+CC Mauro in case I've missed any other ways to deal with this change.


> >  
> >>> 		__field(u64, serial)
> >>> 		__field(u32, status)
> >>> 		__field(u32, first_error)
> >>> 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> >>> 	),
> >>> 	TP_fast_assign(
> >>> -		__assign_str(memdev);
> >>> -		__assign_str(host);
> >>> -		__entry->serial = cxlmd->cxlds->serial;
> >>> +		__assign_str(name);
> >>> +		__assign_str(parent);
> >>> +		__entry->serial = serial;
> >>> 		__entry->status = status;
> >>> 		__entry->first_error = fe;
> >>> 		/*
> >>> @@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> >>> 		 */
> >>> 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> >>> 	),
> >>> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error:
> >>> '%s'",
> >>> -		  __get_str(memdev), __get_str(host), __entry->serial,
> >>> +	TP_printk("device=%s parent=%s serial=%lld status='%s'
> >>> first_error='%s'",
> >>> +		  __get_str(name), __get_str(parent), __entry->serial,
> >>> 		  show_uc_errs(__entry->status),
> >>> 		  show_uc_errs(__entry->first_error)
> >>> 	)
> >>> @@ -124,42 +97,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> >>> 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer"
> >>> }	\
> >>> )
> >>>
> >>> -TRACE_EVENT(cxl_port_aer_correctable_error,
> >>> -	TP_PROTO(struct device *dev, u32 status),
> >>> -	TP_ARGS(dev, status),
> >>> -	TP_STRUCT__entry(
> >>> -		__string(device, dev_name(dev))
> >>> -		__string(host, dev_name(dev->parent))
> >>> -		__field(u32, status)
> >>> -	),
> >>> -	TP_fast_assign(
> >>> -		__assign_str(device);
> >>> -		__assign_str(host);
> >>> -		__entry->status = status;
> >>> -	),
> >>> -	TP_printk("device=%s host=%s status='%s'",
> >>> -		  __get_str(device), __get_str(host),
> >>> -		  show_ce_errs(__entry->status)
> >>> -	)
> >>> -);
> >>> -
> >>> TRACE_EVENT(cxl_aer_correctable_error,
> >>> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
> >>> -	TP_ARGS(cxlmd, status),
> >>> +	TP_PROTO(struct device *dev, u64 serial, u32 status),
> >>> +	TP_ARGS(dev, serial, status),
> >>> 	TP_STRUCT__entry(
> >>> -		__string(memdev, dev_name(&cxlmd->dev))
> >>> -		__string(host, dev_name(cxlmd->dev.parent))
> >>> +		__string(name, dev_name(dev))
> >>> +		__string(parent, dev_name(dev->parent))  
> >> Renaming these fields is an issue for userspace as mentioned above 
> >> in cxl_aer_uncorrectable_error.  
> > I understand, I'll revert as you request.
> >
> > Terry  
> 
> I'll update the commit message with explanation for leaving as-is.
> 
> Terry
> >>> 		__field(u64, serial)
> >>> 		__field(u32, status)
> >>> 	),
> >>> 	TP_fast_assign(
> >>> -		__assign_str(memdev);
> >>> -		__assign_str(host);
> >>> -		__entry->serial = cxlmd->cxlds->serial;
> >>> +		__assign_str(name);
> >>> +		__assign_str(parent);
> >>> +		__entry->serial = serial;
> >>> 		__entry->status = status;
> >>> 	),
> >>> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
> >>> -		  __get_str(memdev), __get_str(host), __entry->serial,
> >>> +	TP_printk("device=%s parent=%s serial=%lld status='%s'",
> >>> +		  __get_str(name), __get_str(parent), __entry->serial,
> >>> 		  show_ce_errs(__entry->status)
> >>> 	)
> >>> );
> >>> --
> >>> 2.34.1  
> >> Thanks,
> >> Shiju  
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ