lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZhRii1BAg86gccWM@snoopy>
Date: Mon, 8 Apr 2024 14:32:59 -0700
From: PJ Waskiewicz <ppwaskie@...nel.org>
To: Dan Williams <dan.j.williams@...el.com>
Cc: Jonathan Cameron <Jonathan.Cameron@...wei.com>,
	Lukas Wunner <lukas@...ner.de>, linux-cxl@...r.kernel.org,
	linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/1] cxl/acpi.c: Add buggy BIOS hint for CXL ACPI lookup
 failure

On 24/04/08 01:45PM, Dan Williams wrote:
> PJ Waskiewicz wrote:
> [..]
> > > Other than that seems reasonable to hint it is probably a bios
> > > bug - however I wonder how many other cases we should do this for and
> > > whether it is worth the effort of marking them all?
> > 
> > I can confirm this was definitely a BIOS bug in this particular case.
> > The vendor spun a quick test BIOS for us to test on an EMR and SPR host,
> > and the _UID's were finally correct.  I could successfully walk the CEDT
> > and get to the CAPS structs I was after (link speed, bus width, etc.).
> 
> Oh, in that case I think there's no need to worry about a Linux quirk.

Copy that.
 
> I do think cxl_acpi has multiple overlapping failure cases when what you
> really want to know is whether:
> 
>    "CXL host bridge can be found in CEDT.CHBS"

Yes.
 
> Turns out that straightforward message is aleady a driver message, but
> it gets skipped in this case. So, I am thinking of cleanup /
> clarification along the following lines:
> 
> 1/ Lean on the existing cxl_get_chbs() validation paths to report on
> errors
> 
> 2/ Include the device-name rather than the UID since if UID is
> unreliable it does not help you communicate with your BIOS vendor. I.e.
> give a breadcrumb for the BIOS engineer to match the AML device name
> with the CEDT content.
> 
> 3/ Do not fail driver load on a single host-bridge parsing failure
> 
> 4/ These are all cxl_acpi driver init events, so consistently use the
> ACPI0017 device, and the cxl_acpi driver, as the originator of the error
> message.
> 
> Would this clarification have saved you time with the debug?

Inline below, but yes, this would have helped sped up the debug quite a
bit.  Since the initial debug was happening on SPR, and I couldn't use
the host drivers, I was pulling the logic from the CXL host drivers into
my drivers to skip the need for ACPI0017.
 
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index 32091379a97b..5a70d7312c64 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -511,29 +511,26 @@ static int cxl_get_chbs_iter(union acpi_subtable_headers *header, void *arg,
>  	return 0;
>  }
>  
> -static int cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> -			struct cxl_chbs_context *ctx)
> +static void cxl_get_chbs(struct device *dev, struct acpi_device *hb,
> +			 struct cxl_chbs_context *ctx)
>  {
> -	unsigned long long uid;
>  	int rc;
>  
> -	rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL, &uid);
> -	if (rc != AE_OK) {
> -		dev_err(dev, "unable to retrieve _UID\n");
> -		return -ENOENT;
> -	}
> -
> -	dev_dbg(dev, "UID found: %lld\n", uid);
> -	*ctx = (struct cxl_chbs_context) {
> +	*ctx = (struct cxl_chbs_context){
>  		.dev = dev,
> -		.uid = uid,
>  		.base = CXL_RESOURCE_NONE,
>  		.cxl_version = UINT_MAX,
>  	};
>  
> -	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
> +	rc = acpi_evaluate_integer(hb->handle, METHOD_NAME__UID, NULL,
> +				   &ctx->uid);
> +	if (rc != AE_OK) {
> +		dev_dbg(dev, "unable to retrieve _UID\n");
> +		return;
> +	}

Before I read the changes below, I didn't see why this was still
useful...but I do see it now in full context.

>  
> -	return 0;
> +	dev_dbg(dev, "UID found: %lld\n", ctx->uid);
> +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CHBS, cxl_get_chbs_iter, ctx);
>  }
>  
>  static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
> @@ -561,7 +558,6 @@ static int get_genport_coordinates(struct device *dev, struct cxl_dport *dport)
>  static int add_host_bridge_dport(struct device *match, void *arg)
>  {
>  	int ret;
> -	acpi_status rc;
>  	struct device *bridge;
>  	struct cxl_dport *dport;
>  	struct cxl_chbs_context ctx;
> @@ -573,19 +569,16 @@ static int add_host_bridge_dport(struct device *match, void *arg)
>  	if (!hb)
>  		return 0;
>  
> -	rc = cxl_get_chbs(match, hb, &ctx);
> -	if (rc)
> -		return rc;
> -
> +	cxl_get_chbs(match, hb, &ctx);
>  	if (ctx.cxl_version == UINT_MAX) {
> -		dev_warn(match, "No CHBS found for Host Bridge (UID %lld)\n",
> -			 ctx.uid);
> +		dev_err(host, FW_BUG "No CHBS found for Host Bridge (%s)\n",
> +			dev_name(match));
>  		return 0;
>  	}

This would have been more obvious that the lookup failed for "reasons"
instead of AE_BUFFER_OVERFLOW (which led to the fun Alice-style LXR
spelunking).

>  
>  	if (ctx.base == CXL_RESOURCE_NONE) {
> -		dev_warn(match, "CHBS invalid for Host Bridge (UID %lld)\n",
> -			 ctx.uid);
> +		dev_err(host, FW_BUG "CHBS invalid for Host Bridge (%s)\n",
> +			dev_name(match));

I think this is a good catch-all too in case the lookup is good, but the
vendor borked filling it in properly.

>  		return 0;
>  	}
>  
> @@ -650,13 +643,11 @@ static int add_host_bridge_uport(struct device *match, void *arg)
>  		return 0;
>  	}
>  
> -	rc = cxl_get_chbs(match, hb, &ctx);
> -	if (rc)
> -		return rc;
> -
> +	cxl_get_chbs(match, hb, &ctx);
>  	if (ctx.cxl_version == ACPI_CEDT_CHBS_VERSION_CXL11) {
> -		dev_warn(bridge,
> -			 "CXL CHBS version mismatch, skip port registration\n");
> +		dev_err(host,
> +			FW_BUG "CXL CHBS version mismatch, skip port registration for %s\n",
> +			dev_name(match));
>  		return 0;
>  	}

I like your version here much better than mine.  If you want to merge
that, then:

Reviewed-by: PJ Waskiewicz <ppwaskie@...nel.org>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ