Message-ID: <Z4pj1S4H-MGf-nFW@rric.localdomain>
Date: Fri, 17 Jan 2025 15:06:13 +0100
From: Robert Richter <rrichter@....com>
To: Gregory Price <gourry@...rry.net>
Cc: Alison Schofield <alison.schofield@...el.com>,
Vishal Verma <vishal.l.verma@...el.com>,
Ira Weiny <ira.weiny@...el.com>,
Dan Williams <dan.j.williams@...el.com>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
Dave Jiang <dave.jiang@...el.com>,
Davidlohr Bueso <dave@...olabs.net>,
Terry Bowman <terry.bowman@....com>, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Fabio M. De Francesco" <fabio.m.de.francesco@...ux.intel.com>
Subject: Re: [PATCH v1 25/29] cxl/amd: Enable Zen5 address translation using
ACPI PRMT
On 15.01.25 17:24:54, Gregory Price wrote:
> On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote:
> > On 09.01.25 17:25:13, Gregory Price wrote:
> > > > + dpa = (hpa & ~(granularity * ways - 1)) / ways
> > > > + + (hpa & (granularity - 1));
> > >
> > > I do not understand this chunk here, we seem to just be chopping the HPA
> > > in half to acquire the DPA. But the value passed in is already a DPA.
> > >
> > > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> > > = 0xfffffffff
> >
> > HPA is:
> >
> > HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff
> >
> > Should calculate for a 2-way config to:
> >
> > DPA = 0x1fffffffff.
> >
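For completeness, plugging that HPA into the formula from the patch
gives back the expected DPA (quick Python check, 2 ways, granularity
256):

>>> hpa, gran, ways = 0x3fffffffff, 256, 2
>>> dpa = (hpa & ~(gran * ways - 1)) // ways + (hpa & (gran - 1))
>>> "0x%x" % dpa
'0x1fffffffff'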
>
> I'm looking back through all of this again, and I'm not seeing how the
> current code is ever capable of ending up with hpa=0x3fffffffff.
>
> Taking an example endpoint in my setup:
>
> [decoder5.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000 <- 128GB (half the total 256GB interleaved range)
> 1 <- this decoder does not apply interleave
> 256
>
> translating up to a root decoder:
>
> [decoder0.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000 <- 256GB (total interleaved capacity)
> 2 <- interleaved 2 ways, this decoder applies interleave
> 256
>
>
> Now looking at the code that actually invokes the translation
>
> static struct device *
> cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled,
> struct cxl_region *cxlr)
> {
> struct cxl_decoder *cxld = &cxled->cxld;
> struct range hpa = cxld->hpa_range;
> ... snip ...
> if (cxl_port_calc_hpa(parent, cxld, &hpa))
> return NULL;
> }
>
> or
>
> static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled)
> {
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_port *parent, *iter = cxled_to_port(cxled);
> struct range hpa = cxled->cxld.hpa_range;
> struct cxl_decoder *cxld = &cxled->cxld;
>
> while (1) {
> if (is_cxl_endpoint(iter))
> cxld = &cxled->cxld;
> ...
> /* Translate HPA to the next upper memory domain. */
> if (cxl_port_calc_hpa(parent, cxld, &hpa)) {
> }
> ...
> }
> ....
> }
>
> Both of these will call cxl_port_calc_hpa with
> hpa = [0, 0x1fffffffff]
>
> Resulting in the following
>
> static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld,
> struct range *hpa_range)
> {
> ...
> /* Translate HPA to the next upper domain. */
> hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0
> hpa.end = port->to_hpa(cxld, hpa.end); <---- 0x1fffffffff
> }
>
> So we call:
> to_hpa(decoder5.0, hpa.end)
> to_hpa(decoder5.0, 0x1fffffffff)
> ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff
Depending on the endpoint's position, the PRM call returns the
following here (2-way interleaving with a granularity of 256):

	hpa = 0x1fffffff00 * 2 + pos * 0x100 + 0xff;

That is, it returns either 0x3ffffffeff (pos 0) or 0x3fffffffff
(pos 1). But the implementation in cxl_zen5_to_hpa() is not correct,
see below.
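Quick check of both positions:

>>> pos = 0
>>> "0x%x" % (0x1fffffff00 * 2 + pos * 0x100 + 0xff)
'0x3ffffffeff'
>>> pos = 1
>>> "0x%x" % (0x1fffffff00 * 2 + pos * 0x100 + 0xff)
'0x3fffffffff'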
>
>
> Should the to_hpa() code be taking a decoder length as an argument?
>
> to_hpa(decoder5.0, range_length, addr) ?
>
> This would actually let us calculate the end of region with the
> interleave ways and granularity:
>
> upper_hpa_base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC;
> upper_hpa_end = upper_hpa_base + (range_length * ways) - 1
>
> Without this, you don't have enough information to actually calculate
> the upper_hpa_end as you suggested. The result is that the math ends up
> chopping the endpoint decoder's range (128GB) in half (64GB).
>
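For what it's worth, with the decoder values from above that suggestion
does give the right end address, assuming an upper_hpa_base of
0xc050000000 (the root window base derived later in the walkthrough):

>>> upper_hpa_base, range_length, ways = 0xc050000000, 0x2000000000, 2
>>> "0x%x" % (upper_hpa_base + range_length * ways - 1)
'0x1004fffffff'

That is the last byte of decoder0.0's window (0xc050000000 +
0x4000000000 - 1).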
> Below I walk through the translation code with these inputs step by step.
>
> ~Gregory
>
> ------------------------------------------------------------------------
>
> Walking through the translation code by hand here:
>
> [decoder5.0]# cat start size
> 0x0
> 0x2000000000
>
> call: to_hpa(decoder5.0, 0x1fffffffff)
>
> --- code
> #define DPA_MAGIC 0xd20000
> base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
>
> len = spa - base;
> len2 = spa2 - base;
>
> /* offset = pos * granularity */
> if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> ... snip - not taken ...
> } else {
> ways = len / SZ_16K;
> offset = spa & (SZ_16K - 1);
> granularity = (len - len2 - SZ_256) / (ways - 1);
> pos = offset / granularity;
> }
> --- end code
>
> At this point in the code I have the following values:
>
> base = 0xc051a40100
> spa = 0xc051a48100
> spa2 = 0xc051a47f00
> len = 0x8000
> len2 = 0x7E00
> ways = 2
> offset = 256
> granularity = 256
> pos = 1
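Those detection values can be reproduced from the three PRM probes
(with SZ_16K = 0x4000 and SZ_256 = 0x100):

>>> base, spa, spa2 = 0xc051a40100, 0xc051a48100, 0xc051a47f00
>>> len_, len2 = spa - base, spa2 - base
>>> ways = len_ // 0x4000
>>> offset = spa & (0x4000 - 1)
>>> gran = (len_ - len2 - 0x100) // (ways - 1)
>>> ways, offset, gran, offset // gran
(2, 256, 256, 1)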
>
> --- code
> base = base - DPA_MAGIC * ways - pos * granularity;
> spa = base + hpa;
> --- end code
>
> base = 0xc051a40100 - 0xd20000 * 2 - 1 * 256
> = 0xc050000000
>
> spa = base + hpa
This is the wrong part; the interleaving parameters must be considered
for the SPA too, like:

	spa = base + hpa div gran * gran * ways + pos * gran + hpa mod gran

This caused a wrong range size, which went unnoticed due to an overlay
issue while testing. The detection of the interleaving config is still
correct.
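With the values from this walkthrough (base = 0xc050000000,
hpa = 0x1fffffffff, gran = 256, ways = 2, pos = 1) the corrected
formula gives:

>>> base, hpa, gran, ways, pos = 0xc050000000, 0x1fffffffff, 256, 2, 1
>>> "0x%x" % (base + hpa // gran * gran * ways + pos * gran + hpa % gran)
'0x1004fffffff'

That is the end of the 256GB root window at 0xc050000000, as expected.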
> = 0xc050000000 + 0x1fffffffff <-----
> = 0xe04fffffff |
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Isn't this just incorrect? Should it be something like:
> --------------------------------------------------------------------
>
> For the sake of completeness, here is my PRMT emulation code to show
> you that it is doing the translation as-expected.
>
> All this does is just force translation for a particular set of PCI
> devices based on the known static CFMW regions.
> @@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
> .out = &spa,
> };
>
> + cfmws_nr = cxl_get_num_cfmws();
> + if (!cfmws_nr)
> + goto try_prmt;
> +
> + /* HACK: Calculate the interleaved offset and find the matching base */
> + if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1)
> + goto try_prmt;
> +
> + dev = pci_dev->bus->number == 0xe1 ? 0 : 1;
> + offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff);
That looks correct to me; a quick test confirms it:
>>> dpa = 0x1fffffffff
>>> dev = 1
>>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff)
>>> "0x%x" % offset
'0x3fffffffff'
>>> dev = 0
>>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff)
>>> "0x%x" % offset
'0x3ffffffeff'
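Adding the CFMWS base from above, the emulation lands on the same SPA
as the corrected formula:

>>> "0x%x" % (0xc050000000 + 0x3fffffffff)
'0x1004fffffff'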
Thanks for the review and testing.
Will fix in v2.
-Robert
> +
> + for (idx = 0; idx < cfmws_nr; idx++) {
> + size = cxl_get_cfmws_size(idx);
> + if (offset < size) {
> + spa = cxl_get_cfmws_base(idx) + offset;
> + goto out;
> + }
> + offset -= size;
> + }
> + /* We failed, fall back to calling the PRMT */
> +try_prmt:
> +
> rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
> if (rc) {
> pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
> return ULLONG_MAX;
> }
>
> +out:
> pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);
>
> return spa;