Message-ID: <Z4g1tmSqsHmClPiw@gourry-fedora-PF4VCD3F>
Date: Wed, 15 Jan 2025 17:24:54 -0500
From: Gregory Price <gourry@...rry.net>
To: Robert Richter <rrichter@....com>
Cc: Alison Schofield <alison.schofield@...el.com>,
	Vishal Verma <vishal.l.verma@...el.com>,
	Ira Weiny <ira.weiny@...el.com>,
	Dan Williams <dan.j.williams@...el.com>,
	Jonathan Cameron <jonathan.cameron@...wei.com>,
	Dave Jiang <dave.jiang@...el.com>,
	Davidlohr Bueso <dave@...olabs.net>,
	Terry Bowman <terry.bowman@....com>, linux-cxl@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	"Fabio M. De Francesco" <fabio.m.de.francesco@...ux.intel.com>
Subject: Re: [PATCH v1 25/29] cxl/amd: Enable Zen5 address translation using
 ACPI PRMT

On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote:
> On 09.01.25 17:25:13, Gregory Price wrote:
> > > +	dpa = (hpa & ~(granularity * ways - 1)) / ways
> > > +		+ (hpa & (granularity - 1));
> > 
> > I do not understand this chunk here, we seem to just be chopping the HPA
> > in half to acquire the DPA.  But the value passed in is already a DPA.
> > 
> > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> >     =  0xfffffffff
> 
> HPA is:
> 
>  HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff
> 
> Should calculate for a 2-way config to:
> 
>  DPA = 0x1fffffffff.
> 

I'm looking back through all of this again, and I'm not seeing how the
current code is ever capable of ending up with hpa=0x3fffffffff.

Taking an example endpoint in my setup:

[decoder5.0]# cat start size interleave_ways interleave_granularity
0x0
0x2000000000   <- 128GB (half the total 256GB interleaved range)
1              <- this decoder does not apply interleave
256

translating up to a root decoder:

[decoder0.0]# cat start size interleave_ways interleave_granularity
0xc050000000
0x4000000000   <- 256GB (total interleaved capacity)
2              <- interleaved 2 ways, this decoder applies interleave
256


Now looking at the code that actually invokes the translation

static struct device *
cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled,
                      struct cxl_region *cxlr)
{
        struct cxl_decoder *cxld = &cxled->cxld;
        struct range hpa = cxld->hpa_range;
	... snip ...
	if (cxl_port_calc_hpa(parent, cxld, &hpa))
		return NULL;
}

or

static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled)
{
        struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
        struct cxl_port *parent, *iter = cxled_to_port(cxled);
        struct range hpa = cxled->cxld.hpa_range;
        struct cxl_decoder *cxld = &cxled->cxld;

        while (1) {
                if (is_cxl_endpoint(iter))
                        cxld = &cxled->cxld;
		...
                /* Translate HPA to the next upper memory domain. */
                if (cxl_port_calc_hpa(parent, cxld, &hpa)) {
		}
		...
	}
	....
}

Both of these will call cxl_port_calc_hpa with
	hpa = [0, 0x1fffffffff]

Resulting in the following

static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld,
                             struct range *hpa_range)
{
...
        /* Translate HPA to the next upper domain. */
        hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0
        hpa.end = port->to_hpa(cxld, hpa.end);     <---- 0x1fffffffff
}

So we call:
    to_hpa(decoder5.0, hpa.end)
    to_hpa(decoder5.0, 0x1fffffffff)
                       ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff


Should the to_hpa() code be taking a decoder length as an argument?

to_hpa(decoder5.0, range_length, addr) ?

This would actually let us calculate the end of the region from the
interleave ways and granularity:

upper_hpa_base  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC;
upper_hpa_end   = upper_hpa_base + (range_length * ways) - 1;

Without this, you don't have enough information to actually calculate
the upper_hpa_end as you suggested.  As a result, the math ends up
chopping the endpoint decoder's range (128GB) in half (64GB).

Below I walk through the translation code with these inputs step by step.

~Gregory

------------------------------------------------------------------------

Walking through the translation code by hand here:

[decoder5.0]# cat start size
0x0
0x2000000000

call:  to_hpa(decoder5.0, 0x1fffffffff)

--- code
    #define DPA_MAGIC       0xd20000
    base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
    spa  = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
    spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);

    len = spa - base;
    len2 = spa2 - base;

    /* offset = pos * granularity */
    if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
        ... snip - not taken ...
    } else {
        ways = len / SZ_16K;
        offset = spa & (SZ_16K - 1);
        granularity = (len - len2 - SZ_256) / (ways - 1);
        pos = offset / granularity;
    }
--- end code

At this point in the code I have the following values:

base   = 0xc051a40100
spa    = 0xc051a48100
spa2   = 0xc051a47f00
len    = 0x8000
len2   = 0x7E00
ways   = 2
offset = 256
granularity = 256
pos    = 1

--- code
        base = base - DPA_MAGIC * ways - pos * granularity;
        spa = base + hpa;
--- end code

base   = 0xc051a40100 - 0xd20000 * 2 - 1 * 256
       = 0xc050000000

spa    = base + hpa
       = 0xc050000000 + 0x1fffffffff  <-----
       = 0xe04fffffff                      |

         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	 Isn't this just incorrect? Should it be something like:

	 base + ((hpa & ~(granularity - 1)) * pos)
	      + (hpa & (ways * granularity - 1))

--- code
        dpa = (hpa & ~(granularity * ways - 1)) / ways
                + (hpa & (granularity - 1));
--- end code

dpa = (hpa & ~(granularity * ways - 1)) / ways + (hpa & (granularity - 1));
       ^^^                                ^^^
       0x1fffffffff                     /  2

       We are dividing the endpoint decoder HPA by interleave ways.
       This is how we end up with the truncated size.

 
      Full math
dpa   = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
      = (0x1fffffffff & 0xF...E00) / 2 + (0x1fffffffff & 0xFF)
      = 0x1ffffffe00 / 2 + 0xff
      = 0xfffffff00 + 0xff
      = 0xfffffffff

--- code
        offset = hpa & (granularity * ways - 1) & ~(granularity - 1);
        offset -= pos * granularity;
---
offset = (0x1fffffffff & (256 * 2 - 1) & ~(256 - 1)) - (1 * 256)
       = (0x1fffffffff & 0x1ff & 0xffffffffffffff00) - 0x100
       = (0x1ff & 0xffffffffffffff00) - 0x100
       = 0x100 - 0x100
       = 0x0


Final Result:
--- code
        spa2 = prm_cxl_dpa_spa(pci_dev, dpa) + offset;
---
     = prm_cxl_dpa_spa(pci_dev, 0xfffffffff) + 0
     = 0xe04fffffff + 0
       ^^^ note that in this case my emulation gives the exact address;
           you seem to suggest that I'll get the closest granularity.

	   So if not for my emulation, the offset calculation is wrong?


--------------------------------------------------------------------

For the sake of completeness, here is my PRMT emulation code to show
you that it is doing the translation as expected.

All this does is just force translation for a particular set of PCI
devices based on the known static CFMW regions. 

Note for onlookers: This patch is extremely dangerous and only applies
to my specific system / interleave configuration.


diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
index ac74b6f6dad7..8ccf2d5638ed 100644
--- a/drivers/cxl/acpi.c
+++ b/drivers/cxl/acpi.c
@@ -432,6 +432,31 @@ static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
        return 0;
 }

+#define MAX_CFMWS (32)
+static unsigned int num_cfmws;
+static unsigned long cfmws_bases[MAX_CFMWS];
+static unsigned long cfmws_sizes[MAX_CFMWS];
+unsigned int cxl_get_num_cfmws(void)
+{
+       return num_cfmws;
+}
+
+unsigned long cxl_get_cfmws_base(unsigned int idx)
+{
+       if (idx >= MAX_CFMWS || idx >= num_cfmws)
+               return ~0;
+
+       return cfmws_bases[idx];
+}
+
+unsigned long cxl_get_cfmws_size(unsigned int idx)
+{
+       if (idx >= MAX_CFMWS || idx >= num_cfmws)
+               return ~0;
+
+       return cfmws_sizes[idx];
+}
+
 static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
                           const unsigned long end)
 {
@@ -446,10 +471,16 @@ static int cxl_parse_cfmws(union acpi_subtable_headers *header, void *arg,
                        "Failed to add decode range: [%#llx - %#llx] (%d)\n",
                        cfmws->base_hpa,
                        cfmws->base_hpa + cfmws->window_size - 1, rc);
-       else
+       else {
                dev_dbg(dev, "decode range: node: %d range [%#llx - %#llx]\n",
                        phys_to_target_node(cfmws->base_hpa), cfmws->base_hpa,
                        cfmws->base_hpa + cfmws->window_size - 1);
+               if (num_cfmws < MAX_CFMWS) {
+                       cfmws_bases[num_cfmws] = cfmws->base_hpa;
+                       cfmws_sizes[num_cfmws] = cfmws->window_size;
+                       num_cfmws++;
+               }
+       }

        /* never fail cxl_acpi load for a single window failure */
        return 0;
diff --git a/drivers/cxl/core/amd.c b/drivers/cxl/core/amd.c
index 553b7d0caefd..08a5bfb9fbd6 100644
--- a/drivers/cxl/core/amd.c
+++ b/drivers/cxl/core/amd.c
@@ -64,6 +64,10 @@ struct prm_cxl_dpa_spa_data {
 static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
 {
        struct prm_cxl_dpa_spa_data data;
+       unsigned int cfmws_nr;
+       unsigned int idx;
+       unsigned long offset, size;
+       unsigned int dev;
        u64 spa;
        int rc;

@@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
                .out     = &spa,
        };

+       cfmws_nr = cxl_get_num_cfmws();
+       if (!cfmws_nr)
+               goto try_prmt;
+
+       /* HACK: Calculate the interleaved offset and find the matching base */
+       if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1)
+               goto try_prmt;
+
+       dev = pci_dev->bus->number == 0xe1 ? 0 : 1;
+       offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff);
+
+       for (idx = 0; idx < cfmws_nr; idx++) {
+               size = cxl_get_cfmws_size(idx);
+               if (offset < size) {
+                       spa = cxl_get_cfmws_base(idx) + offset;
+                       goto out;
+               }
+               offset -= size;
+       }
+       /* We failed, fall back to calling the PRMT */
+try_prmt:
+
        rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
        if (rc) {
                pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
                return ULLONG_MAX;
        }

+out:
        pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);

        return spa;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f93eb464fc97..48cbfa68d739 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -919,6 +919,10 @@ bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);

 struct cxl_mem_command *cxl_find_feature_command(u16 opcode);

+unsigned int cxl_get_num_cfmws(void);
+unsigned long cxl_get_cfmws_base(unsigned int idx);
+unsigned long cxl_get_cfmws_size(unsigned int idx);
+
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
  * of these symbols in tools/testing/cxl/.

