Message-ID: <Z4pj1S4H-MGf-nFW@rric.localdomain>
Date: Fri, 17 Jan 2025 15:06:13 +0100
From: Robert Richter <rrichter@....com>
To: Gregory Price <gourry@...rry.net>
Cc: Alison Schofield <alison.schofield@...el.com>,
Vishal Verma <vishal.l.verma@...el.com>,
Ira Weiny <ira.weiny@...el.com>,
Dan Williams <dan.j.williams@...el.com>,
Jonathan Cameron <jonathan.cameron@...wei.com>,
Dave Jiang <dave.jiang@...el.com>,
Davidlohr Bueso <dave@...olabs.net>,
Terry Bowman <terry.bowman@....com>, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Fabio M. De Francesco" <fabio.m.de.francesco@...ux.intel.com>
Subject: Re: [PATCH v1 25/29] cxl/amd: Enable Zen5 address translation using
ACPI PRMT
On 15.01.25 17:24:54, Gregory Price wrote:
> On Wed, Jan 15, 2025 at 04:05:16PM +0100, Robert Richter wrote:
> > On 09.01.25 17:25:13, Gregory Price wrote:
> > > > + dpa = (hpa & ~(granularity * ways - 1)) / ways
> > > > + + (hpa & (granularity - 1));
> > >
> > > I do not understand this chunk here, we seem to just be chopping the HPA
> > > in half to acquire the DPA. But the value passed in is already a DPA.
> > >
> > > dpa = (0x1fffffffff & ~(256 * 2 - 1)) / 2 + (0x1fffffffff & (256 - 1))
> > > = 0xfffffffff
> >
> > HPA is:
> >
> > HPA = 2 * 0x2000000000 - 1 = 0x3fffffffff
> >
> > Should calculate for a 2-way config to:
> >
> > DPA = 0x1fffffffff.
> >
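For completeness, plugging that HPA into the formula from the patch
gives back the expected DPA (quick Python check, 2 ways, granularity
256):

>>> hpa, gran, ways = 0x3fffffffff, 256, 2
>>> dpa = (hpa & ~(gran * ways - 1)) // ways + (hpa & (gran - 1))
>>> "0x%x" % dpa
'0x1fffffffff'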
>
> I'm looking back through all of this again, and I'm not seeing how the
> current code is ever capable of ending up with hpa=0x3fffffffff.
>
> Taking an example endpoint in my setup:
>
> [decoder5.0]# cat start size interleave_ways interleave_granularity
> 0x0
> 0x2000000000 <- 128GB (half the total 256GB interleaved range)
> 1 <- this decoder does not apply interleave
> 256
>
> translating up to a root decoder:
>
> [decoder0.0]# cat start size interleave_ways interleave_granularity
> 0xc050000000
> 0x4000000000 <- 256GB (total interleaved capacity)
> 2 <- interleaved 2 ways, this decoder applies interleave
> 256
>
>
> Now looking at the code that actually invokes the translation
>
> static struct device *
> cxl_find_auto_decoder(struct cxl_port *port, struct cxl_endpoint_decoder *cxled,
> struct cxl_region *cxlr)
> {
> struct cxl_decoder *cxld = &cxled->cxld;
> struct range hpa = cxld->hpa_range;
> ... snip ...
> if (cxl_port_calc_hpa(parent, cxld, &hpa))
> return NULL;
> }
>
> or
>
> static int cxl_endpoint_initialize(struct cxl_endpoint_decoder *cxled)
> {
> struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
> struct cxl_port *parent, *iter = cxled_to_port(cxled);
> struct range hpa = cxled->cxld.hpa_range;
> struct cxl_decoder *cxld = &cxled->cxld;
>
> while (1) {
> if (is_cxl_endpoint(iter))
> cxld = &cxled->cxld;
> ...
> /* Translate HPA to the next upper memory domain. */
> if (cxl_port_calc_hpa(parent, cxld, &hpa)) {
> }
> ...
> }
> ....
> }
>
> Both of these will call cxl_port_calc_hpa with
> hpa = [0, 0x1fffffffff]
>
> Resulting in the following
>
> static int cxl_port_calc_hpa(struct cxl_port *port, struct cxl_decoder *cxld,
> struct range *hpa_range)
> {
> ...
> /* Translate HPA to the next upper domain. */
> hpa.start = port->to_hpa(cxld, hpa.start); <---- 0x0
> hpa.end = port->to_hpa(cxld, hpa.end); <---- 0x1fffffffff
> }
>
> So we call:
> to_hpa(decoder5.0, hpa.end)
> to_hpa(decoder5.0, 0x1fffffffff)
> ^^^^^^^^^^^^^ --- hpa will never be 0x3fffffffff
Depending on the endpoint's position, the PRM call returns the
following here (2-way interleaving with a granularity of 256):

	hpa = 0x1fffffff00 * 2 + pos * 0x100 + 0xff;

That is, it returns either 0x3ffffffeff (pos 0) or 0x3fffffffff
(pos 1). But the implementation in cxl_zen5_to_hpa() is not correct,
see below.
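Quick check of both positions:

>>> pos = 0
>>> "0x%x" % (0x1fffffff00 * 2 + pos * 0x100 + 0xff)
'0x3ffffffeff'
>>> pos = 1
>>> "0x%x" % (0x1fffffff00 * 2 + pos * 0x100 + 0xff)
'0x3fffffffff'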
>
>
> Should the to_hpa() code be taking a decoder length as an argument?
>
> to_hpa(decoder5.0, range_length, addr) ?
>
> This would actually let us calculate the end of region with the
> interleave ways and granularity:
>
> upper_hpa_base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC) - DPA_MAGIC;
> upper_hpa_end = upper_hpa_base + (range_length * ways) - 1
>
> Without this, you don't have enough information to actually calculate
> the upper_hpa_end as you suggested. The result is that the math ends up
> chopping the endpoint decoder's range (128GB) in half (64GB).
>
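For what it's worth, with the decoder values from above that suggestion
does give the right end address, assuming an upper_hpa_base of
0xc050000000 (the root window base derived later in the walkthrough):

>>> upper_hpa_base, range_length, ways = 0xc050000000, 0x2000000000, 2
>>> "0x%x" % (upper_hpa_base + range_length * ways - 1)
'0x1004fffffff'

That is the last byte of decoder0.0's window (0xc050000000 +
0x4000000000 - 1).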
> Below I walk through the translation code with these inputs step by step.
>
> ~Gregory
>
> ------------------------------------------------------------------------
>
> Walking through the translation code by hand here:
>
> [decoder5.0]# cat start size
> 0x0
> 0x2000000000
>
> call: to_hpa(decoder5.0, 0x1fffffffff)
>
> --- code
> #define DPA_MAGIC 0xd20000
> base = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC);
> spa = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K);
> spa2 = prm_cxl_dpa_spa(pci_dev, DPA_MAGIC + SZ_16K - SZ_256);
>
> len = spa - base;
> len2 = spa2 - base;
>
> /* offset = pos * granularity */
> if (len == SZ_16K && len2 == SZ_16K - SZ_256) {
> ... snip - not taken ...
> } else {
> ways = len / SZ_16K;
> offset = spa & (SZ_16K - 1);
> granularity = (len - len2 - SZ_256) / (ways - 1);
> pos = offset / granularity;
> }
> --- end code
>
> At this point in the code I have the following values:
>
> base = 0xc051a40100
> spa = 0xc051a48100
> spa2 = 0xc051a47f00
> len = 0x8000
> len2 = 0x7E00
> ways = 2
> offset = 256
> granularity = 256
> pos = 1
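Those detection values can be reproduced from the three PRM probes
(with SZ_16K = 0x4000 and SZ_256 = 0x100):

>>> base, spa, spa2 = 0xc051a40100, 0xc051a48100, 0xc051a47f00
>>> len_, len2 = spa - base, spa2 - base
>>> ways = len_ // 0x4000
>>> offset = spa & (0x4000 - 1)
>>> gran = (len_ - len2 - 0x100) // (ways - 1)
>>> ways, offset, gran, offset // gran
(2, 256, 256, 1)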
>
> --- code
> base = base - DPA_MAGIC * ways - pos * granularity;
> spa = base + hpa;
> --- end code
>
> base = 0xc051a40100 - 0xd20000 * 2 - 1 * 256
> = 0xc050000000
>
> spa = base + hpa
This is the wrong part; the interleaving parameters must be considered
for the SPA too, like:

	spa = base + hpa div gran * gran * ways + pos * gran + hpa mod gran

This caused a wrong range size, which went unnoticed due to an overlay
issue while testing. The detection of the interleaving config is still
correct.
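With the values from this walkthrough (base = 0xc050000000,
hpa = 0x1fffffffff, gran = 256, ways = 2, pos = 1) the corrected
formula gives:

>>> base, hpa, gran, ways, pos = 0xc050000000, 0x1fffffffff, 256, 2, 1
>>> "0x%x" % (base + hpa // gran * gran * ways + pos * gran + hpa % gran)
'0x1004fffffff'

That is the end of the 256GB root window at 0xc050000000, as expected.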
> = 0xc050000000 + 0x1fffffffff <-----
> = 0xe04fffffff |
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Isn't this just incorrect? Should it be something like:
> --------------------------------------------------------------------
>
> For the sake of completeness, here is my PRMT emulation code to show
> you that it is doing the translation as-expected.
>
> All this does is just force translation for a particular set of PCI
> devices based on the known static CFMW regions.
> @@ -75,12 +79,35 @@ static u64 prm_cxl_dpa_spa(struct pci_dev *pci_dev, u64 dpa)
> .out = &spa,
> };
>
> + cfmws_nr = cxl_get_num_cfmws();
> + if (!cfmws_nr)
> + goto try_prmt;
> +
> + /* HACK: Calculate the interleaved offset and find the matching base */
> + if (pci_dev->bus->number != 0xe1 && pci_dev->bus->number != 0xc1)
> + goto try_prmt;
> +
> + dev = pci_dev->bus->number == 0xe1 ? 0 : 1;
> + offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff);
That looks correct to me; a quick test confirms it:
>>> dpa = 0x1fffffffff
>>> dev = 1
>>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff)
>>> "0x%x" % offset
'0x3fffffffff'
>>> dev = 0
>>> offset = (0x100 * (((dpa >> 8) * 2) + dev)) + (dpa & 0xff)
>>> "0x%x" % offset
'0x3ffffffeff'
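Adding the CFMWS base from above, the emulation lands on the same SPA
as the corrected formula:

>>> "0x%x" % (0xc050000000 + 0x3fffffffff)
'0x1004fffffff'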
Thanks for the review and testing.
Will fix in v2.
-Robert
> +
> + for (idx = 0; idx < cfmws_nr; idx++) {
> + size = cxl_get_cfmws_size(idx);
> + if (offset < size) {
> + spa = cxl_get_cfmws_base(idx) + offset;
> + goto out;
> + }
> + offset -= size;
> + }
> + /* We failed, fall back to calling the PRMT */
> +try_prmt:
> +
> rc = acpi_call_prm_handler(prm_cxl_dpa_spa_guid, &data);
> if (rc) {
> pci_dbg(pci_dev, "failed to get SPA for %#llx: %d\n", dpa, rc);
> return ULLONG_MAX;
> }
>
> +out:
> pci_dbg(pci_dev, "PRM address translation: DPA -> SPA: %#llx -> %#llx\n", dpa, spa);
>
> return spa;