Message-ID: <20250313172004.00002236@huawei.com>
Date: Thu, 13 Mar 2025 17:20:04 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Gregory Price <gourry@...rry.net>
CC: <lsf-pc@...ts.linux-foundation.org>, <linux-mm@...ck.org>,
<linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA
Flexibility
On Fri, 7 Mar 2025 22:23:05 -0500
Gregory Price <gourry@...rry.net> wrote:
> In the last section we discussed how the CEDT CFMWS and SRAT Memory
> Affinity structures are used by Linux to "create" NUMA nodes (or at
> least mark them as possible). However, the examples I used suggested
> that there was a 1-to-1 relationship between CFMWS and devices or
> host bridges.
>
> This is not true - in fact, a CFMWS is simply a carve-out of System
> Physical Address space which may be used to map any number of endpoint
> devices behind the associated Host Bridge(s).
>
> The limiting factor is what your platform vendor BIOS supports.
>
> This section describes a handful of *possible* configurations, what NUMA
> structure they will create, and what flexibility this provides.
>
> All of these CFMWS configurations are made up, and may or may not exist
> in real machines. They are a conceptual teaching tool, not a roadmap.
>
> (When discussing interleave in this section, please note that I am
> intentionally omitting details about decoder programming, as this
> will be covered later.)
>
>
> -------------------------------
> One 2GB Device, Multiple CFMWS.
> -------------------------------
> Let's imagine we have one 2GB device attached to a host bridge.
>
> In this example, the device hosts 2GB of persistent memory - but we
> might want the flexibility to map capacity as volatile or persistent.
Fairly sure we block persistent regions in a volatile CFMWS in the kernel.
Does any BIOS actually do this?
You might have a variable-partition device, but I thought that, in the
kernel at least, we decided no one was building anything that crazy?
Maybe a QoS split is a better example to motivate one range, two places?
>
> The platform vendor may decide that they want to reserve two entirely
> separate system physical address ranges to represent the capacity.
>
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000200000000 <- Memory Region
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 000A <- Bit(3) - Persistent
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> You might have a CEDT with two CFMWS as above, where the base addresses
> are `0x100000000` and `0x200000000` respectively, but whose window sizes
> each cover the entire 2GB capacity of the device. This affords the user
> flexibility in where the memory is mapped, depending on whether it is
> mapped as volatile or persistent, while keeping the two SPA ranges
> separate.
>
> This is allowed because the endpoint decoders commit device physical
> address space *in order*, meaning no region of device physical address
> space can be mapped to more than one system physical address.
>
> i.e.: DPA(0) can map to either SPA(0x100000000) or SPA(0x200000000), but not both.
>
> (See Section 2a - decoder programming).
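>
> As a rough illustration of that point (a toy model, not the kernel's
> actual decode path), a single committed, non-interleaved endpoint
> decoder is just a linear offset from whichever window the region was
> created in - so a given DPA resolves to exactly one SPA:
>
> ```
> /* Toy model of a committed, non-interleaved HDM decoder (illustrative only). */
> #include <stdint.h>
> #include <stdio.h>
>
> struct toy_decoder {
>         uint64_t spa_base;   /* base of the CFMWS window the region lives in */
>         uint64_t dpa_skip;   /* DPA already committed by earlier decoders */
>         uint64_t size;       /* bytes mapped by this decoder */
> };
>
> /* No interleave: SPA is a linear offset from the window base. */
> static uint64_t dpa_to_spa(const struct toy_decoder *d, uint64_t dpa)
> {
>         return d->spa_base + (dpa - d->dpa_skip);
> }
>
> int main(void)
> {
>         /* The same 2GB of DPA placed in the volatile window... */
>         struct toy_decoder vol  = { 0x100000000ULL, 0, 1ULL << 31 };
>         /* ...or in the persistent window - never both at once. */
>         struct toy_decoder pmem = { 0x200000000ULL, 0, 1ULL << 31 };
>
>         printf("DPA 0 via volatile window   -> SPA %#llx\n",
>                (unsigned long long)dpa_to_spa(&vol, 0));
>         printf("DPA 0 via persistent window -> SPA %#llx\n",
>                (unsigned long long)dpa_to_spa(&pmem, 0));
>         return 0;
> }
> ```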
>
> -------------------------------------------------------------
> Two Devices On One Host Bridge - With and Without Interleave.
> -------------------------------------------------------------
> What if we wanted some capacity on each endpoint hosted on its own NUMA
> node, and wanted to interleave a portion of each device's capacity?
If anyone hits the lock on commit (i.e. an annoying BIOS), the ordering
checks on HPA kick in here and restrict flexibility a lot
(assuming I understand them correctly, that is).
This is a good illustration of why we should at some point revisit
multiple NUMA nodes per CFMWS. We have to burn SPA space just
to get nodes. From a spec point of view, all that is needed here
is a single CFMWS.
>
> We could produce the following CFMWS configuration.
> ```
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000200000000 <- Memory Region 2
> Window size : 0000000080000000 <- 2GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000300000000 <- Memory Region 3
> Window size : 0000000100000000 <- 4GB
> Interleave Members (2^n) : 00
> Interleave Arithmetic : 00
> Reserved : 0000
> Granularity : 00000000
> Restrictions : 0006 <- Bit(2) - Volatile
> QtgId : 0001
> First Target : 00000007 <- Host Bridge _UID
>
> NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
> ```
>
> In this configuration, we could still do what we did with the prior
> configuration (2 CFMWS), but we could also use the third root decoder
> to simplify the decoder programming for interleave.
>
> Since the third region has sufficient capacity (4GB) to cover both
> devices (2GB each), we can associate the entire capacity of both
> devices with that region.
>
> We'll discuss this decoder structure in-depth in Section 4.
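>
> In the meantime, as a back-of-the-envelope sketch (hypothetical values,
> not a real configuration), this is roughly the address math a 2-way
> interleave across the 4GB window performs when routing a system physical
> address to one of the two endpoints:
>
> ```
> /* Toy 2-way interleave decode math (illustrative only, not kernel code). */
> #include <stdint.h>
> #include <stdio.h>
>
> #define WINDOW_BASE 0x300000000ULL  /* third CFMWS base from the example */
> #define IW          2ULL            /* interleave ways */
> #define IG          256ULL          /* interleave granularity (assumed, bytes) */
>
> /* Map an SPA in the interleaved window to (endpoint index, DPA offset). */
> static void spa_to_endpoint(uint64_t spa, unsigned int *endpoint, uint64_t *dpa)
> {
>         uint64_t offset = spa - WINDOW_BASE;
>
>         *endpoint = (offset / IG) % IW;                   /* which device */
>         *dpa = (offset / (IG * IW)) * IG + (offset % IG); /* offset on it */
> }
>
> int main(void)
> {
>         uint64_t spas[] = { WINDOW_BASE, WINDOW_BASE + 256, WINDOW_BASE + 512 };
>
>         for (int i = 0; i < 3; i++) {
>                 unsigned int endpoint;
>                 uint64_t dpa;
>
>                 spa_to_endpoint(spas[i], &endpoint, &dpa);
>                 printf("SPA %#llx -> endpoint %u, DPA offset %#llx\n",
>                        (unsigned long long)spas[i], endpoint,
>                        (unsigned long long)dpa);
>         }
>         return 0;
> }
> ```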
>