Message-ID: <570b18f4-3790-4e57-8d80-a5301e5d8af2@fujitsu.com>
Date: Fri, 7 Mar 2025 00:57:18 +0000
From: "Zhijian Li (Fujitsu)" <lizhijian@...itsu.com>
To: Gregory Price <gourry@...rry.net>, "lsf-pc@...ts.linux-foundation.org"
<lsf-pc@...ts.linux-foundation.org>
CC: "linux-mm@...ck.org" <linux-mm@...ck.org>, "linux-cxl@...r.kernel.org"
<linux-cxl@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: Re: CXL Boot to Bash - Section 2a (Drivers): CXL Decoder Programming
Hey Gregory,
Thank you so much for your detailed introduction to the entire CXL
software ecosystem, which I have thoroughly read. You are truly excellent.
On 07/03/2025 07:56, Gregory Price wrote:
> I decided to dig into decoder programming as an addendum to the
> Driver section - where I said I *wouldn't* do this. It's important
> though, when discussing interleave. So alas, we should at least have
> some base understanding of what the heck decoders are actually doing.
>
> This is not a regurgitation of the spec, you can think of it closer to
> a "Theory of Operation" or whatever. I will show discrete examples of
> how ACPI tables, system memory map, and decoders relate.
>
> ----------------------------------------
> Definitions: Addresses and HDM Decoders.
> ----------------------------------------
>
> An HDM Decoder can be thought of, in shorthand, as a "routing" mechanism,
> where a Physical Address is used to determine one of:
>
> 1) Fabric routing (i.e. which pipe to send a request down)
> 2) Address translation (Host to Device Physical Address)
>
> In section 2, I referenced a simple device-to-decoder mapping:
>
> root      --- decoder0.0 -- Root Port Decoder
>  |               |
> port1     --- decoder1.0 -- Host Bridge Decoder
>  |               |
> endpoint0 --- decoder2.0 -- Endpoint Decoder
Here, I noticed something that differs slightly from my understanding:
"root --- decoder0.0 -- Root Port Decoder."
From the perspective of the Linux driver, decoder0.0 usually refers to
the decoder associated with a CFMWS. Moreover, according to Spec r3.1,
Table 8-22 (CXL HDM Decoder Capability), a CXL Root Port (denoted R in
the table) is not permitted to implement an HDM decoder.
If I have misunderstood something, please let me know.
Thanks
Zhijian
>
> Barring any special innovations (cough) - endpoint decoders should
> be the only decoders that actually "translate" addresses - at least
> for basic volatile memory devices.
>
> All other decoders (Root, Host Bridge, Switch, etc) should simply
> forward DMA requests with the original Physical Address intact to
> the correct downstream device.
>
> For extra confusion, there are now 3 "Physical Address" domains
>
> System Physical Address (SPA)
> The physical address of some location according to Linux.
> This is the address you see in the system memory map.
>
> Host Physical Address (HPA)
> An abstract address used by decoders (I'll explain later)
>
> Device Physical Address (DPA)
> A device-local physical address (e.g. if a device has 1TB of
> memory, its DPA range might be 0-0x10000000000)
>
>
> ----------------------------
> DMA Routing (No Interleave).
> ----------------------------
> Ok, we have some decoders and confusing physical address definitions,
> how does a DMA actually go from processor to DRAM via these decoders?
>
> Let's consider our simple fabric with 256MB of memory at SPA base 4GB.
>
> Let's assume this was all set up statically by BIOS. We'd have the
> following CEDT CFMWS (See Section 0 - ACPI) and decoder programming.
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000110000000] usable <- SPA
>
> Decoders
> root      --- decoder0.0 -- range=[0x100000000, 0x110000000]
>  |               |
> port1     --- decoder1.0 -- range=[0x100000000, 0x110000000]
>  |               |
> endpoint0 --- decoder2.0 -- range=[0x100000000, 0x110000000]
> ```
>
> When the CPU accesses an address in this range, the memory controller
> will send the request down the CXL fabric. The following steps occur:
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
> to host bridge associated with that address (port 1)
>
> 2) host bridge decoder identifies HPA(0x101234567) is valid and
> forwards to endpoint associated with that address (endpoint0)
>
> 3) endpoint decoder identifies HPA(0x101234567) is valid and
> translates that address to DPA(0x01234567).
>
> 4) The endpoint device uses DPA(0x01234567) to fulfill the request.
>
> In this scenario, our endpoint has a DPA range of (0, 0x10000000),
> but technically DPA address space is device-defined and may be sparse.
>
> As you can see, the root and host bridge decoders simply "route" the
> access to the next appropriate hop, while the endpoint decoder actually
> does the translation work.
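
The walk above can be modelled in a few lines of Python. This is only a
sketch, not driver code: the Decoder class, the walk() helper, and the
choice to treat each range as half-open [base, base + size) are all
assumptions made for illustration, as is the endpoint's DPA base of 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decoder:
    name: str
    base: int                       # first HPA the decoder claims
    size: int                       # length of the claimed range in bytes
    dpa_base: Optional[int] = None  # set only on the endpoint decoder

    def claims(self, hpa: int) -> bool:
        # Half-open range check (an assumption about how to read the ranges).
        return self.base <= hpa < self.base + self.size


def walk(chain, hpa):
    """Pass hpa down the chain; return the DPA from the endpoint decoder."""
    for dec in chain:
        if not dec.claims(hpa):
            raise ValueError(f"{dec.name} does not decode {hpa:#x}")
        if dec.dpa_base is not None:
            # Only the endpoint decoder rewrites the address.
            return dec.dpa_base + (hpa - dec.base)
    raise ValueError("chain has no endpoint decoder")


WINDOW = (0x100000000, 0x10000000)  # base 4GB, size 256MB
chain = [
    Decoder("decoder0.0", *WINDOW),              # root: route only
    Decoder("decoder1.0", *WINDOW),              # host bridge: route only
    Decoder("decoder2.0", *WINDOW, dpa_base=0),  # endpoint: translate
]

dpa = walk(chain, 0x101234567)
print(hex(dpa))  # prints 0x1234567
```

The root and host bridge entries never touch the address; they only
decide whether the request continues downstream.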
>
>
> What if instead, we had two 256MB endpoints on the same host bridge?
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region
> Window size : 0000000020000000 <- 512MB
> Interleave Members (2^n) : 00 <- Not interleaved
>
> Memory Map:
> [mem 0x0000000100000000-0x0000000120000000] usable <- SPA
>
> Decoders
>                        decoder0.0
>             range=[0x100000000, 0x120000000]
>                            |
>                        decoder1.0
>             range=[0x100000000, 0x120000000]
>                   /                  \
>          decoder2.0                    decoder3.0
> range=[0x100000000, 0x110000000] range=[0x110000000, 0x120000000]
> ```
>
> We still only have a single root port and host bridge decoder that
> covers the entire 512MB range, but there are now 2 differently
> programmed endpoint decoders.
>
> This makes the routing a little more obvious. The root and host bridge
> decoders cover the entire SPA space (512MB), while the endpoint decoders
> only cover their own address space (256MB).
>
> The host bridge in this case is responsible for routing the request to
> the correct endpoint.
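
A minimal sketch of that selection step, using the two endpoint ranges
from the example. The TARGETS table and pick_endpoint() are hypothetical
names; the half-open range check and the assumption that each endpoint's
DPA space starts at 0 are mine, not something the example states.

```python
# Hypothetical target table for the host bridge: (base, end, endpoint decoder).
# Ranges are treated as half-open [base, end); that is an assumption.
TARGETS = [
    (0x100000000, 0x110000000, "decoder2.0"),
    (0x110000000, 0x120000000, "decoder3.0"),
]


def pick_endpoint(hpa):
    """Return (endpoint decoder name, DPA), or None if nothing claims hpa."""
    for base, end, name in TARGETS:
        if base <= hpa < end:
            return name, hpa - base  # assumes each endpoint's DPA starts at 0
    return None


print(pick_endpoint(0x101234567))
print(pick_endpoint(0x111234567))
```

Both addresses land at the same DPA offset, but on different devices.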
>
>
> What if we had 2 endpoints, each attached to their own host bridges?
> In this case we'd have 2 root ports and host bridge decoders.
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000110000000 <- Memory Region 2
> Window size : 0000000010000000 <- 256MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000006 <- Host Bridge _UID
>
> Memory Map - this may or may not be collapsed depending on Linux arch
> [mem 0x0000000100000000-0x0000000110000000] usable <- System Phys Address
> [mem 0x0000000110000000-0x0000000120000000] usable <- System Phys Address
>
> Decoders
> decoder0.0                   decoder1.0                  - roots
> [0x100000000, 0x110000000]   [0x110000000, 0x120000000]
>          |                            |
> decoder2.0                   decoder3.0                  - host bridges
> [0x100000000, 0x110000000]   [0x110000000, 0x120000000]
>          |                            |
> decoder4.0                   decoder5.0                  - endpoints
> [0x100000000, 0x110000000]   [0x110000000, 0x120000000]
> ```
>
> This scenario looks functionally the same as the first - with two distinct,
> non-overlapping sets of decoders (any given SPA may only be serviced by
> one device). The platform memory controller is responsible for routing
> the address to the correct root decoder.
>
> In Section 4 (Interleave) we'll discuss a bit how the interleave is
> accomplished - as this depends on whether you are interleaving across
> host bridges (aggregation) or within a host bridge (bifurcation).
>
>
>
> ---------------------------------------------
> Nuance: Host Physical Address... translation?
> ---------------------------------------------
>
> You might have noticed that all the addresses in the examples I showed
> are direct subsets of their parent decoder address ranges. The root is
> assigned a System Physical Address according to the system memory map,
> and all decoders under it are a subset of that range.
>
> You may have even noticed routing steps suddenly change from SPA to HPA
>
> 0) CPU accesses SPA(0x101234567)
>
> 1) root decoder identifies HPA(0x101234567) is valid and forwards
> to host bridge associated with that address (port 1)
>
> So what the heck is a "Host Physical Address"?
> Why isn't everything just described as a "System Physical Address"?
>
> CXL HDM decoders *definitionally* handle HPA to DPA translations.
>
> That's it, that's the definition of an HPA.
>
> On MOST systems, what you see in the memory map is an SPA, and SPA=HPA,
> so all the decoders will appear to be programmed with SPA. The platform
> MAY perform translation before a request is routed to decoder complex.
>
> I will cover an example of this in-depth in an interleave addendum.
>
> So the answer is that some ambiguity exists regarding whether platforms
> can/should do translation prior to HDM decoders even being utilized. So
> for the sake of making everything more complicated and confusing for very
> little value:
>
> 1) decoders definitionally do "HPA to DPA" translation
> 2) most of the time "SPA=HPA"
> 3) so decoders mostly do "SPA to DPA" translation
>
> If you're confused, that's ok, I was too - and still am. But hopefully
> between this section and Section 4 (Interleave) we can be marginally
> less confused together.
>
>
> -----------------------------------------------
> Nuance: Memory Holes and Hotplug Memory Blocks!
> -----------------------------------------------
> Help, BIOS split my memory device across non-contiguous memory regions!
>
> ```
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000100000000 <- Memory Region 1
> Window size : 0000000008000000 <- 128MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> CEDT
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Reserved : 00
> Length : 002C
> Reserved : 00000000
> Window base address : 0000000110000000 <- Memory Region 2
> Window size : 0000000008000000 <- 128MB
> Interleave Members (2^n) : 00 <- Not interleaved
> First Target : 00000007 <- Host Bridge _UID
>
> Memory Map
> [mem 0x0000000100000000-0x0000000107FFFFFF] usable <- SPA
> [mem 0x0000000108000000-0x000000010FFFFFFF] reserved
> [mem 0x0000000110000000-0x0000000117FFFFFF] usable <- SPA
> ```
>
> Take a breath. Everything will be ok.
>
> You can have multiple decoders at each point in the decoder complex!
> (Most devices should implement multiple decoders).
>
> ```
> Decoders
>                       Root Port 0
>                  /                   \
>         decoder0.0                   decoder0.1
> [0x100000000, 0x108000000]   [0x110000000, 0x118000000]
>                  \                   /
>                      Host Bridge 7
>                  /                   \
>         decoder1.0                   decoder1.1
> [0x100000000, 0x108000000]   [0x110000000, 0x118000000]
>                  \                   /
>                       Endpoint 0
>                  /                   \
>         decoder2.0                   decoder2.1
> [0x100000000, 0x108000000]   [0x110000000, 0x118000000]
> ```
>
> If your BIOS adds a memory hole, it better also use multiple decoders.
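
To make the two-decoder case concrete, here is a small sketch of the
endpoint side. The table layout and hpa_to_dpa() are illustrative names
only, and the DPA bases are an assumption (the second window is placed
directly after the first in device address space); a real device may lay
out its DPA space differently.

```python
# (HPA base, size, DPA base) for decoder2.0 and decoder2.1 on Endpoint 0.
# The DPA bases are an assumption: the second window is placed directly
# after the first in device address space.
ENDPOINT_DECODERS = [
    (0x100000000, 0x8000000, 0x0000000),
    (0x110000000, 0x8000000, 0x8000000),
]


def hpa_to_dpa(hpa):
    """Translate hpa with whichever decoder claims it; None means a hole."""
    for base, size, dpa_base in ENDPOINT_DECODERS:
        if base <= hpa < base + size:
            return dpa_base + (hpa - base)
    return None  # no decoder claims this address


print(hex(hpa_to_dpa(0x110000010)))  # prints 0x8000010
print(hpa_to_dpa(0x108000000))       # prints None (inside the reserved hole)
```

A single decoder spanning both windows would also claim the reserved
range in between, which is why one decoder per window is needed.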
>
> Oh, wait, Section 2 and Section 3 allude to hotplug memory blocks
> having size and alignment issues!
>
> If your BIOS adds a memory hole, it better also do it on Linux hotplug
> memory block alignment (2GB on x86) or you'll lose 1 hotplug memory
> block of capacity per CFMWS.
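
The alignment arithmetic can be sketched as follows. This assumes that
only whole, naturally aligned hotplug memory blocks can be onlined, which
is a simplification and not a statement of exactly what the kernel does.
onlinable_span() and the 3GB window are hypothetical, chosen only to show
the effect.

```python
def onlinable_span(base, size, block):
    """Bytes of [base, base + size) that fall in whole, aligned blocks.

    Assumes block is a power of two and that partial blocks are unusable.
    """
    start = (base + block - 1) & ~(block - 1)  # round base up
    end = (base + size) & ~(block - 1)         # round end down
    return max(0, end - start)


GIB = 1 << 30
BLOCK = 2 * GIB  # memory block size quoted above for x86

# Hypothetical window: base 4GB (aligned), size 3GB (not a multiple of 2GB).
usable = onlinable_span(0x100000000, 3 * GIB, BLOCK)
lost = 3 * GIB - usable
print(usable // GIB, lost // GIB)  # prints 2 1
```

Under that assumption, the trailing partial block is the capacity that
goes missing.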
>
> Oi, talk about some rough edges, right? :[
>
> ---------------------------------------
> Nuance: BIOS vs OS Programmed Decoders.
> ---------------------------------------
> The driver can (and does) program these decoders. However, it's
> entirely normal for BIOS/EFI to program decoders prior to OS init.
>
> Earlier in section 2 I said:
> Most associations built by the driver are done by validating decoders
>
> What I meant by this is the driver does one of two things with decoders:
>
> 1) Detects BIOS programmed decoders and sanity checks them.
> If an unexpected configuration is found, it bails out.
> This memory is not accessible if EFI_MEMORY_SP is set.
>
> 2) Provides an interface for user policy configuration of the decoders
>
> For the most part, the mechanism is the same. The point of this carve-out
> is: if something isn't working, check whether the BIOS/EFI or the driver
> programmed the decoders. It will help you debug the issue more quickly.
>
> In my experience, it's USUALLY a bad ACPI table.
>
> This distinction will be more important in Section 4 (Interleave) when
> we discuss Inter-Host-Bridge and Intra-Host-Bridge interleave.
>
> ~Gregory
>