Message-ID: <20250313161226.00000038@huawei.com>
Date: Thu, 13 Mar 2025 16:12:26 +0000
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Gregory Price <gourry@...rry.net>
CC: <lsf-pc@...ts.linux-foundation.org>, <linux-mm@...ck.org>,
<linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early
Boot
On Mon, 3 Mar 2025 19:32:43 -0500
Gregory Price <gourry@...rry.net> wrote:
> On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
> > ------------------------------------------------------------------
> > Step 2: BIOS / EFI generates the CEDT (CXL Early Discovery Table).
> > ------------------------------------------------------------------
> >
> > This table is responsible for reporting each "CXL Host Bridge" and
> > "CXL Fixed Memory Window" present at boot - which enables early boot
> > software to manage those devices and the memory capacity presented
> > by those devices.
> >
> > Example CEDT Entries (truncated)
> > Subtable Type : 00 [CXL Host Bridge Structure]
> > Reserved : 00
> > Length : 0020
> > Associated host bridge : 00000005
> >
> > Subtable Type : 01 [CXL Fixed Memory Window Structure]
> > Reserved : 00
> > Length : 002C
> > Reserved : 00000000
> > Window base address : 000000C050000000
> > Window size : 0000003CA0000000
> >
> > If this memory is NOT marked "Special Purpose" by BIOS (next section),
> > you should find a matching entry in the EFI Memory Map and /proc/iomem
> >
> > BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] usable
> > /proc/iomem: c050000000-fcefffffff : System RAM
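The CFMWS window above and the e820/iomem range should line up exactly. A minimal sketch (values copied from the example entries above; the constant names are mine, not from any spec) that checks the arithmetic:

```python
# Hypothetical sanity check: confirm the BIOS-e820 range matches the
# CFMWS window from the CEDT above. All values come from the example
# entries in this mail; the names are illustrative only.
CFMWS_BASE = 0x000000C050000000  # Window base address
CFMWS_SIZE = 0x0000003CA0000000  # Window size

# Range reported by BIOS-e820 / /proc/iomem (end address is inclusive)
E820_START = 0x000000C050000000
E820_END = 0x000000FCEFFFFFFF

assert E820_START == CFMWS_BASE
assert E820_END == CFMWS_BASE + CFMWS_SIZE - 1  # inclusive end
print(f"window spans {CFMWS_SIZE / (1 << 30):.1f} GiB")  # 242.5 GiB
```

If these don't line up on your machine, the window was likely carved up or marked Special Purpose (next section).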
> >
> >
> > Observation: This memory is treated as 100% normal System RAM
> >
> > 1) This memory may be placed in any zone (ZONE_NORMAL, typically)
> > 2) The kernel may use this memory for arbitrary allocations
> > 3) The driver still enumerates CXL devices and memory regions, but
> > 4) The CXL driver CANNOT manage this memory (as of today)
> >    (Caveat: *some* RAS features may still work, possibly)
> >
> > This creates a nuanced management state.
> >
> > The memory is online by default and completely usable, AND the driver
> > appears to be managing the devices - BUT the memory resources and the
> > management structure are fundamentally separate.
> > 1) CXL Driver manages CXL features
> > 2) Non-CXL SystemRAM mechanisms surface the memory to allocators.
> >
>
> Adding some additional context here
>
> -------------------------------------
> Nuance X: NUMA Nodes and ACPI Tables.
> -------------------------------------
>
> ACPI Table parsing is partially architecture/platform dependent, but
> there is common code that affects boot-time creation of NUMA nodes.
>
> NUMA-nodes are not a dynamic resource. They are (presently, Feb 2025)
> statically configured during kernel init, and the number of possible
> NUMA nodes (N_POSSIBLE) may not change during runtime.
>
> CEDT/CFMW and SRAT/Memory Affinity entries describe memory regions
> associated with CXL devices. These tables are used to allocate NUMA
> node IDs during _init.
>
> The "System Resource Affinity Table" has "Memory Affinity" entries
> which associate memory regions with a "Proximity Domain"
>
> Subtable Type : 01 [Memory Affinity]
> Length : 28
> Proximity Domain : 00000001
> Reserved1 : 0000
> Base Address : 000000C050000000
> Address Length : 0000003CA0000000
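For the curious, the on-disk layout of that subtable is fixed at 40 bytes (the `Length : 28` above is hex). A sketch of decoding one entry with Python's `struct` module; the bytes below are constructed to match the example, not dumped from real firmware:

```python
import struct

# Decode one SRAT Memory Affinity structure (40 bytes, little-endian):
# Type, Length, Proximity Domain, Reserved, Base Lo/Hi, Length Lo/Hi,
# Reserved, Flags, Reserved. Names/format are per the ACPI spec layout.
def parse_memory_affinity(raw: bytes):
    (subtype, length, proximity_domain, _rsvd1,
     base_lo, base_hi, len_lo, len_hi,
     _rsvd2, flags, _rsvd3) = struct.unpack("<BBIHIIIIIIQ", raw)
    assert subtype == 1 and length == 40
    return {
        "proximity_domain": proximity_domain,
        "base": (base_hi << 32) | base_lo,
        "length": (len_hi << 32) | len_lo,
        "enabled": bool(flags & 1),  # bit 0: entry is enabled
    }

# Bytes synthesized to match the example entry above
entry = struct.pack("<BBIHIIIIIIQ",
                    1, 40, 0x1, 0,
                    0x50000000, 0xC0,   # Base Address   0xC050000000
                    0xA0000000, 0x3C,   # Address Length 0x3CA0000000
                    0, 1, 0)
print(parse_memory_affinity(entry))
```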
>
> The "Proximity Domain" is utilized by the kernel ACPI driver to match this
> region with a NUMA node (in most cases, the proximity domains here will
> directly translate to a NUMA node ID - but not always).
>
> CEDT/CFMWS entries do not have a proximity domain - so the kernel will
> assign a window its own NUMA node IFF no SRAT Memory Affinity entry is
> present.
>
> SRAT entries are optional, CFMWS are required for each host bridge.
They aren't required for each HB. You could have multiple host bridges and one CFMWS
as long as you have decided to only support interleave.
I would only expect to see this where the bios is instantiating CFMWS
entries to match a specific locked down config though.
>
> If SRAT entries are present, one NUMA node is created for each detected
> proximity domain in the SRAT. Additional NUMA nodes are created for each
> CFMWS without a matching SRAT entry.
Don't forget the fun of CFMWS covering multiple SRAT entries (I think
we just go with the first one?)
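The allocation logic described above (one node per SRAT proximity domain, plus one per CFMWS with no SRAT coverage, first overlap winning) can be modeled as a short sketch. This is a hypothetical illustration, not kernel code, and the function name is mine:

```python
# Model of the init-time node allocation described above: one NUMA
# node per SRAT proximity domain, plus one extra node per CFMWS window
# with no overlapping SRAT entry. Where a CFMWS overlaps several SRAT
# entries, the first match wins (matching the behavior discussed here).
def assign_numa_nodes(srat_entries, cfmws_windows):
    # srat_entries: list of (proximity_domain, base, length)
    # cfmws_windows: list of (base, size)
    node_of_pxm = {}
    for pxm, _base, _length in srat_entries:
        node_of_pxm.setdefault(pxm, len(node_of_pxm))

    window_node = {}
    next_node = len(node_of_pxm)
    for wbase, wsize in cfmws_windows:
        overlap = [pxm for pxm, base, length in srat_entries
                   if base < wbase + wsize and wbase < base + length]
        if overlap:                      # first overlapping entry wins
            window_node[wbase] = node_of_pxm[overlap[0]]
        else:                            # no SRAT coverage: new node
            window_node[wbase] = next_node
            next_node += 1
    return window_node

# One window covered by SRAT PXM 1, one with no SRAT entry at all
srat = [(1, 0xC050000000, 0x3CA0000000)]
cfmws = [(0xC050000000, 0x3CA0000000), (0x10000000000, 0x400000000)]
print(assign_numa_nodes(srat, cfmws))  # second window gets a new node
```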
>
> CFMWS describes host-bridge information, and so if SRAT is missing - all
> devices behind the host bridge will become naturally associated with the
> same NUMA node.
I wouldn't go with naturally for the reason below. It happens, but maybe
not natural :)
>
>
> big long TL;DR:
>
> This creates the subtle assumption that each host-bridge will have
> devices with similar performance characteristics if they're intended
> for use as general purpose memory and/or interleave.
Not just devices, also topologies. Could well have switches below some
ports and direct connected devices on others.
>
> This means you should expect to have to reboot your machine if a
> different NUMA topology is needed (for example, if you are physically
> hotunplugging a volatile device to plug in a non-volatile device).
If the bios is friendly you should be able to map that to a different
CFMWS, but sure what bios is that nice?
>
>
>
> Stay tuned for more Fun and Profit with ACPI tables :]
:)
> ~Gregory
>