Message-ID: <Z8ZKKwDnuAjtyohz@gourry-fedora-PF4VCD3F>
Date: Mon, 3 Mar 2025 19:32:43 -0500
From: Gregory Price <gourry@...rry.net>
To: lsf-pc@...ts.linux-foundation.org
Cc: linux-mm@...ck.org, linux-cxl@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [LSF/MM] CXL Boot to Bash - Section 1: BIOS, EFI, and Early Boot

On Tue, Feb 04, 2025 at 09:17:09PM -0500, Gregory Price wrote:
> ------------------------------------------------------------------
> Step 2: BIOS / EFI generates the CEDT (CXL Early Discovery Table).
> ------------------------------------------------------------------
> 
> This table is responsible for reporting each "CXL Host Bridge" and
> "CXL Fixed Memory Window" present at boot - which enables early boot
> software to manage those devices and the memory capacity presented
> by those devices.
> 
> Example CEDT Entries (truncated) 
>          Subtable Type : 00 [CXL Host Bridge Structure]
>               Reserved : 00
>                 Length : 0020
> Associated host bridge : 00000005
> 
>          Subtable Type : 01 [CXL Fixed Memory Window Structure]
>               Reserved : 00
>                 Length : 002C
>               Reserved : 00000000
>    Window base address : 000000C050000000
>            Window size : 0000003CA0000000
> 
> If this memory is NOT marked "Special Purpose" by BIOS (next section),
> you should find a matching entry in the EFI Memory Map and /proc/iomem:
> 
> BIOS-e820:   [mem 0x000000c050000000-0x000000fcefffffff] usable
> /proc/iomem: c050000000-fcefffffff : System RAM
> 
> 
> Observation: This memory is treated as 100% normal System RAM
> 
>    1) This memory may be placed in any zone (ZONE_NORMAL, typically)
>    2) The kernel may use this memory for arbitrary allocations
>    3) The driver still enumerates CXL devices and memory regions, but
>    4) The CXL driver CANNOT manage this memory (as of today)
>       (Caveat: *some* RAS features may still work, possibly)
> 
> This creates a nuanced management state.
> 
> The memory is online by default and completely usable, AND the driver
> appears to be managing the devices - BUT the memory resources and the
> management structure are fundamentally separate.
>    1) CXL Driver manages CXL features
>    2) Non-CXL SystemRAM mechanisms surface the memory to allocators.
> 

Adding some additional context here

-------------------------------------
Nuance X: NUMA Nodes and ACPI Tables.
-------------------------------------

ACPI Table parsing is partially architecture/platform dependent, but
there is common code that affects boot-time creation of NUMA nodes.

NUMA nodes are not a dynamic resource.  They are (presently, Feb 2025)
statically configured during kernel init, and the number of possible
NUMA nodes (N_POSSIBLE) cannot change at runtime.
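
To see what the kernel settled on at boot, you can read the node masks
from sysfs.  A minimal sketch, assuming the usual
/sys/devices/system/node/possible and .../online files are present
(parse_node_list and read_nodes are helper names made up for this
example, not kernel or library APIs):

```python
from pathlib import Path


def parse_node_list(text):
    """Expand a sysfs list string such as "0-1,4" into sorted node IDs."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty mask, e.g. a blank file
        lo, _, hi = part.partition("-")
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return sorted(nodes)


def read_nodes(kind, root="/sys/devices/system/node"):
    """Read a node mask file ('possible', 'online'); None if it is absent."""
    path = Path(root) / kind
    return parse_node_list(path.read_text()) if path.exists() else None


if __name__ == "__main__":
    for kind in ("possible", "online"):
        print(kind, read_nodes(kind))
```

A node that is possible but not online is typically one reserved at
boot for memory that has not been onlined yet.  The "possible" set is
the one that stays fixed until the next reboot.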

CEDT/CFMWS and SRAT/Memory Affinity entries describe memory regions
associated with CXL devices.  These tables are used to allocate NUMA
node IDs during __init.

The "System Resource Affinity Table" (SRAT) has "Memory Affinity" entries
which associate memory regions with a "Proximity Domain":

        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001
            Reserved1 : 0000
         Base Address : 000000C050000000
       Address Length : 0000003CA0000000

The "Proximity Domain" is used by the kernel ACPI driver to match this
region with a NUMA node (in most cases, the proximity domains here will
directly translate to a NUMA node ID - but not always).
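
As a sanity check on the numbers: the SRAT entry above carries the same
base and length as the example CFMWS window in the quoted text, and
base + length - 1 lands on the end address shown in the e820 and
/proc/iomem lines.  A small sketch of that arithmetic (the constant and
function names are made up for illustration):

```python
# Values copied from the example SRAT Memory Affinity entry and the
# example CFMWS entry in this thread.
SRAT_BASE = 0x000000C050000000
SRAT_LENGTH = 0x0000003CA0000000
CFMWS_BASE = 0x000000C050000000
CFMWS_SIZE = 0x0000003CA0000000


def end_address(base, length):
    """Last byte address covered by [base, base + length)."""
    return base + length - 1


def covers(outer_base, outer_len, inner_base, inner_len):
    """True if the outer range fully contains the inner range."""
    return (outer_base <= inner_base and
            inner_base + inner_len <= outer_base + outer_len)


if __name__ == "__main__":
    print(hex(SRAT_BASE), "-", hex(end_address(SRAT_BASE, SRAT_LENGTH)))
    print("size in GiB:", SRAT_LENGTH / 2**30)
    print("SRAT covers CFMWS:",
          covers(SRAT_BASE, SRAT_LENGTH, CFMWS_BASE, CFMWS_SIZE))
```

Here the SRAT entry covers the whole window, so the window inherits the
NUMA node of Proximity Domain 1 rather than getting a node of its own.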

CEDT/CFMWS entries do not carry a proximity domain - so the kernel will
assign a CFMWS its own NUMA node association IFF no SRAT Memory Affinity
entry is present for that window.

SRAT entries are optional; CFMWS are required for each host bridge.

If SRAT entries are present, one NUMA node is created for each detected
proximity domain in the SRAT. Additional NUMA nodes are created for each
CFMWS without a matching SRAT entry.
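
The rule above can be modeled in a few lines.  This is a toy model, NOT
the kernel's code: it assumes node IDs are handed out in first-seen
order and that "matching" means the SRAT range overlaps the window.
The real implementation differs in its details, and every name below
(Range, overlaps, assign_nodes) is made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    base: int
    length: int

    @property
    def end(self):
        return self.base + self.length  # exclusive end


def overlaps(a, b):
    return a.base < b.end and b.base < a.end


def assign_nodes(srat, cfmws):
    """srat: list of (proximity_domain, Range); cfmws: list of Range.

    Returns (pxm_to_node, window_nodes), where window_nodes[i] is the
    node ID modeled for cfmws[i].
    """
    # One node per distinct proximity domain seen in the SRAT.
    pxm_to_node = {}
    for pxm, _ in srat:
        if pxm not in pxm_to_node:
            pxm_to_node[pxm] = len(pxm_to_node)

    # One extra node per CFMWS that no SRAT entry describes.
    next_node = len(pxm_to_node)
    window_nodes = []
    for window in cfmws:
        match = next((pxm for pxm, r in srat if overlaps(r, window)), None)
        if match is not None:
            window_nodes.append(pxm_to_node[match])
        else:
            window_nodes.append(next_node)
            next_node += 1
    return pxm_to_node, window_nodes


if __name__ == "__main__":
    dram = Range(0x0, 0x80000000)
    cxl0 = Range(0xC050000000, 0x3CA0000000)   # has an SRAT entry
    cxl1 = Range(0x10000000000, 0x1000000000)  # hypothetical, no SRAT entry
    print(assign_nodes([(0, dram), (1, cxl0)], [cxl0, cxl1]))
    print(assign_nodes([], [cxl0, cxl1]))      # SRAT absent
```

With the SRAT present, the first window lands on the node for Proximity
Domain 1 and the second gets a new node.  With no SRAT at all, each
window gets its own node.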

A CFMWS describes host-bridge information, so if the SRAT is missing, all
devices behind the host bridge will naturally become associated with the
same NUMA node.


big long TL;DR:

This creates the subtle assumption that each host-bridge will have
devices with similar performance characteristics if they're intended
for use as general purpose memory and/or interleave.

This also means you should expect to have to reboot your machine if a
different NUMA topology is needed (for example, if you are physically
hot-unplugging a volatile device to plug in a non-volatile device).



Stay tuned for more Fun and Profit with ACPI tables :]
~Gregory
