Message-ID: <Z6OMcLt3SrsZjgvw@gourry-fedora-PF4VCD3F>
Date: Wed, 5 Feb 2025 11:06:08 -0500
From: Gregory Price <gourry@...rry.net>
To: lsf-pc@...ts.linux-foundation.org
Cc: linux-mm@...ck.org, linux-cxl@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: CXL Boot to Bash - Section 2: The Drivers

(background reading as we build up complexity)

Driver Management - Decoders, HPA/SPA, DAX, and RAS.

The Drivers
===========
----------------------
The Story Up 'til Now.
----------------------

When we left the Platform arena, assuming we've configured with special
purpose memory, we are left with an entry in the memory map like so:

BIOS-e820:   [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
/proc/iomem: c050000000-fcefffffff : Soft Reserved

This resource (see mm/resource.c) is left unused until a driver comes
along to actually surface it to allocators (or some other interface).
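
As a minimal sketch (not part of any driver), the Soft Reserved ranges can
be pulled out of /proc/iomem text like so. The function name and the regex
are illustrative assumptions; only the line format shown above is taken
from the example.

```python
import re

# Matches /proc/iomem entries such as "c050000000-fcefffffff : Soft Reserved".
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def soft_reserved_ranges(iomem_text):
    """Return (start, end, size) for every "Soft Reserved" entry (hypothetical helper)."""
    ranges = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if match is None:
            continue
        start, end, name = match.groups()
        if name == "Soft Reserved":
            start, end = int(start, 16), int(end, 16)
            # The end address in /proc/iomem is inclusive, hence the +1.
            ranges.append((start, end, end - start + 1))
    return ranges
```

Feed it the contents of /proc/iomem. Reading real addresses from that file
normally requires root; unprivileged reads show zeroed ranges.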

In our case, the drivers involved (or at least the ones we'll reference) are:

drivers/base/     : device probing, memory (block) hotplug
drivers/acpi/     : device hotplug
drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
drivers/pci/      : PCI device probing
drivers/cxl/      : CXL device probing
drivers/dax/      : CXL device to memory resource association

We don't necessarily care about the specifics of each driver; we'll
focus on just the aspects that ultimately affect memory management.

-------------------------------
Step 4: Basic build complexity.
-------------------------------
To make a long story short:

CXL Build Configurations:
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION

DAX Build Configurations:
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM

Without all of these enabled, your journey will end up cut short because
some piece of the probe process will stop progressing.
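
A quick way to catch a missing option is to scan the kernel config text
for the list above. This is a rough sketch: the helper name is made up,
and it assumes the usual `CONFIG_FOO=y` / `CONFIG_FOO=m` .config format.

```python
# Options listed above; all of them need to be enabled (built-in or module).
REQUIRED_OPTIONS = [
    "CONFIG_CXL_ACPI", "CONFIG_CXL_BUS", "CONFIG_CXL_MEM",
    "CONFIG_CXL_PCI", "CONFIG_CXL_PORT", "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX", "CONFIG_DEV_DAX_CXL", "CONFIG_DEV_DAX_KMEM",
]


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return required options not set to 'y' or 'm' (hypothetical helper)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips blank lines and "# CONFIG_FOO is not set"
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]
```

The config text typically comes from /boot/config-$(uname -r), or from
/proc/config.gz when the kernel was built with that support. An empty
result means nothing in the list is missing.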

The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
being enabled. You end up with memory regions without dax devices.

[/sys/bus/cxl/devices]# ls
dax_region0  decoder0.0  decoder1.0  decoder2.0 .....
dax_region1  decoder0.1  decoder1.1  decoder3.0 .....

^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
surface as dax devices, which can then be converted to system ram.


---------------------------------------------------------------
Step 5: The CXL driver associating devices and iomem resources.
---------------------------------------------------------------

The CXL driver wires up the following devices:
   root        :  CXL root
   portN       :  An intermediate or endpoint destination for accesses
   memN        :  memory devices


Each device in the hierarchy may have one or more decoders
   decoderN.M  :  Address routing and translation devices


The driver will also create additional objects and associations
   regionN     :  device-to-iomem resource mapping
   dax_regionN :  region-to-dax device mapping


Most associations built by the driver are done by validating decoders
against each other at each point in the hierarchy.

  Root decoders describe memory regions and route DMA to ports.
  Intermediate decoders route DMA through CXL fabric.
  Endpoint decoders translate addresses (Host to device).


A Root port has 1 decoder per associated CFMW in the CEDT
   decoder0.0  ->  `c050000000-fcefffffff   : Soft Reserved`


A region (iomem resource mapping) can be created for these decoders
   [/sys/bus/cxl/devices/region0]# cat resource size target0
      0xc050000000   0x3ca0000000   decoder5.0
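
As a sanity check, the region's `resource` and `size` values should span
exactly the Soft Reserved range from the memory map. A small sketch of
that arithmetic (the helper name is an assumption for illustration):

```python
def region_span(resource, size):
    """Return the inclusive (start, end) covered by a region (hypothetical helper).

    Both arguments are hex strings as printed by the sysfs attributes.
    """
    start = int(resource, 16)
    length = int(size, 16)
    return start, start + length - 1


start, end = region_span("0xc050000000", "0x3ca0000000")
print(f"{start:x}-{end:x}")  # prints "c050000000-fcefffffff"
```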


A dax_region surfaces these regions as a dax device
   [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
      0xc050000000


So in a simple environment with 1 device, we end up with a mapping
that looks something like this.

     root      ---   decoder0.0  --- region0 -- dax_region0 -- dax0
       |                |              |
     port1     ---   decoder1.0        |
       |                |              |
     endpoint0 ---   decoder3.0--------/


Much of the complexity in region creation stems from validating decoder
programming and associating regions with targets (endpoint decoders).

The take-away from this section is the existence of "decoders", of which
there may be an arbitrary number between the root and endpoint.
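
To get a feel for how many decoders sit at each level, the entries under
/sys/bus/cxl/devices can be grouped by name. This sketch assumes that in
`decoderN.M` the N identifies the owning port and M the decoder instance
on that port; treat that reading of the naming as an assumption, and the
helper itself as hypothetical.

```python
import re
from collections import defaultdict

DECODER_NAME = re.compile(r"^decoder(\d+)\.(\d+)$")


def decoders_by_port(names):
    """Group decoderN.M names into {N: [M, ...]}, ignoring other devices."""
    groups = defaultdict(list)
    for name in names:
        match = DECODER_NAME.match(name)
        if match:
            groups[int(match.group(1))].append(int(match.group(2)))
    return {port: sorted(indexes) for port, indexes in sorted(groups.items())}
```

Passing it `os.listdir("/sys/bus/cxl/devices")` on a CXL system gives one
entry per port that owns decoders.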

This will be relevant when we talk about RAS (Poison) and Interleave.


---------------------------------------------------------------
Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
---------------------------------------------------------------

The last step in surfacing memory to allocators is to convert a dax
device into memory blocks. On most default kernel builds, dax devices
are not automatically converted to SystemRAM.

Policy Choices
   userland policy:  daxctl
   default-online :  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
                     or
                     CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
                     or
                     memhp_default_state=*

To convert a dax device to SystemRAM utilizing daxctl:

  daxctl online-memory dax0.0 [--no-movable]

  By default the memory will online into ZONE_MOVABLE.
  The --no-movable option will online the memory in ZONE_NORMAL.
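
If this step is scripted, a thin wrapper around the command above is
enough. A sketch, assuming daxctl is installed and the script runs as
root; the function names are made up for illustration and only the
command and flag shown above are used.

```python
import subprocess


def online_memory_argv(dax_device, movable=True):
    """Build the daxctl command line for onlining a dax device."""
    argv = ["daxctl", "online-memory", dax_device]
    if not movable:
        # Online into ZONE_NORMAL instead of the default ZONE_MOVABLE.
        argv.append("--no-movable")
    return argv


def online_memory(dax_device, movable=True):
    """Run daxctl; raises CalledProcessError if the command fails."""
    return subprocess.run(online_memory_argv(dax_device, movable), check=True)
```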


Alternatively, this can be done at Build or Boot time using
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE   (v6.13 or below)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE_*       (v6.14 or above)
  memhp_default_state=*                  (boot param predating cxl)
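
To see whether the boot parameter is in play on a running system, the
kernel command line can be checked. A minimal sketch: the helper name is
hypothetical, and "last occurrence wins" is an assumption about how
repeated parameters behave.

```python
def memhp_default_state(cmdline):
    """Return the memhp_default_state value from a command line, or None."""
    state = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "memhp_default_state" and sep:
            state = value  # assumption: a later occurrence overrides an earlier one
    return state
```

The command line of the running kernel is available in /proc/cmdline.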

I will save the discussion of ZONE selection to the next section,
which will cover more memory-hotplug specifics.

At this point, the memory blocks are exposed to the kernel mm allocators
and may be used as normal System RAM.


---------------------------------------------------------
Second bit of nuanced complexity: Memory Block Alignment.
---------------------------------------------------------
In section 1, we introduced CEDT / CFMW and how they map to iomem
resources.  In this section we discussed how we surface memory blocks
to the kernel allocators.

However, at no time did platform, arch code, and driver communicate
about the expected size of a memory block. In most cases, the size
of a memory block is defined by the architecture - unaware of CXL.

On x86, for example, the heuristic for memory block size is:
   1) user boot-arg value
   2) Maximize size (up to 2GB) if operating on bare metal
   3) Use the largest value that aligns with the end of memory
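
The three steps can be sketched roughly as below. This is a simplified
model, not the kernel's code: the 128MB minimum is an assumed value, and
step 3 is modeled as a search downward from the largest candidate size,
which is an assumption about how the alignment search is done. Note that
CXL windows play no part in any of the inputs.

```python
MIN_BLOCK = 128 << 20  # assumed minimum block size (128MB)
MAX_BLOCK = 2 << 30    # 2GB maximum, per step 2 above


def pick_block_size(mem_end, boot_arg=None, bare_metal=True):
    """Model of the block size heuristic (hypothetical helper)."""
    if boot_arg is not None:
        return boot_arg          # 1) user boot-arg value
    if bare_metal:
        return MAX_BLOCK         # 2) maximize size on bare metal
    size = MAX_BLOCK             # 3) shrink until the end of memory aligns
    while size > MIN_BLOCK and mem_end % size != 0:
        size >>= 1
    return size
```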

The problem is that [SOFT RESERVED] memory is not considered in the
alignment calculation - and not all [SOFT RESERVED] memory *should*
be considered for alignment.

In the case of our working example (real system, btw):

         Subtable Type : 01 [CXL Fixed Memory Window Structure]
   Window base address : 000000C050000000
           Window size : 0000003CA0000000

The base is 256MB aligned (the minimum for the CXL Spec), and the
window size is 512MB aligned.  Assuming 2GB memory blocks, the partial
blocks at either end cannot be onlined, which results in a loss of more
than a full memory block worth of memory (~768MB on the front, and
~1792MB on the back).

This is a loss of ~1% of capacity (2.5GB) for that region (242.5GB).
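
The capacity lost to alignment can be estimated from the window base and
size. The sketch below assumes that partial blocks at either end of the
window are simply dropped; the helper name is made up, and the result
depends on the memory block size actually in use on the system.

```python
def alignment_loss(base, size, block_size):
    """Return (front, back) bytes lost to block alignment (hypothetical helper)."""
    end = base + size                                  # exclusive end of the window
    first = -(-base // block_size) * block_size        # base rounded up to a block
    last = (end // block_size) * block_size            # end rounded down to a block
    if last <= first:
        return size, 0                                 # no whole block fits at all
    return first - base, end - last


# Window from the CEDT dump above, with an assumed 2GB memory block size.
front, back = alignment_loss(0xC050000000, 0x3CA0000000, 2 << 30)
print(f"front: {front >> 20}MB, back: {back >> 20}MB")
```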

[1] has been proposed to allow for drivers (specifically ACPI) to advise
the memory hotplug system on the suggested alignment, and for arch code
to choose how to utilize this advisement.

[1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@gourry.net/


--------------------------------------------------------------------
The Complexity story up til now (what's likely to show up in slides)
--------------------------------------------------------------------
Platform and BIOS:
  May configure all the devices prior to kernel hand-off.
  May or may not support reconfiguring / hotplug.
BIOS and EFI:
  EFI_MEMORY_SP              - used to defer management to drivers
Kernel Build and Boot:
  CONFIG_EFI_SOFT_RESERVE=n  - Will always result in CXL as SystemRAM
  nosoftreserve              - Will always result in CXL as SystemRAM
  kexec                      - SystemRAM configs carry over to target
Driver Build Options Required:
  CONFIG_CXL_ACPI
  CONFIG_CXL_BUS
  CONFIG_CXL_MEM
  CONFIG_CXL_PCI
  CONFIG_CXL_PORT
  CONFIG_CXL_REGION
  CONFIG_DEV_DAX
  CONFIG_DEV_DAX_CXL
  CONFIG_DEV_DAX_KMEM
User Policy:
  CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
  CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
  memhp_default_state                  (boot param)
  daxctl online-memory daxN.Y          (userland)
Nuances:
  Early-boot resource re-use
  Memory Block Alignment

--------------------------------------------------------------------
Next Up:
   Memory (Block) Hotplug - Zones and Kernel Use of CXL
   RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
   Interleave - RAS and Region Management (Hotplug-ability)

~Gregory
