Message-ID: <9ff44eee-fc59-4aa1-a3f5-5ece8d1af5ed@amd.com>
Date: Tue, 27 Jan 2026 16:52:56 +0000
From: Alejandro Lucero Palau <alucerop@....com>
To: Tomasz Wolski <Tomasz.Wolski@...itsu.com>
Cc: Smita.KoralahalliChannabasappa@....com, alison.schofield@...el.com,
ardb@...nel.org, benjamin.cheatham@....com, bp@...en8.de,
dan.j.williams@...el.com, dave.jiang@...el.com, dave@...olabs.net,
gregkh@...uxfoundation.org, huang.ying.caritas@...il.com,
ira.weiny@...el.com, jack@...e.cz, jeff.johnson@....qualcomm.com,
jonathan.cameron@...wei.com, len.brown@...el.com, linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-pm@...r.kernel.org, lizhijian@...itsu.com, ming.li@...omail.com,
nathan.fontenot@....com, nvdimm@...ts.linux.dev, pavel@...nel.org,
peterz@...radead.org, rafael@...nel.org, rrichter@....com, skoralah@....com,
terry.bowman@....com, vishal.l.verma@...el.com, willy@...radead.org,
yaoxt.fnst@...itsu.com, yazen.ghannam@....com
Subject: Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft
Reserved memory ranges
On 10/1/25 18:15, Tomasz Wolski wrote:
>>> [responding to the questions raised here before reviewing the patch...]
>>>
>>> Koralahalli Channabasappa, Smita wrote:
>>>> Hi Alejandro,
>>>>
>>>> On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
>>>>> On 1/22/26 04:55, Smita Koralahalli wrote:
>>>>>> The current probe-time ownership check for Soft Reserved memory, based
>>>>>> solely on CXL window intersection, is insufficient. dax_hmem probing is
>>>>>> not always guaranteed to run after CXL enumeration and region assembly,
>>>>>> which can lead to incorrect ownership decisions before the CXL stack
>>>>>> has finished publishing windows and assembling committed regions.
>>>>>>
>>>>>> Introduce deferred ownership handling for Soft Reserved ranges that
>>>>>> intersect CXL windows at probe time by scheduling deferred work from
>>>>>> dax_hmem and waiting for the CXL stack to complete enumeration and region
>>>>>> assembly before deciding ownership.
>>>>>>
>>>>>> Evaluate ownership of Soft Reserved ranges based on CXL region
>>>>>> containment.
>>>>>>
>>>>>> - If all Soft Reserved ranges are fully contained within committed
>>>>>>   CXL regions, DROP handling Soft Reserved ranges from dax_hmem and
>>>>>>   allow dax_cxl to bind.
>>>>>>
>>>>>> - If any Soft Reserved range is not fully claimed by a committed CXL
>>>>>>   region, tear down all CXL regions and REGISTER the Soft Reserved
>>>>>>   ranges with dax_hmem instead.
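The all-or-nothing decision described by these two bullets can be modeled in plain C. This is a simplified userspace sketch, not the patch's code: `struct range`, the flat arrays, and the MODE_* names are illustrative stand-ins for the kernel's resource and region structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for a [start, end] physical address range. */
struct range { unsigned long long start, end; };

/* A Soft Reserved range is "claimed" only if some committed region
 * fully contains it. */
static bool range_contained(const struct range *sr,
                            const struct range *regions, size_t nr)
{
        for (size_t i = 0; i < nr; i++)
                if (regions[i].start <= sr->start && sr->end <= regions[i].end)
                        return true;
        return false;
}

/* DROP: leave the ranges to dax_cxl; REGISTER: fall back to dax_hmem. */
enum { MODE_DROP, MODE_REGISTER };

/*
 * Coarse, global policy: if every Soft Reserved range is fully contained
 * within a committed CXL region, dax_cxl may bind (DROP); if any range is
 * not fully claimed, give up on CXL for all of them (REGISTER).
 */
static int resolve_dax_cxl_mode(const struct range *srs, size_t nr_sr,
                                const struct range *regions, size_t nr_reg)
{
        for (size_t i = 0; i < nr_sr; i++)
                if (!range_contained(&srs[i], regions, nr_reg))
                        return MODE_REGISTER;
        return MODE_DROP;
}
```

Note that one uncovered range flips the outcome for every range, which is exactly the per-range-vs-global question raised below.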
>>>>> I was not sure if I was understanding this properly, but after looking
>>>>> at the code I think I do ... but then I do not understand the reason
>>>>> behind it. If I'm right, there could be two devices and therefore two
>>>>> different soft reserved ranges, with one getting an automatic cxl
>>>>> region covering its whole range and the other not, and the outcome
>>>>> would be the first one getting its region removed and handed to hmem.
>>>>> Maybe I'm missing something obvious but, why? If there is a good
>>>>> reason, I think it should be documented in the commit message and
>>>>> somewhere else.
>>>> Yeah, if I understood Dan correctly, that's exactly the intended behavior.
>>>>
>>>> I'm trying to restate the "why" behind this based on Dan's earlier
>>>> guidance. Please correct me if I'm misrepresenting it, Dan.
>>>>
>>>> The policy is meant to be coarse: If all SR ranges that intersect CXL
>>>> windows are fully contained by committed CXL regions, then we have high
>>>> confidence that the platform descriptions line up and CXL owns the memory.
>>>>
>>>> If any SR range that intersects a CXL window is not fully covered by
>>>> committed regions then we treat that as unexpected platform shenanigans.
>>>> In that situation the intent is to give up on CXL entirely for those SR
>>>> ranges because partial ownership becomes ambiguous.
>>>>
>>>> This is why the fallback is global and not per range. The goal is to
>>>> leave no room for mixed "some SR to CXL, some SR to HMEM"
>>>> configurations. Any mismatch should push the platform issue back to the
>>>> vendor to fix the description (ideally preserving the simplifying
>>>> assumption of a 1:1 correlation between CXL regions and SR ranges).
>>>>
>>>> Thanks for pointing this out. I will update the why in the next revision.
>>> You have it right. This is mostly a policy to save debug sanity and
>>> share the compatibility pain. You either always get everything the BIOS
>>> put into the memory map, or you get the fully enlightened CXL world.
>>>
>>> When accelerator memory enters the mix it does require an opt-in/out of
>>> this scheme. Either the device completely opts out of this HMEM fallback
>>> mechanism by marking the memory as Reserved (the dominant preference),
>>> or it arranges for CXL accelerator drivers to be present at boot if they
>>> want to interoperate with this fallback. Some folks want the fallback:
>>> https://lpc.events/event/19/contributions/2064/
>>
>> I will take a look at this presentation, but I think there could be
>> another option: accelerator information could be obtained during PCI
>> enumeration by the kernel, and this functionality could then use that
>> information to skip the ranges allocated to them. Forcing accelerator
>> drivers to be compiled into the kernel would go against what
>> distributions currently and widely do with initramfs. Not sure if some
>> current "early" stubs could be used for this though; the information
>> needs to be collected before this code does the checks.
>>
>>
>>>>> I also have problems understanding the concurrency when handling the
>>>>> global dax_cxl_mode variable. It is modified inside
>>>>> process_defer_work(), which I think can have different instances for
>>>>> different devices executing concurrently on different cores/workers
>>>>> (the system_wq used is not ordered). If I'm right, race conditions are
>>>>> likely.
>>> It only works as a single queue of regions. One sync point to say "all
>>> collected regions are routed into the dax_hmem or dax_cxl bucket".
>>
>> That is how I think it should work: handling all the soft reserved
>> ranges versus regions in a single code execution. But that is not the
>> case. More on that later.
>>
>>
>>>> Yeah, this is something I spent some time thinking about. My rationale
>>>> behind not having it, and where I'm still unsure:
>>>>
>>>> My assumption was that after wait_for_device_probe(), CXL topology
>>>> discovery and region commit are complete and stable.
>>> ...or more specifically, any CXL region discovery after that point is a
>>> typical runtime dynamic discovery event that is not subject to any
>>> deferral.
>>>
>>>> And each deferred worker should observe the same CXL state and
>>>> therefore compute the same final policy (either DROP or REGISTER).
>>> The expectation is one queue, one event that takes the rwsem and
>>> dispositions all present regions relative to initial soft-reserve memory
>>> map.
>>>
>>>> Also, I was assuming that even if multiple process_defer_work()
>>>> instances run, the operations they perform are effectively safe to
>>>> repeat... though I'm not sure about this.
>>> I think something is wrong if the workqueue runs more than once. It is
>>> just a place to wait for initial device probe to complete and then fixup
>>> all the regions (allow dax_region registration to proceed) that were
>>> waiting for that.
>>>
>>>> cxl_region_teardown_all(): this ultimately triggers the
>>>> devm_release_action(... unregister_region ...) path. My expectation was
>>>> that these devm actions are single-shot per device lifecycle, so
>>>> repeated teardown attempts should become no-ops.
>>> Not no-ops, right? The definition of a devm_action is that they always
>>> fire at device_del(). There is no facility to device_del() a device
>>> twice.
>>>
>>>> cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
>>>> which takes "cxl_rwsem.region". That should serialize decoder detach and
>>>> region teardown.
>>>>
>>>> bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
>>>> boot are fine, as the rescan path will simply rediscover already
>>>> present devices...
>>> The rescan path likely needs some logic to give up on CXL region
>>> autodiscovery for devices that failed their memmap compatibility check.
>>>
>>>> walk_hmem_resources(.., hmem_register_device): in the DROP case, I
>>>> thought running the walk multiple times is safe because devm-managed
>>>> platform devices and memregion allocations should prevent duplicate
>>>> lifetime issues.
>>>>
>>>> So, even if multiple process_defer_work() instances execute
>>>> concurrently, the CXL operations involved in containment evaluation
>>>> (cxl_region_contains_soft_reserve()) and teardown are already guarded.
>>>>
>>>> But I'm still trying to understand whether it is unsafe to invoke
>>>> bus_rescan_devices(&cxl_bus_type) concurrently.
>>> It already races today between natural bus enumeration and the
>>> cxl_bus_rescan() call from cxl_acpi. So it needs to be ok; it is
>>> naturally synchronized by the region's device_lock and regions' rwsem.
>>>
>>>> Or is the primary issue that dax_cxl_mode is a global updated from one
>>>> context and read from others, and should be synchronized even if the
>>>> computed final value will always be the same?
>>> There is only one global hmem_platform device, so only one potential
>>> item in this workqueue.
>>
>> Well, I do not think so.
>>
>>
>> hmem_init() in drivers/dax/device.c walks IORESOURCE_MEM looking for
>> IORES_DESC_SOFT_RESERVED descriptors and calls hmem_register_one for
>> each of them. That leads to creating an hmem_platform platform device
>> (no typo, just emphasizing this is a platform device with the name
>> hmem_platform), so there would be as many hmem_platform devices as
>> descriptors found.
>>
>>
>> Then each hmem_platform probe() will create an hmem platform device,
>> and a work item will be scheduled with that specific hmem platform
>> device as its argument. So each work item will check only the specific
>> ranges of its own pdev and not all of them. The check can result in a
>> different value being assigned to dax_cxl_mode, leading to potential
>> race conditions between concurrent workers and also potentially leaving
>> soft reserved ranges without either a dax or an hmem device.
> Hi Alejandro,
>
> Isn't the below check in __hmem_register_resource ensuring that only
> one hmem_platform device can be created? So the first resource would
> create the platform device and set the platform_initialized bool to
> true:
>
> static void __hmem_register_resource(int target_nid, struct resource *res)
> ..
> if (platform_initialized)
> return;
> ..
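For illustration, the guard quoted above follows a common run-once registration pattern. A compilable userspace sketch of the same idea (names and types simplified; not the actual kernel code):

```c
#include <stdbool.h>

/* Illustrative globals standing in for the single hmem_platform device. */
static bool platform_initialized;
static int nr_platform_devices;

/*
 * Models the quoted guard in __hmem_register_resource(): every Soft
 * Reserved resource goes through here, but only the first call creates
 * the hmem_platform device; all later calls return early.
 */
static void hmem_register_resource_sketch(void)
{
        if (platform_initialized)
                return;                 /* already created: no-op */
        platform_initialized = true;
        nr_platform_devices++;          /* only ever reaches 1 */
}
```

However many Soft Reserved descriptors are walked, only one hmem_platform device results.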
Oops. Silly me. So focused on platform_device_alloc() ...

So, the concurrency problem when updating dax_cxl_mode does not exist, as
all potential concurrent workers use the single hmem_platform device. But
I think, or maybe I'm having a really bad day, the hmem devices are
indeed created based on those soft reserved ranges when walking
hmem_active ... the work is scheduled as many times as there are hmem
devices, and maybe there could be a problem with concurrent work items
invoking cxl_region_teardown_all().
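If concurrent work items really can race into the teardown path, one classic way to guarantee the resolution runs exactly once is a once-guard around the work body. A hedged userspace sketch using pthread_once as an analogue of the kernel's run-once idioms; process_defer_work_sketch and resolve_ownership are invented names, not the patch's functions:

```c
#include <pthread.h>

/* Invented names for illustration; not the patch's actual functions. */
static pthread_once_t defer_once = PTHREAD_ONCE_INIT;
static int resolutions;

/* The one-shot body: evaluate containment, then tear down regions or
 * register with dax_hmem. Here it just counts how often it ran. */
static void resolve_ownership(void)
{
        resolutions++;
}

/*
 * Every scheduled work item funnels through this wrapper; pthread_once
 * guarantees resolve_ownership() executes exactly once, no matter how
 * many hmem devices schedule the work.
 */
static void process_defer_work_sketch(void)
{
        pthread_once(&defer_once, resolve_ownership);
}
```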
Anyways, thank you for pointing out what I could not see.
> Thanks,
> Tomasz
>