linux-kernel - Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f4bdf04d-7481-4282-b9da-ce5fcf911af9@amd.com>
Date: Tue, 27 Jan 2026 12:16:41 +0000
From: Alejandro Lucero Palau <alucerop@....com>
To: dan.j.williams@...el.com,
 "Koralahalli Channabasappa, Smita" <skoralah@....com>,
 Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
 linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
 nvdimm@...ts.linux.dev, linux-fsdevel@...r.kernel.org,
 linux-pm@...r.kernel.org
Cc: Ard Biesheuvel <ardb@...nel.org>,
 Alison Schofield <alison.schofield@...el.com>,
 Vishal Verma <vishal.l.verma@...el.com>, Ira Weiny <ira.weiny@...el.com>,
 Jonathan Cameron <jonathan.cameron@...wei.com>,
 Yazen Ghannam <yazen.ghannam@....com>, Dave Jiang <dave.jiang@...el.com>,
 Davidlohr Bueso <dave@...olabs.net>, Matthew Wilcox <willy@...radead.org>,
 Jan Kara <jack@...e.cz>, "Rafael J . Wysocki" <rafael@...nel.org>,
 Len Brown <len.brown@...el.com>, Pavel Machek <pavel@...nel.org>,
 Li Ming <ming.li@...omail.com>, Jeff Johnson
 <jeff.johnson@....qualcomm.com>, Ying Huang <huang.ying.caritas@...il.com>,
 Yao Xingtao <yaoxt.fnst@...itsu.com>, Peter Zijlstra <peterz@...radead.org>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
 Nathan Fontenot <nathan.fontenot@....com>,
 Terry Bowman <terry.bowman@....com>, Robert Richter <rrichter@....com>,
 Benjamin Cheatham <benjamin.cheatham@....com>,
 Zhijian Li <lizhijian@...itsu.com>, Borislav Petkov <bp@...en8.de>,
 Tomasz Wolski <tomasz.wolski@...itsu.com>
Subject: Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft
 Reserved memory ranges


On 1/26/26 23:53, dan.j.williams@...el.com wrote:
> [responding to the questions raised here before reviewing the patch...]
>
> Koralahalli Channabasappa, Smita wrote:
>> Hi Alejandro,
>>
>> On 1/23/2026 3:59 AM, Alejandro Lucero Palau wrote:
>>> On 1/22/26 04:55, Smita Koralahalli wrote:
>>>> The current probe time ownership check for Soft Reserved memory based
>>>> solely on CXL window intersection is insufficient. dax_hmem probing is
>>>> not
>>>> always guaranteed to run after CXL enumeration and region assembly, which
>>>> can lead to incorrect ownership decisions before the CXL stack has
>>>> finished publishing windows and assembling committed regions.
>>>>
>>>> Introduce deferred ownership handling for Soft Reserved ranges that
>>>> intersect CXL windows at probe time by scheduling deferred work from
>>>> dax_hmem and waiting for the CXL stack to complete enumeration and region
>>>> assembly before deciding ownership.
>>>>
>>>> Evaluate ownership of Soft Reserved ranges based on CXL region
>>>> containment.
>>>>
>>>>      - If all Soft Reserved ranges are fully contained within committed
>>>> CXL
>>>>        regions, DROP handling Soft Reserved ranges from dax_hmem and allow
>>>>        dax_cxl to bind.
>>>>
>>>>      - If any Soft Reserved range is not fully claimed by committed CXL
>>>>        region, tear down all CXL regions and REGISTER the Soft Reserved
>>>>        ranges with dax_hmem instead.
>>>
>>> I was not sure if I was understanding this properly, but after looking
>>> at the code I think I do ... but then I do not understand the reason
>>> behind. If I'm right, there could be two devices and therefore different
>>> soft reserved ranges, with one getting an automatic cxl region for all
>>> the range and the other without that, and the outcome would be the first
>>> one getting its region removed and added to hmem. Maybe I'm missing
>>> something obvious but, why? If there is a good reason, I think it should
>>> be documented in the commit and somewhere else.
>> Yeah, if I understood Dan correctly, that's exactly the intended behavior.
>>
>> I'm trying to restate the "why" behind this based on Dan's earlier
>> guidance. Please correct me if I'm misrepresenting it Dan.
>>
>> The policy is meant to be coarse: If all SR ranges that intersect CXL
>> windows are fully contained by committed CXL regions, then we have high
>> confidence that the platform descriptions line up and CXL owns the memory.
>>
>> If any SR range that intersects a CXL window is not fully covered by
>> committed regions then we treat that as unexpected platform shenanigans.
>> In that situation the intent is to give up on CXL entirely for those SR
>> ranges because partial ownership becomes ambiguous.
>>
>> This is why the fallback is global and not per range. The goal is to
>> leave no room for mixed some SR to CXL, some SR to HMEM configurations.
>> Any mismatch should push the platform issue back to the vendor to fix
>> the description (ideally preserving the simplifying assumption of a 1:1
>> correlation between CXL Regions and SR).
>>
>> Thanks for pointing this out. I will update the why in the next revision.
> You have it right. This is mostly a policy to save debug sanity and
> share the compatibility pain. You either always get everything the BIOS
> put into the memory map, or you get the fully enlightened CXL world.
>
> When accelerator memory enters the mix it does require an opt-in/out of
> this scheme. Either the device completely opts out of this HMEM fallback
> mechanism by marking the memory as Reserved (the dominant preference),
> or it arranges for CXL accelerator drivers to be present at boot if they
> want to interoperate with this fallback. Some folks want the fallback:
> https://lpc.events/event/19/contributions/2064/


I will take a look at this presentation, but I think there could be 
another option where accelerators information is obtained during pci 
enumeration by the kernel and using this information by this 
functionality skipping those ranges allocated to them. Forcing them to 
be compiled with the kernel would go against what distributions 
currently and widely do with initramfs. Not sure if some current "early" 
stubs could be used for this though but the information needs to be 
recollected before this code does the checks.


>>> I have also problems understanding the concurrency when handling the
>>> global dax_cxl_mode variable. It is modified inside process_defer_work()
>>> which I think can have different instances for different devices
>>> executed concurrently in different cores/workers (the system_wq used is
>>> not ordered). If I'm right race conditions are likely.
> It only works as a single queue of regions. One sync point to say "all
> collected regions are routed into the dax_hmem or dax_cxl bucket".


That is how I think it should work, handling all the soft reserved 
ranges vs regions by one code execution. But that is not the case. More 
later.


>> Yeah, this is something I spent sometime thinking on. My rationale
>> behind not having it and where I'm still unsure:
>>
>> My assumption was that after wait_for_device_probe(), CXL topology
>> discovery and region commit are complete and stable.
> ...or more specifically, any CXL region discovery after that point is a
> typical runtime dynamic discovery event that is not subject to any
> deferral.
>
>> And each deferred worker should observe the same CXL state and
>> therefore compute the same final policy (either DROP or REGISTER).
> The expectation is one queue, one event that takes the rwsem and
> dispositions all present regions relative to initial soft-reserve memory
> map.
>
>> Also, I was assuming that even if multiple process_defer_work()
>> instances run, the operations they perform are effectively safe to
>> repeat.. though I'm not sure on this.
> I think something is wrong if the workqueue runs more than once. It is
> just a place to wait for initial device probe to complete and then fixup
> all the regions (allow dax_region registration to proceed) that were
> waiting for that.
>
>> cxl_region_teardown_all(): this ultimately triggers the
>> devm_release_action(... unregister_region ...) path. My expectation was
>> that these devm actions are single shot per device lifecycle, so
>> repeated teardown attempts should become noops.
> Not noops, right? The definition of a devm_action is that they always
> fire at device_del(). There is no facility to device_del() a device
> twice.
>
>> cxl_region_teardown_all() ultimately leads to cxl_decoder_detach(),
>> which takes "cxl_rwsem.region". That should serialize decoder detach and
>> region teardown.
>>
>> bus_rescan_devices(&cxl_bus_type): I assumed repeated rescans during
>> boot are fine as the rescan path will simply rediscover already present
>> devices..
> The rescan path likely needs some logic to give up on CXL region
> autodiscovery for devices that failed their memmap compatibility check.
>
>> walk_hmem_resources(.., hmem_register_device): in the DROP case,I
>> thought running the walk multiple times is safe because devm managed
>> platform devices and memregion allocations should prevent duplicate
>> lifetime issues.
>>
>> So, even if multiple process_defer_work() instances execute
>> concurrently, the CXL operations involved in containment evaluation
>> (cxl_region_contains_soft_reserve()) and teardown are already guarded.
>>
>> But I'm still trying to understand if bus_rescan_devices(&cxl_bus_type)
>> is not safe when invoked concurrently?
> It already races today between natural bus enumeration and the
> cxl_bus_rescan() call from cxl_acpi. So it needs to be ok, it is
> naturally synchronized by the region's device_lock and regions' rwsem.
>
>> Or is the primary issue that dax_cxl_mode is a global updated from one
>> context and read from others, and should be synchronized even if the
>> computed final value will always be the same?
> There is only one global hmem_platform device, so only one potential
> item in this workqueue.


Well, I do not think so.


hmem_init() in drivers/dax/device.c walks IORESOURCE_MEM looking for 
IORES_DESC_SOFT_RESERVED descriptors and calling hmem_register_one for 
each of them. That leads to create an hmem_platform platform device (no 
typo, just emphasizing this is a platform device with name 
hmem_platform) so there will be as many hmem_platform devices as 
descriptors found.


Then each hmem_platform probe() will create an hmem platform device, 
where a work will be schedule passing this specific hmem platform device 
as argument. So, each work will check for the specific ranges of its own 
pdev and not all of them. The check can result in a different value 
assigned to dax_cxl_mode leading to potential race conditions with 
concurrent workers and also potentially leaving soft reserved ranges 
without both, a dax or an hmem device.