Message-ID: <09794c70-06a2-44dc-8e54-bc6e6a7d6c74@redhat.com>
Date: Mon, 28 Jul 2025 17:15:02 +0200
From: David Hildenbrand <david@...hat.com>
To: Hannes Reinecke <hare@...e.de>, Michal Hocko <mhocko@...e.com>
Cc: Oscar Salvador <osalvador@...e.de>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, Hannes Reinecke <hare@...nel.org>
Subject: Re: [RFC] Disable auto_movable_ratio for selfhosted memmap

On 28.07.25 11:37, Hannes Reinecke wrote:
> On 7/28/25 11:10, David Hildenbrand wrote:
>> On 28.07.25 11:04, Michal Hocko wrote:
>>> On Mon 28-07-25 10:53:08, David Hildenbrand wrote:
>>>> On 28.07.25 10:48, Michal Hocko wrote:
>>>>> On Mon 28-07-25 10:15:47, Oscar Salvador wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently, we have several mechanisms to pick a zone for the new
>>>>>> memory we are onlining. Eventually, we land on zone_for_pfn_range(),
>>>>>> which picks the zone.
>>>>>>
>>>>>> Two of these mechanisms are the 'movable_node' and 'auto-movable'
>>>>>> policies. The former puts every hotplugged memory block in
>>>>>> ZONE_MOVABLE (unless we can keep zones contiguous by not doing so),
>>>>>> while the latter puts it in ZONE_MOVABLE iff we stay within the
>>>>>> established MOVABLE:KERNEL ratio.
>>>>>>
>>>>>> It seems the latter doesn't play well with CXL memory, where CXL
>>>>>> cards hold really large amounts of memory, making the ratio fail;
>>>>>> and since a CXL card must be removed as a unit, that can't be done
>>>>>> if any of its memory blocks fell within a !ZONE_MOVABLE zone.
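
For reference, the policy and ratio in question are, as far as I recall,
tunable via the memory_hotplug module parameters; the names and defaults
below are from memory and should be double-checked against
mm/memory_hotplug.c for a given kernel:

	movable_node                                # prefer ZONE_MOVABLE for hotplugged memory
	memory_hotplug.online_policy=auto-movable   # ratio-based zone selection
	memory_hotplug.auto_movable_ratio=301       # allow up to a 3.01:1 MOVABLE:KERNEL ratio
	memory_hotplug.memmap_on_memory=1           # selfhosted (altmap) memmap
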
>>>>>
>>>>> I suspect this is just an example of how our existing memory hotplug
>>>>> interface based on memory blocks is suboptimal and doesn't fit new
>>>>> use cases. We should start thinking about what a new v2 API should
>>>>> look like. I am not sure what that should be, but I believe we should
>>>>> be able to express a "device" as a whole rather than having very
>>>>> loosely bound generic memory blocks. Anyway, this is likely a longer
>>>>> discussion and a long-term plan rather than something addressing this
>>>>> particular issue.
>>>>
>>>> We have that concept with memory groups in the kernel already.
>>>
>>> I must have missed that. I will have a look, thanks! Do we have any
>>> documentation for that? Memory group is an overloaded term in the
>>> kernel.
>>
>> It's an internal concept so far; the grouping is not exposed to user space.
>>
>> We have kerneldoc for, e.g., "struct memory_group". From there:
>>
>> "A memory group logically groups memory blocks; each memory block
>> belongs to at most one memory group. A memory group corresponds to a
>> memory device, such as a DIMM or a NUMA node, which spans multiple
>> memory blocks and might even span multiple non-contiguous physical
>> memory ranges."
>>
>>>
>>>> In dax/kmem we register a static memory group. It will be considered
>>>> one unit.
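
A rough sketch of what that registration looks like (simplified from my
recollection of drivers/dax/kmem.c; treat the exact signatures as
approximate):

	/* Register one static group covering all ranges of the device. */
	int mgid = memory_group_register_static(numa_node, PFN_UP(total_len));
	if (mgid < 0)
		return mgid;

	/* Every range is hot-added under the same group id, so the kernel
	 * can treat the whole device as one unit for zone decisions. */
	rc = add_memory_driver_managed(mgid, range.start, range_len(&range),
				       kmem_name, MHP_NID_IS_MGID);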
>>>
>>> But we still export those memory blocks and let udev or whoever act
>>> on them, right? If that is the case then ...
>>
>> Yes.
>>
>>>
>>> [...]
>>>
>>>> daxctl wants to online memory itself. We want to keep that memory
>>>> offline
>>>> from a kernel perspective and let daxctl handle it in this case.
>>>>
>>>> We have that problem in RHEL where we currently require user space to
>>>> disable udev rules so daxctl "can win".
>>>
>>> ... this is the result. Those shouldn't really race. If udev is supposed
>>> to see the device, then only in its entirety, so regular memory-block
>>> based onlining rules shouldn't even see that memory. Or am I completely
>>> missing the picture?
>>
>> We can't break user space, which relies on individual memory blocks.
>>
>> So udev or $whatever will right now see individual memory blocks. We
>> could export the group id to user space if that is of any help, but at
>> least for daxctl purposes, it will be sufficient to identify "oh, this
>> was added by dax/kmem" (which we can obtain from /proc/iomem) and say
>> "okay, I'll let user-space deal with it."
>>
>> Having the whole thing exposed as a unit is not really solving any
>> problems unless I am missing something important.
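
For the "added by dax/kmem" check, the ranges show up in /proc/iomem under
a dedicated resource name, so something along these lines should be enough
for daxctl or a udev rule to tell them apart (resource string and rule
syntax from memory, to be verified):

	# Is there dax/kmem-added memory on this system?
	grep -q 'System RAM (kmem)' /proc/iomem

	# The kind of auto-online udev rule daxctl currently races with:
	# SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"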
>>
> Basically it boils down to:
> Who should be responsible for onlining the memory?
> 
> As it stands, we have two methods:
> - user-space as per sysfs attributes
> - kernel policy
> 
> And to make matters worse, we have two competing user-space programs:
> - udev
> - daxctl
> neither of which is (or can be made) aware of the other.
> This leads to races and/or inconsistencies.
> 
> As we've seen, the current kernel policy (cf. the 'ratio' discussion)
> doesn't really fit how users expect CXL to work, so one is tempted to
> not have the kernel do the onlining at all. But then the user is caught
> in the udev vs. daxctl race, requiring awkward kludges on either side.
>
> Can't we make daxctl aware of udev? I.e., have daxctl call out to
> udev and just wait for udev to complete its thing?
> At worst we're running into a timeout if some udev rules are garbage,
> but daxctl will be able to see the final state and we would avoid
> the need for modifying and/or moving udev rules.
> (Which, incidentally, is required on SLES, too :-)
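
To make the two onlining paths concrete (standard memory-hotplug sysfs
interface; the block number below is just an example):

	# kernel policy: auto-online every newly added block
	echo online_movable > /sys/devices/system/memory/auto_online_blocks

	# user space (udev or daxctl): online blocks explicitly
	echo online_movable > /sys/devices/system/memory/memory128/state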

I will try moving away from udev for memory onlining completely in RHEL 
-- let's see if I succeed ;).

We really want to make use of auto-onlining in the kernel where
possible, and do it manually in user space only in a handful of cases
(e.g., CXL, standby memory on s390x). Configuring auto-onlining is the
part that still has to be done manually by the admin, and that's really
the nasty bit.


> 
> Discussion point for LPC?

Yes, probably.

-- 
Cheers,

David / dhildenb

