Message-ID: <496e6707-bdc9-4ad2-88e2-51236549b5f2@redhat.com>
Date: Wed, 21 May 2025 14:33:42 +0200
From: David Hildenbrand <david@...hat.com>
To: Sumanth Korikkar <sumanthk@...ux.ibm.com>
Cc: linux-mm <linux-mm@...ck.org>, Andrew Morton <akpm@...ux-foundation.org>,
Oscar Salvador <osalvador@...e.de>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
linux-s390 <linux-s390@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 1/4] mm/memory_hotplug: Add interface for runtime
(de)configuration of memory
On 21.05.25 12:34, Sumanth Korikkar wrote:
>>> Introduce new interface on s390 with the following attributes:
>>>
>>> 1) Attribute1:
>>> /sys/firmware/memory/block_size_bytes
>>
>> I assume this will be the storage increment size.
>
> Hi David,
>
> No, this is memory block size.
So, the same as /sys/devices/system/memory/block_size_bytes ?
In a future where we could have variable sized memory blocks, what would
be the granularity here?
>
>>> 2) Attribute2:
>>> /sys/firmware/memory/memoryX/config
>>> echo 0 > /sys/firmware/memory/memoryX/config -> deconfigure memoryX
>>> echo 1 > /sys/firmware/memory/memoryX/config -> configure memoryX
>>
>> And these would configure individual storage increments, essentially calling
>> add_memory() and (if possible because we could offline the memory)
>> remove_memory().
>
> configure or deconfigure memory in units of entire memory blocks.
I assume that is because a memory block is the smallest granularity at
which we can call add_memory().
And the memory block size is currently always at least the storage
increment size, correct?
>
> As I understand it, add_memory() operates on memory block granularity,
> and this is enforced by check_hotplug_memory_range(), which ensures the
> requested range aligns with the memory block size.
Yes. I was rather wondering whether we could have storage increment
size > memory block size.
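
To make the question concrete, here is a minimal sketch of the alignment
rule and of the case I am wondering about. It is plain Python for
illustration only, not kernel code; the function names and the example
sizes are assumptions and do not come from the patch.

```python
MiB = 1 << 20

def range_is_hotpluggable(start, size, block_size):
    """Mimic the alignment rule: the range must be non-empty, and both
    its start and its size must be multiples of the memory block size."""
    return size > 0 and start % block_size == 0 and size % block_size == 0

def blocks_per_increment(increment_size, block_size):
    """Number of memory blocks that make up one storage increment.

    Returns None when the increment is not a whole multiple of the block
    size (this includes increment < block), i.e. when blocks do not map
    cleanly onto increments.
    """
    if increment_size % block_size:
        return None
    return increment_size // block_size

# Hypothetical sizes: a 256 MiB storage increment with 128 MiB memory blocks.
print(range_is_hotpluggable(0, 128 * MiB, 128 * MiB))   # aligned block
print(blocks_per_increment(256 * MiB, 128 * MiB))       # blocks per increment
```

If the second value is larger than 1, configuring a single memoryX would
cover only part of a storage increment, which is the case the interface
would have to define.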
>
>>> 3) Attribute3:
>>> /sys/firmware/memory/memoryX/altmap_required
>>> echo 0 > /sys/firmware/memory/memoryX/altmap_required -> noaltmap
>>> echo 1 > /sys/firmware/memory/memoryX/altmap_required -> altmap
>>> echo N > /sys/firmware/memory/memoryX/altmap_required -> variable size
>>> altmap grouping (possible future requirements),
>>> where N specifies the number of memory blocks that the current
>>> memory block manages altmap. There are two possibilities here:
>>> * If the altmap cannot fit entirely within memoryX, it can
>>> extend into memoryX+1, meaning the altmap metadata will span
>>> across multiple memory blocks.
>>> * If the altmap for memory range cannot fit within memoryX,
>>> then config will return -EINVAL.
>>
>> Do we really still need this when we can configure/deconfigure?
>>
>> I mean, on s390x, the most important use case for memmap-on-memory was not
>> wasting memory for offline memory blocks.
>>
>> But with a configuration interface like this ... the only benefit is being
>> able to more-reliably add memory in low-memory conditions. An unlikely
>> scenario with standby storage IMHO.
>>
>> Note that I dislike exposing "altmap" to the user :) Dax calls it
>> "memmap_on_memory", and it is a device attribute.
>>
>> As soon as we go down that path we have the complexity of having to group
>> memory blocks etc, and if we can just not go down that path right now it
>> will make things a lot simpler.
>>
>> (especially, as you document above, the semantics become *really* weird)
>>
>> As yet another point, I am not sure if someone really needs a per-memory
>> block control of the memmap-on-memory feature.
>>
>> If we could simplify here, that would be great ...
>
> The original motivation for introducing memmap_on_memory on s390 was to
> avoid using online memory to store struct page metadata, particularly
> for standby memory blocks.
Right, when they were added but not online (memory not usable).
> This became critical in cases where there was
> an imbalance between standby and online memory, potentially leading to
> boot failures due to insufficient memory for metadata allocation.
Right, too much memory wasted on unused memmaps.
>
> To address this, memmap_on_memory was utilized on s390. However, in its
> current form, it adds altmap metadata at the start of each memory block
> at the time of addition, and this configuration is static. It cannot be
> changed at runtime.
Yes.
>
> I was wondering about the following practical scenario:
>
> When online memory is nearly full, the user can add a standby memory
> block with memmap_on_memory enabled. This allows the system to avoid
> consuming already scarce online memory for metadata.
Right, that's the use case I mentioned. But we're talking about ~ 2/4
MiB on s390x for a single memory block. There are other things we have
to allocate memory for when onlining memory, so there is no guarantee
that it would work with memmap_on_memory either.
It makes it more likely to succeed :)
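
For reference, a rough sketch of where the ~2/4 MiB comes from. It
assumes 4 KiB pages, a 64-byte struct page, and memory block sizes of
128 MiB and 256 MiB; these values are assumptions for illustration, and
the helper name is made up.

```python
MiB = 1 << 20

def memmap_bytes(block_size, page_size=4096, struct_page_size=64):
    """Bytes of struct page metadata needed to describe one memory block."""
    return (block_size // page_size) * struct_page_size

# Assumed block sizes; the real value comes from block_size_bytes.
for block in (128 * MiB, 256 * MiB):
    print(block // MiB, "MiB block ->", memmap_bytes(block) // MiB, "MiB memmap")
```

Under these assumptions the memmap is 1/64 of the block size, so it is
small compared to what else may need to be allocated when onlining.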
>
> After enabling and bringing that standby memory online, the user now
> has enough free online memory to add additional memory blocks without
> memmap_on_memory. These later blocks can provide physically contiguous
> memory, which is important for workloads or devices requiring continuous
> physical address space.
>
> If my interpretation is correct, I see good potential for this to be
> useful.
Again, I think only in the case where we don't have 2/4 MiB for the
memmap.
If this is triggered from inside the VM, it might just be that the admin
cannot even log in anymore to trigger this if really close to OOM ...
>
> As you pointed out, how about having something similar to
> 73954d379efd ("dax: add a sysfs knob to control memmap_on_memory behavior")
Right. But for dax, the use case is usually (a) to add a gigantic amount
of memory using add_memory(), not small blocks like on s390x, and (b) to
consume the memmap from (slow) special-purpose memory as well.
Regarding (a), the memmap could be so big that add_memory() might never
really work (not just because of some temporary low-memory situation).
>
> i.e.
>
> 1) To configure/deconfigure a memory block
> /sys/firmware/memory/memoryX/config
>
> 1 -> configure
> 0 -> deconfigure
>
> 2) Determine whether memory block should have memmap_on_memory or not.
> /sys/firmware/memory/memoryX/memmap_on_memory
> 1 -> with altmap
> 0 -> without altmap
>
> This attribute must be set before the memoryX is configured. Or else, it
> will default to CONFIG_MHP_MEMMAP_ON_MEMORY / memmap_on_memory parameter.
I don't have anything against that option. I am just wondering whether
we really have to introduce this right now.
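
Just to spell out the ordering you describe, a minimal user-space
sketch. The attribute names are the ones proposed in this RFC and do not
exist in any released kernel; the helper name and the "root" parameter
are assumptions so that the sketch can be tried against a scratch
directory instead of /sys/firmware/memory.

```python
import os

def configure_block(root, block, memmap_on_memory=None):
    """Configure memory<block> under root.

    If memmap_on_memory is given, it is written first, because the
    proposal requires it to be set before the block is configured.
    If it is None, the attribute is left alone and the default
    (CONFIG_MHP_MEMMAP_ON_MEMORY / memmap_on_memory parameter) applies.
    """
    base = os.path.join(root, f"memory{block}")
    if memmap_on_memory is not None:
        with open(os.path.join(base, "memmap_on_memory"), "w") as f:
            f.write("1" if memmap_on_memory else "0")
    with open(os.path.join(base, "config"), "w") as f:
        f.write("1")

def deconfigure_block(root, block):
    """Deconfigure memory<block> under root."""
    with open(os.path.join(root, f"memory{block}", "config"), "w") as f:
        f.write("0")
```

On a real system this would need root, and the writes could of course
fail (for example when the memory cannot be offlined).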
--
Cheers,
David / dhildenb