Message-ID: <a93beb19-fa0d-4000-812a-a4bfd88d40e5@redhat.com>
Date: Thu, 13 Feb 2025 14:30:59 +0100
From: David Hildenbrand <david@...hat.com>
To: "Luck, Tony" <tony.luck@...el.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: "Moore, Robert" <robert.moore@...el.com>,
"Wysocki, Rafael J" <rafael.j.wysocki@...el.com>, Len Brown
<lenb@...nel.org>, "linux-acpi@...r.kernel.org"
<linux-acpi@...r.kernel.org>,
"acpica-devel@...ts.linux.dev" <acpica-devel@...ts.linux.dev>,
Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
"x86@...nel.org" <x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>,
Oscar Salvador <osalvador@...e.de>, Danilo Krummrich <dakr@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/4] ACPI/MRRM: Add "node" symlink to
/sys/devices/system/memory/rangeX
On 11.02.25 19:05, Luck, Tony wrote:
>>> What is going to remove this symlink if the memory goes away? Or do
>>> these never get removed?
>>>
>>> symlinks in sysfs created like this always worry me. What is going to
>>> use it?
>>
>> On top of that, we seem to be building a separate hierarchy here.
>>
>> /sys/devices/system/memory/ operates in memory block granularity.
>
> What defines the memory blocks? I'd initially assumed some connection
> to the ACPI SRAT table. But on my test system there are only three
> entries in SRAT that define non-zero sized memory blocks (two on
> socket/node 0 and one on socket/node 1), yet there are:
> memory0 .. memory32 directories
> in /sys/devices/system/memory.
Each memory block is the same size (e.g., 128 MiB .. 2 GiB on x86-64).
The default is memory section granularity (e.g., 128 MiB on x86-64), but
some configs allow for increasing it: see
arch/x86/mm/init_64.c:memory_block_size_bytes(), and in particular
probe_memory_block_size().
They define the granularity at which we can online/offline/add/remove
physical memory managed by the buddy.
We create these blocks during boot and during hotplug, and link them to
the relevant nodes.
They do not reflect the HW state, but the state in which Linux manages
that memory (through the buddy).
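To see which blocks are linked to which node, one can walk the memoryY links under /sys/devices/system/node/nodeX/. A minimal sketch in Python follows; the helper names (group_blocks_by_node, read_node_blocks) are made up for illustration and are not kernel or library APIs.

```python
import re
from collections import defaultdict
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")

# Matches ".../nodeX/memoryY" and skips other entries that merely start with "memory".
_LINK = re.compile(r"node(\d+)/memory(\d+)$")


def group_blocks_by_node(paths):
    """Group memory block IDs by node ID from paths like '.../node0/memory3'."""
    nodes = defaultdict(list)
    for path in paths:
        match = _LINK.search(str(path))
        if match:
            nodes[int(match.group(1))].append(int(match.group(2)))
    return {node: sorted(blocks) for node, blocks in sorted(nodes.items())}


def read_node_blocks(root=NODE_ROOT):
    """Hypothetical helper: collect the node -> memory block links from sysfs."""
    return group_blocks_by_node(root.glob("node*/memory*"))
```

On a live system, read_node_blocks() would return something like a dict from node ID to the list of block IDs linked under that node; the exact contents depend on the machine.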
>
> The phys_device and phys_index files aren't helping me figure out
> what each of them mean.
Yes, see Documentation/admin-guide/mm/memory-hotplug.rst
phys_device is a legacy thing for s390x, and phys_index is just the
memory block ID.
You can derive the address range corresponding to a memory block using
the ID.
/sys/devices/system/memory/block_size_bytes tells you the size of each
block.
Address range of block X:
[ X*block_size_bytes .. (X+1)*block_size_bytes )
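As a rough sketch of that calculation in Python (the function names are illustrative only; note that block_size_bytes is printed in hexadecimal in sysfs, while the X in memoryX is decimal):

```python
from pathlib import Path

SYSFS_MEMORY = Path("/sys/devices/system/memory")


def parse_block_size(text):
    """Parse the content of block_size_bytes, which sysfs prints as hex."""
    return int(text.strip(), 16)


def block_range(block_id, block_size_bytes):
    """Return the half-open range [start, end) covered by memory block X."""
    start = block_id * block_size_bytes
    return start, start + block_size_bytes


def read_block_size(root=SYSFS_MEMORY):
    """Hypothetical helper: read the block size from a live system."""
    return parse_block_size((root / "block_size_bytes").read_text())
```

For example, with 128 MiB blocks, block_range(32, 0x8000000) yields (0x100000000, 0x108000000), i.e. memory32 starts at 4 GiB.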
Now, the whole interface here is designed for handling memory hotplug:
obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o
It's worth noting that
1) Blocks might not be all-memory (e.g., memory holes). In that case,
offlining/unplug is not supported.
2) Blocks might span multiple NUMA nodes (e.g., node ends / starts in
the middle of a block). Similarly, in that case
offlining/unplug is not supported.
I assume 1) is not a problem. I assume 2) could be a problem for your
use case.
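To make 2) concrete for an arbitrary physical range (a node's range, or an MRRM range), here is a small Python sketch that maps a half-open range to the memory block IDs it touches and reports whether both boundaries fall on block boundaries. blocks_for_range is a made-up name, and the check is plain arithmetic, not something the kernel exposes.

```python
def blocks_for_range(start, end, block_size_bytes):
    """Map the half-open range [start, end) to memory block IDs.

    Returns (ids, aligned): ids is a range of block IDs overlapping the
    range; aligned is True only if neither boundary cuts through a block.
    """
    if end <= start:
        raise ValueError("empty or inverted range")
    first = start // block_size_bytes
    last = (end - 1) // block_size_bytes
    aligned = start % block_size_bytes == 0 and end % block_size_bytes == 0
    return range(first, last + 1), aligned
```

If aligned comes back False, at least one block is shared between this range and whatever sits next to it, which is the situation described in 2).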
>
>> /sys/devices/system/node/nodeX/ links to memory blocks that belong to it.
>>
>> Why is the memory-block granularity insufficient, and why do we have to
>> squeeze in another range API here?
>
> If an MRRM range consists of some set of memory blocks (making
> sure that no memory block spans across MRRM range boundaries,
> then I could add the {local,remote}_region_id files into the memory
> block directories.
>
> This could work now while the region assignments are done by the
> BIOS. But in the future when OS gets the opportunity to change them
> it might be weird if an MRRM range consists of multiple memory
> block range, since the region_ids in each all update together.
What about memory ranges not managed by the buddy (e.g., dax/pmem ranges
not exposed to the buddy through dax/kmem driver, memory hidden from
Linux using mem=X etc.)?
>
> /sys/devices/system/memory seemed like a logical place for
> memory ranges. But should I jump up a level and make a new
> /sys/devices/system/memory_regions directory to expose these
> ranges?
Let's take one step back. We do have
1) /proc/iomem to list physical device ranges, without a notion of nodes
/ other information. Maybe we could extend it, but it might be hard.
Depending on *what* information we need to expose and for which memory.
/proc/iomem also doesn't indicate "System RAM" for memory not managed by
the buddy.
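For reference, a minimal Python sketch of pulling the top-level "System RAM" ranges out of /proc/iomem. It assumes the usual "start-end : name" line format with inclusive end addresses and two spaces of indentation per nesting level; the function names are illustrative. Reading real addresses generally requires root, otherwise they show up as zeros.

```python
import re

# "  00100000-7fffffff : System RAM" -> indent, start, end (inclusive), name
_IOMEM_LINE = re.compile(r"^(\s*)([0-9a-f]+)-([0-9a-f]+) : (.*)$")


def parse_iomem(text):
    """Return (depth, start, end_exclusive, name) for each parsable line."""
    entries = []
    for line in text.splitlines():
        match = _IOMEM_LINE.match(line)
        if not match:
            continue
        indent, start, end, name = match.groups()
        entries.append(
            (len(indent) // 2, int(start, 16), int(end, 16) + 1, name.strip())
        )
    return entries


def system_ram_ranges(text):
    """Top-level "System RAM" entries as half-open (start, end) ranges."""
    return [
        (start, end)
        for depth, start, end, name in parse_iomem(text)
        if depth == 0 and name == "System RAM"
    ]
```

Usage would be along the lines of system_ram_ranges(open("/proc/iomem").read()). As noted above, this carries no node information.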
2) /sys/devices/system/memory/memoryX and /sys/devices/system/node/
Again, the memory part is more hotplug-focused, and we treat
individual memory blocks as "memory block devices".
Reading:
"
The MRRM solution is to tag physical address ranges with "region IDs"
so that platform firmware[1] can indicate the type of memory for each
range (with separate tags available for local vs. remote access to
each range).
The region IDs will be used to provide separate event counts for each
region for "perf" and for the "resctrl" file system to monitor and
control memory bandwidth in each region.
Users will need to know the address range(s) that are part of each
region."
A couple of questions:
a) How volatile is that information at runtime? Can ranges / IDs change?
I read above that user space might in the future be able to
reconfigure the ranges.
b) How is hotplug/unplug handled?
c) How are memory ranges not managed by Linux handled?
It might make sense to expose what you need in a more specialized,
acpi/MRRM/perf specific form, and not as generic as you currently
envision it.
--
Cheers,
David / dhildenb