Message-ID: <87fs1nz3ee.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 03 Nov 2023 15:45:13 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Gregory Price <gregory.price@...verge.com>
Cc: Michal Hocko <mhocko@...e.com>,
Johannes Weiner <hannes@...xchg.org>,
Gregory Price <gourry.memverge@...il.com>,
<linux-kernel@...r.kernel.org>, <linux-cxl@...r.kernel.org>,
<linux-mm@...ck.org>, <akpm@...ux-foundation.org>,
<aneesh.kumar@...ux.ibm.com>, <weixugc@...gle.com>,
<apopple@...dia.com>, <tim.c.chen@...el.com>,
<dave.hansen@...el.com>, <shy828301@...il.com>,
<gregkh@...uxfoundation.org>, <rafael@...nel.org>
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave

Gregory Price <gregory.price@...verge.com> writes:

> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
>> On Wed 01-11-23 12:58:55, Gregory Price wrote:
>> > Basically consider: `numactl --interleave=all ...`
>> >
>> > If `--weights=...`: when a node hotplug event occurs, there is no
>> > recourse for adding a weight for the new node (it will default to 1).
>>
>> Correct, and this is what I was asking about in an earlier email. How
>> much do we really need to consider this setup? Is this something nice to
>> have, or does the nature of the technology require it to be fully dynamic
>> and expect new nodes coming up at any moment?
>>
>
> Dynamic Capacity is expected to cause a numa node to change size (in
> number of memory blocks) rather than cause numa nodes to come and go, so
> maybe handling the full node hotplug is a bit of an overreach.

Will the node's max bandwidth change with the number of memory blocks?

> Good call, I'll stop considering this problem for now.
>
>> > If the node is removed from the system, I believe (need to validate
>> > this, but IIRC) the node will be removed from any registered cpusets.
>> > As a result, that falls down to mempolicy, and the node is removed.
>>
>> I do not think we do anything like that. Userspace might decide to
>> change the numa mask when a node is offlined but I do not think we do
>> anything like that automagically.
>>
>
> mpol_rebind_policy is called by update_tasks_nodemask:
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
>
> which falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771
>
> /*
>  * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
>  * Call this routine anytime after node_states[N_MEMORY] changes.
>  * See cpuset_update_active_cpus() for CPU hotplug handling.
>  */
> static int cpuset_track_online_nodes(struct notifier_block *self,
> 				     unsigned long action, void *arg)
> {
> 	schedule_work(&cpuset_hotplug_work);
> 	return NOTIFY_OK;
> }
>
> void __init cpuset_init_smp(void)
> {
> 	...
> 	hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
> }
>
>
> This causes 1 of 3 situations:
> MPOL_F_STATIC_NODES: overwrite with (old & new)
> MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?)
> Default: either does a remap or replaces old with new.
>
> My assumption based on this is that a hot-unplugged node would completely
> be removed. Doesn't look like hot-add is handled at all, so I can just
> drop that entirely for now (except add a default weight of 1 in case it is
> ever added in the future).
>
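
The three cases listed above roughly correspond to the following
paraphrase of mpol_rebind_nodemask() (a simplified sketch, not the exact
upstream source):

	static void mpol_rebind_nodemask(struct mempolicy *pol,
					 const nodemask_t *nodes)
	{
		nodemask_t tmp;

		if (pol->flags & MPOL_F_STATIC_NODES)
			/* keep only user-requested nodes that are still allowed */
			nodes_and(tmp, pol->w.user_nodemask, *nodes);
		else if (pol->flags & MPOL_F_RELATIVE_NODES)
			/* fold the user mask onto the new allowed mask */
			mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
		else {
			/* default: remap old nodes onto the new mems_allowed */
			nodes_remap(tmp, pol->nodes,
				    pol->w.cpuset_mems_allowed, *nodes);
			pol->w.cpuset_mems_allowed = *nodes;
		}

		if (nodes_empty(tmp))
			tmp = *nodes;

		pol->nodes = tmp;
	}
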
> I've been pushing against the weights being in memory-tiers.c for this
> reason, as a weight set per-tier is meaningless if a node disappears.
>
> Example: Tier has 2 nodes with some weight N split between them, such
> that interleave gives each node N/2 pages. If 1 node is removed, the
> remaining node gets N pages, which is twice the allocation. Presumably
> a node is an abstraction of 1 or more devices; therefore, if the node is
> removed, the weight should change.

The per-tier weight can be defined as the interleave weight of each node
in the tier.  A tier just groups NUMA nodes with similar performance; the
performance (including bandwidth) is still per-node in the context of a
tier.  When a tier holds multiple nodes, this makes the weight definition
easier.
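
As a minimal sketch of that semantic (node_memory_tier() and the
->interleave_weight field are assumptions here, not an existing
interface), the tier-level value would simply be applied per node, so a
node leaving the tier does not change the weight of the remaining nodes:

	/* hypothetical: tier weight applied to each node individually */
	static unsigned int effective_interleave_weight(int nid)
	{
		struct memory_tier *tier = node_memory_tier(nid);

		return tier ? tier->interleave_weight : 1;
	}
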
> You could handle hotplug in tiers, but if hot-unplugging a node forcibly
> removes it from cpusets and mempolicy nodemasks, then it's irrelevant
> since the node can never get selected for allocation anyway.
>
> It's looking more like cgroups is the right place to put this.

Having a cgroup/task level interface doesn't prevent us from having a
system level interface that provides defaults for cgroups/tasks, where
performance information (e.g., from HMAT) can help define a reasonable
default automatically.
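
Something along these lines, for example (sketch only: node_bandwidth[]
and default_weight[] are assumed arrays, not an existing kernel
interface), scaling each node's default weight by its bandwidth relative
to the slowest memory node:

	static void set_default_weights_from_bandwidth(void)
	{
		unsigned int min_bw = UINT_MAX;
		int nid;

		/* find the lowest-bandwidth memory node */
		for_each_node_state(nid, N_MEMORY)
			min_bw = min(min_bw, node_bandwidth[nid]);

		/* weight each node proportionally, at least 1 */
		for_each_node_state(nid, N_MEMORY)
			default_weight[nid] =
				max(1U, node_bandwidth[nid] / min_bw);
	}
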
>>
>> Moving the global policy to cgroups would make the main concern of
>> different workloads looking for different policies less problematic.
>> I didn't have much time to think that through, but the main question is
>> how to sanely define hierarchical properties of those weights. This is
>> more of a resource distribution than enforcement, so maybe a simple
>> inherit or overwrite (if you have more specific needs) semantic makes
>> sense and is sufficient.
>>
>
> As a user I would assume it would operate much the same way as other
> nested cgroups: inherit by default (with subsets), or allow an explicit
> overwrite that can't exceed the higher-level settings.
>
> Weights could arguably allow different settings than capacity controls,
> but that could be an extension.
>
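
A minimal sketch of that inherit-or-overwrite lookup (hypothetical
structures, not a real cgroup interface): walk up the hierarchy until a
level sets an explicit weight, otherwise fall back to the system default.

	struct weight_cgroup {
		struct weight_cgroup *parent;
		u8 node_weight[MAX_NUMNODES];	/* 0 means "not set here" */
	};

	static unsigned int effective_node_weight(struct weight_cgroup *cg,
						  int nid)
	{
		for (; cg; cg = cg->parent)
			if (cg->node_weight[nid])
				return cg->node_weight[nid];

		return system_default_weight[nid];	/* assumed global default */
	}
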
>> This is not as much about the code as it is about the proper interface,
>> because that will get cast in stone once introduced. It would be really
>> bad to realize that we have a global policy that doesn't fit well and
>> have a hard time working around it without breaking anybody.
>
> o7 I concur now. I'll take some time to rework this into a
> cgroups+mempolicy proposal based on my earlier RFCs.
--
Best Regards,
Huang, Ying