Message-ID: <pmxrljwp4ayl3fcu7rxm6prbumgb5l3lwb75lqfipmxxxwnqfo@nb5qjcxw22gp>
Date: Wed, 1 Nov 2023 14:45:50 +0100
From: Michal Hocko <mhocko@...e.com>
To: Gregory Price <gregory.price@...verge.com>
Cc: Johannes Weiner <hannes@...xchg.org>,
Gregory Price <gourry.memverge@...il.com>,
linux-kernel@...r.kernel.org, linux-cxl@...r.kernel.org,
linux-mm@...ck.org, ying.huang@...el.com,
akpm@...ux-foundation.org, aneesh.kumar@...ux.ibm.com,
weixugc@...gle.com, apopple@...dia.com, tim.c.chen@...el.com,
dave.hansen@...el.com, shy828301@...il.com,
gregkh@...uxfoundation.org, rafael@...nel.org
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
On Tue 31-10-23 00:27:04, Gregory Price wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>
> > > This hopefully also explains why it's a global setting. The usecase is
> > > different from conventional NUMA interleaving, which is used as a
> > > locality measure: spread shared data evenly between compute
> > > nodes. This one isn't about locality - the CXL tier doesn't have local
> > > compute. Instead, the optimal spread is based on hardware parameters,
> > > which is a global property rather than a per-workload one.
> >
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> >
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> >
>
> I had originally implemented it this way while experimenting with new
> mempolicies.
>
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/
>
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
> non-trivial task. It is very "current-task" centric.
True. Cpusets are the way to make it less process centric, but that comes
with its own constraints (namely which NUMA policies are supported).
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
> implementing weights in the mempolicy are either a) a new flag and
> setting every weight individually in many syscalls, or b) a new
> syscall (set_mempolicy2), which is what I demonstrated in the RFC.
Yes, that would likely require a new syscall.
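For reference, a minimal userspace sketch (not taken from either RFC, and
using only the existing libnuma interface) of how plain interleave is set
up today; note there is nowhere to express per-node weights, which is what
forces the choice between new flags and a new syscall:

/*
 * Minimal sketch: today's unweighted interleave from userspace.
 * Build with: gcc demo.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* interleave across nodes 0 and 1 (one bit per node) */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return EXIT_FAILURE;
	}

	/*
	 * Pages faulted in from here on alternate 1:1 between the two
	 * nodes.  A weighted spread (say 3:1 for DRAM vs. CXL) cannot
	 * be expressed through this interface.
	 */
	printf("MPOL_INTERLEAVE set across nodes 0 and 1\n");
	return EXIT_SUCCESS;
}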
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
> end up with a rat's nest of interactions between mempolicy nodemasks
> changing as a result of cgroup migrations, nodes potentially coming
> and going (hotplug under CXL), and others I'm probably forgetting.
Is this really any different from what you are proposing though?
> Basically: If a node leaves the nodemask, should you retain the
> weight, or should you reset it? If a new node comes into the node
> mask... what weight should you set? I did not have answers to these
> questions.
I am not really sure I follow you. Are you talking about cpuset
nodemask changes or memory hotplug here?
> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here:
>
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
>
> This had a similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight? Split it up among the remaining nodes?
> Rebalance? Etc.
How is this any different from a node becoming depleted? You cannot
really expect to get the memory you are asking for, and you can easily
end up getting memory from a different node instead.
> The result of this discussion led us to simply say "What if we place
> the weights directly in the node". And that led us to this RFC.
Maybe I am missing something really crucial here, but I do not see how
this fundamentally changes anything.
Memory hotremove (or mere node memory depletion) is not really a problem,
because interleaving is a best-effort operation, so you have to live with
memory not being strictly distributed per your preferences.
Memory hotadd will be easier to manage because you just update a single
place after a node is hotadded, rather than gazillions of partial policies.
But that requires the interleave policy nodemask to anticipate future
nodes coming online and include them in the mask.
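To make the intended behaviour concrete, here is a toy model (emphatically
not the kernel implementation, and the weights are made up) of how per-node
weights would shape the placement ratio under weighted interleave:

/* Toy model: weights {3, 1} give a 3:1 page split between two nodes. */
#include <stdio.h>

#define NR_NODES 2

int main(void)
{
	unsigned int weight[NR_NODES] = { 3, 1 };	/* hypothetical node weights */
	unsigned int count[NR_NODES] = { 0, 0 };
	unsigned int node = 0, credit = weight[0];

	for (int page = 0; page < 16; page++) {
		count[node]++;
		if (--credit == 0) {		/* weight spent: move to next node */
			node = (node + 1) % NR_NODES;
			credit = weight[node];
		}
	}

	for (int n = 0; n < NR_NODES; n++)
		printf("node %d: %u of 16 pages\n", n, count[n]);
	return 0;
}

With a 3:1 weighting, 12 of the 16 pages land on node 0 and 4 on node 1,
regardless of which task is faulting - which is exactly the global,
bandwidth-matching behaviour being argued about above.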
> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
>
> One of the benefits of having it as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables),
> and programs already using MPOL_INTERLEAVE will have a direct benefit.
Right. This is understood. My main concern is whether this outweighs
the limitations of having a _global_ policy _only_. Historically, a single
global policy has usually led to finding ways to make it more scoped
(usually through cgroups).
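As an aside on the auto-generation point above, one plausible derivation
(purely illustrative; the actual algorithm in the patch set may differ) is
to reduce the HMAT-reported bandwidths to a small integer ratio:

/* Illustration: bandwidth figures reduced to interleave weights via GCD. */
#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	/* example bandwidths in MB/s: local DRAM vs. a CXL expander */
	unsigned int bw[2] = { 240000, 80000 };
	unsigned int g = gcd(bw[0], bw[1]);

	printf("node 0 weight: %u\n", bw[0] / g);	/* -> 3 */
	printf("node 1 weight: %u\n", bw[1] / g);	/* -> 1 */
	return 0;
}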
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> alongside this patch so that MPOL_INTERLEAVE is left entirely alone.
>
> Happy to discuss more,
> ~Gregory
--
Michal Hocko
SUSE Labs