Message-ID: <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 15 Nov 2023 13:56:53 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Gregory Price <gregory.price@...verge.com>,
Michal Hocko <mhocko@...e.com>
Cc: "tj@...nel.org" <tj@...nel.org>, John Groves <john@...alactic.com>,
Gregory Price <gourry.memverge@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"lizefan.x@...edance.com" <lizefan.x@...edance.com>,
"hannes@...xchg.org" <hannes@...xchg.org>,
"corbet@....net" <corbet@....net>,
"roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
"shakeelb@...gle.com" <shakeelb@...gle.com>,
"muchun.song@...ux.dev" <muchun.song@...ux.dev>,
"jgroves@...ron.com" <jgroves@...ron.com>
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control
Gregory Price <gregory.price@...verge.com> writes:
> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one. Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> >
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> >
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>>
>> Right, our current cpuset implementation indeed requires callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottleneck though. If
>> anything cpusets lock optimizations would be a win also for users who do
>> not want to use weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until cpusets lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications on the pull-model make me sad.
>
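
To make the push/pull distinction above concrete, here is a minimal
Python sketch. It is not kernel code: Cpuset, Task, attach(),
read_weights() and update_weights() are hypothetical names, and a
threading.Lock only stands in for the cpuset callback lock. The
assumption is that weight updates are rare and allocations are frequent.

```python
import threading


class Task:
    """Hypothetical task holding its own private copy of the weights."""

    def __init__(self):
        self.weights = {}


class Cpuset:
    """Hypothetical stand-in for a cpuset that owns per-node weights."""

    def __init__(self, weights):
        self._lock = threading.Lock()  # models the shared callback lock
        self._weights = dict(weights)
        self._tasks = []

    def attach(self, task):
        # A newly attached task receives a snapshot of the current weights.
        with self._lock:
            self._tasks.append(task)
            task.weights = dict(self._weights)

    def read_weights(self):
        # Pull model: every reader takes the shared lock on each read.
        with self._lock:
            return dict(self._weights)

    def update_weights(self, weights):
        # Push model: take the lock once on the update path and copy the
        # new weights into every attached task.
        with self._lock:
            self._weights = dict(weights)
            for task in self._tasks:
                task.weights = dict(self._weights)
```

With the push model the allocation path reads task.weights without
touching the shared lock; the cost is that each task carries a copy
that must be refreshed on every update.
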
> Would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc) before I go mucking around
> further.
>
> So far we have:
> * mempolicy: updating weights is a very complicated undertaking,
> and no (good) way to do this from outside the task.
> would be better to have a coarser grained control.
>
> New syscall is likely needed to add/set weights in the
> per-task mempolicy, or bite the bullet on set_mempolicy2
> and make the syscall extensible for the future.
>
> * memtiers: tier=node when devices are already interleaved or when all
> devices are different, so why add yet another layer of
> complexity if other constructs already exist. Additionally,
> you lose task-placement relative weighting (or it becomes
> very complex to implement).

Because we usually have multiple nodes in one mem-tier, I still think
a mem-tier-based interface is simpler than a node-based one. But it
seems more complex to introduce mem-tier into mempolicy, especially if
we have per-task weights. So, I am fine with going with a node-based
interface.

> * cgroups: "this doesn't involve dynamic resource accounting /
> enforcement at all" and "these aren't resource
> allocations, it's unclear what the hierarchical
> relationship mean".
>
> * node: too global, explore smaller scope first then expand.

Why is it too global? I understand that it doesn't cover all possible
use cases (although I don't know whether these use cases are practical
or not). But it can provide a reasonable default per-node weight based
on available node performance information (such as HMAT, CDAT, etc.).
And quite a few workloads can just use it. I think this is a useful
feature.
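
As a rough illustration of what such a default could look like, here is
a minimal Python sketch. It is only a sketch under stated assumptions:
default_weights() and weighted_interleave() are hypothetical helpers,
the bandwidth figures are made up, and reducing bandwidth by the
greatest common divisor is just one possible way to get small weights.
It is not how the kernel derives or applies weights.

```python
from functools import reduce
from math import gcd


def default_weights(bandwidth):
    """Reduce integer per-node bandwidth figures to small integer weights.

    `bandwidth` maps node id -> bandwidth in arbitrary but consistent
    units (hypothetical values, e.g. as might be derived from HMAT/CDAT).
    """
    g = reduce(gcd, bandwidth.values())
    if g == 0:
        raise ValueError("at least one node needs non-zero bandwidth")
    return {node: bw // g for node, bw in bandwidth.items()}


def weighted_interleave(weights, npages):
    """Return the node picked for each of `npages` successive allocations.

    Nodes are visited in id order; a node with weight w receives w
    consecutive pages before the next node is used.
    """
    if not any(w > 0 for w in weights.values()):
        raise ValueError("at least one weight must be positive")
    order = []
    nodes = sorted(weights)
    while len(order) < npages:
        for node in nodes:
            for _ in range(weights[node]):
                if len(order) == npages:
                    return order
                order.append(node)
    return order
```

For example, with a hypothetical DRAM node 0 at 300 units and a CXL
node 1 at 100 units, default_weights() yields {0: 3, 1: 1}, and
weighted_interleave() places three pages on node 0 for every page on
node 1.
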
> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so I'll
> go ahead and put up another RFC for some options on that in the next
> week or so.
>
> The limitations on the first pass will be that only the task is capable
> of re-weighting should cpusets.mems or the nodemask change.
--
Best Regards,
Huang, Ying