Message-ID: <87sf4i2xe1.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 04 Dec 2023 16:19:02 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Gregory Price <gregory.price@...verge.com>
Cc: Michal Hocko <mhocko@...e.com>, "tj@...nel.org" <tj@...nel.org>,
"John Groves" <john@...alactic.com>,
Gregory Price <gourry.memverge@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"lizefan.x@...edance.com" <lizefan.x@...edance.com>,
"hannes@...xchg.org" <hannes@...xchg.org>,
"corbet@....net" <corbet@....net>,
"roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
"shakeelb@...gle.com" <shakeelb@...gle.com>,
"muchun.song@...ux.dev" <muchun.song@...ux.dev>,
"jgroves@...ron.com" <jgroves@...ron.com>
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <gregory.price@...verge.com> writes:

> On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@...verge.com> writes:
>>
>> Because we usually have multiple nodes in one mem-tier, I still think
>> mem-tier-based interface is simpler than node-based. But, it seems more
>> complex to introduce mem-tier into mempolicy. Especially if we have
>> per-task weights. So, I am fine to go with node-based interface.
>>
>> > * cgroups: "this doesn't involve dynamic resource accounting /
>> > enforcement at all" and "these aren't resource
>> > allocations, it's unclear what the hierarchical
>> > relationship mean".
>> >
>> > * node: too global, explore smaller scope first then expand.
>>
>> Why is it too global? I understand that it doesn't cover all possible
>> use cases (although I don't know whether these use cases are practical
>> or not). But it can provide a reasonable default per-node weight based
>> on available node performance information (such as, HMAT, CDAT, etc.).
>> And quite a few workloads can just use it. I think this is a useful
>> feature.
>>
>
> Have been sharing notes with more folks. Michal thinks a global set of
> weights is unintuitive and not useful, and would prefer to see the
> per-task weights first.
>
> Though this may have been in response to adding it as an attribute of
> nodes directly.
>
> Another proposal here suggested adding a new sysfs setting
> https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a
>
> $ tree /sys/kernel/mm/interleave_weight/
> /sys/kernel/mm/interleave_weight/
> ├── enabled [1]
> ├── possible [2]
> └── node
>     ├── node0
>     │   └── interleave_weight [3]
>     └── node1
>         └── interleave_weight [3]
>
> (this could be changed to /sys/kernel/mm/mempolicy/...)
>
> I think the internal representation of this can be simplified greatly,
> over what the patch provides now, but maybe this solves the "it doesn't
> belong in these other components" issue.
>
> Answer: Simply leave it as a static global kobject in mempolicy, which
> also deals with many of the issues regarding race conditions.

Personally, I would prefer to add interleave weight as an attribute of
nodes, but I understand that some people think it's not appropriate to
place anything node-specific there. So, some place under /sys/kernel/mm
sounds reasonable too.

> If a user provides weights, use those. If they do not, use globals.

Yes. That is the target use case.

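To make that fallback concrete, here is a minimal sketch in Python (not
kernel code). The names effective_weights, global_weights and
task_weights are hypothetical, and treating a node with no configured
weight as weight 1 is an assumption:

```python
def effective_weights(nodemask, global_weights, task_weights=None):
    """Return {node: weight} for every node in nodemask.

    Weights supplied by the task win when present; otherwise the
    system-wide defaults are used. A node with no entry in the chosen
    table gets weight 1 (an assumption, not something from the patch).
    """
    source = task_weights if task_weights is not None else global_weights
    return {node: source.get(node, 1) for node in sorted(nodemask)}


# Hypothetical system-wide defaults, keyed by node id.
global_weights = {0: 3, 1: 2, 2: 1}

# No task weights: the global defaults apply.
print(effective_weights({0, 1}, global_weights))
# Task weights given: they override the globals.
print(effective_weights({0, 1}, global_weights, {0: 1, 1: 5}))
```
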
> On a cpuset rebind event (container migration, mems_allowed changes),
> manually set weights would have to remain, so in a bad case, the
> weights would be very out of line with the real distribution of memory.
>
> Example: if your nodemask is (0,1,2) and a migration changes it to
> (3,4,5), then unfortunately your weights will likely revert to [1,1,1]
>
> If set with global weights, they could automatically adjust. It
> would not be perfect, but it would be better than the potential worst
> case above. If that same migration occurs, the next allocation would
> simply use whatever the target node weights are in the global config.
>
> So if globally you have weights [3,2,1,1,2,3], and you move from
> nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
> [1,2,3].

That is nice. And I prefer to emphasize the simple use case: users
don't always need to specify interleave weights. They can just use the
MPOL_WEIGHTED_INTERLEAVE policy, and the system will provide reasonable
default weights.

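The quoted example (global weights [3,2,1,1,2,3], nodemask moving from
(0,1,2) to (3,4,5)) can be sketched as follows. This is a Python
illustration only, not the kernel implementation. The names
GLOBAL_WEIGHTS, weights_for and interleave are hypothetical, and the
allocation order (each node taken "weight" times per round) is an
assumption. Weights are assumed to be at least 1.

```python
from itertools import islice

# Global per-node weights from the example; list index = node id.
GLOBAL_WEIGHTS = [3, 2, 1, 1, 2, 3]


def weights_for(nodemask, global_weights=GLOBAL_WEIGHTS):
    """Look up the global weight of each node in nodemask, in node order."""
    return [global_weights[node] for node in sorted(nodemask)]


def interleave(nodemask, global_weights=GLOBAL_WEIGHTS):
    """Yield target nodes forever, repeating each node 'weight' times
    per round (assumed ordering, for illustration only)."""
    nodes = sorted(nodemask)
    while True:
        for node in nodes:
            for _ in range(global_weights[node]):
                yield node


print(weights_for({0, 1, 2}))                    # [3, 2, 1]
print(weights_for({3, 4, 5}))                    # [1, 2, 3]
print(list(islice(interleave({3, 4, 5}), 6)))    # [3, 4, 4, 5, 5, 5]
```

Because the weights are looked up at allocation time rather than stored
per task, a nodemask change needs no weight update in this sketch.
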
> If the structure is built as a matrix of (cpu_node,mem_nodes),
> then you can also optimize based on the node the task is running on.

The matrix stuff makes the situation complex. If people do need
something like that, they can just use set_mempolicy2() with
user-specified weights. I still believe in "make simple stuff simple,
and complex stuff possible".

> That feels very intuitive, deals with many race condition issues, and
> the global setting can actually be implemented without the need for
> set_mempolicy2 at all - which is certainly a bonus.
>
> Would love more thoughts here. Will have a new RFC with set_mempolicy2,
> mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above.

Thanks for doing all of this!

--
Best Regards,
Huang, Ying