linux-kernel - Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87o7fveeze.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date:   Wed, 15 Nov 2023 13:56:53 +0800
From:   "Huang, Ying" <ying.huang@...el.com>
To:     Gregory Price <gregory.price@...verge.com>,
        Michal Hocko <mhocko@...e.com>
Cc:     "tj@...nel.org" <tj@...nel.org>, John Groves <john@...alactic.com>,
        Gregory Price <gourry.memverge@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
        "jgroves@...ron.com" <jgroves@...ron.com>
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Gregory Price <gregory.price@...verge.com> writes:

> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one. Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> > 
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading informaiton from it:
>> > 
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>> 
>> Right, our current cpuset implementation indeed requires callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottle neck though. If
>> anything cpusets lock optimizations would be win also for users who do
>> not want to use weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until cpusets lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications on the pull-model make me sad.
>
> Would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc) before I go mucking around
> further.
>
> So far we have:
> * mempolicy: updating weights is a very complicated undertaking,
>              and no (good) way to do this from outside the task.
> 	     would be better to have a coarser grained control.
>
>              New syscall is likely needed to add/set weights in the
> 	     per-task mempolicy, or bite the bullet on set_mempolicy2
> 	     and make the syscall extensible for the future.
>
> * memtiers: tier=node when devices are already interleaved or when all
>             devices are different, so why add yet another layer of
> 	    complexity if other constructs already exist.  Additionally,
> 	    you lose task-placement relative weighting (or it becomes
> 	    very complex to implement.

Because we usually have multiple nodes in one mem-tier, I still think
mem-tier-based interface is simpler than node-based.  But, it seems more
complex to introduce mem-tier into mempolicy.  Especially if we have
per-task weights.  So, I am fine to go with node-based interface.

> * cgroups: "this doesn't involve dynamic resource accounting /
>             enforcement at all" and "these aren't resource
> 	    allocations, it's unclear what the hierarchical
> 	    relationship mean".
>
> * node: too global, explore smaller scope first then expand.

Why is it too global?  I understand that it doesn't cover all possible
use cases (although I don't know whether these use cases are practical
or not).  But it can provide a reasonable default per-node weight based
on available node performance information (such as, HMAT, CDAT, etc.).
And, quite some workloads can just use it.  I think this is an useful
feature.

> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so i'll
> go ahead and put up another RFC for some options on that in the next
> week or so.
>
> The limitations on the first pass will be that only the task is capable
> of re-weighting should cpusets.mems or the nodemask change.

--
Best Regards,
Huang, Ying