linux-kernel - Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Sat, 11 Nov 2023 01:16:29 -1000
From:   "tj@...nel.org" <tj@...nel.org>
To:     Gregory Price <gregory.price@...verge.com>
Cc:     John Groves <john@...alactic.com>,
        Gregory Price <gourry.memverge@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "ying.huang@...el.com" <ying.huang@...el.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "mhocko@...nel.org" <mhocko@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
        "jgroves@...ron.com" <jgroves@...ron.com>
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

Hello,

On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@...nel.org wrote:
...
> I've been considering this as well, but there's more context here being
> lost.  It's not just about being able to toggle the policy of a single
> task, or related tasks, but actually in support of a more global data
> interleaving strategy that makes use of bandwidth more effectively as
> we begin to memory expansion and bandwidth expansion occur on the
> PCIE/CXL bus.
> 
> If the memory landscape of a system changes, for example due to a
> hotplug event, you actually want to change the behavior of *every* task
> that is using interleaving.  The fundamental bandwidth distribution of
> the entire system changed, so the behavior of every task using that
> memory should change with it.
> 
> We've explored adding weights to: mempolicy, memory tiers, nodes, memcg,
> and now additionally cpusets. In the last email, I'd asked whether it
> might actually be worth adding a new mpol component of cgroups to
> aggregate these issues, rather than jam them into either component.
> I would love your thoughts on that.

As for CXL and the changing memory landscape, I think some caution is
necessary as with any expected "future" technology changes. The recent
example with non-volatile memory isn't too far from CXL either. Note that
this is not to say that we shouldn't change anything until the hardware is
wildly popular but more that we need to be cognizant of the speculative
nature and the possibility of overbuilding for it.

I don't have a golden answer but here are general suggestions: Build
something which is small and/or useful even outside the context of the
expected hardware landscape changes. Enable the core feature which is
absolutely required in a minimal manner. Avoid being maximalist in feature
and convenience coverage.

Here, even if CXL actually becomes popular, how many are going to use memory
hotplug and need to dynamically rebalance memory in actively running
workloads? What's the scenario? Are there going to be an army of data center
technicians going around plugging and unplugging CXL devices depending on
system memory usage?

Maybe there are some cases this is actually useful but for those niche use
cases, isn't per-task interface with iteration enough? How often are these
hotplug events going to be?

> > > So one concrete use case: kubernetes might like change cpusets or move
> > > tasks from one cgroup to another, or a vm might be migrated from one set
> > > of nodes to enother (technically not mutually exclusive here).  Some
> > > memory policy settings (like weights) may no longer apply when this
> > > happens, so it would be preferable to have a way to change them.
> > 
> > Neither covers all use cases. As you noted in your mempolicy message, if the
> > application wants finer grained control, cgroup interface isn't great. In
> > general, any changes which are dynamically initiated by the application
> > itself isn't a great fit for cgroup.
> 
> It is certainly simple enough to add weights to mempolicy, but there
> are limitations.  In particular, mempolicy is extremely `current task`
> focused, and significant refactor work would need to be done to allow
> external tasks the ability to toggle a target task's mempolicy.
> 
> In particular I worry about the potential concurrency issues since
> mempolicy can be in the hot allocation path.

Changing mpol from outside the task is a feature which is inherently useful
regardless of CXL and I don't quite understand why hot path concurrency
issues would be different whether the configuration is coming from mempol or
cgroup but that could easily be me not being familiar with the involved
code.

...
> > 3. Cgroup can be convenient when group config change is necessary. However,
> >    we really don't want to keep adding kernel interface just for changing
> >    configs for a group of threads. For config changes which aren't high
> >    frequency, userspace iterating the member processes and applying the
> >    changes if possible is usually good enough which usually involves looping
> >    until no new process is found. If the looping is problematic, cgroup
> >    freezer can be used to atomically stop all member threads to provide
> >    atomicity too.
> > 
> 
> If I can ask, do you think it would be out of line to propose a major
> refactor to mempolicy to enable external task's the ability to change a
> running task's mempolicy *as well as* a cgroup-wide mempolicy component?

I don't think these group configurations fit cgroup filesystem interface
very well. As these aren't resource allocations, it's unclear what the
hierarchical relationship means. Besides, it feels awkard to be keep adding
duplicate interfaces where the modality changes completely based on the
operation scope.

There are ample examples where other subsystems use cgroup membership
information and while we haven't expanded that to syscalls yet, I don't see
why that'd be all that difference. So, maybe it'd make sense to have the new
mempolicy syscall take a cgroup ID as a target identifier too? ie. so that
the scope of the operation (e.g. task, process, cgroup) and the content of
the policy can stay orthogonal?

Thanks.

-- 
tejun