Date:   Sat, 11 Nov 2023 15:54:55 -0800
From:   Dan Williams <dan.j.williams@...el.com>
To:     "tj@...nel.org" <tj@...nel.org>,
        Gregory Price <gregory.price@...verge.com>
CC:     John Groves <john@...alactic.com>,
        Gregory Price <gourry.memverge@...il.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-cxl@...r.kernel.org" <linux-cxl@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "ying.huang@...el.com" <ying.huang@...el.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "mhocko@...nel.org" <mhocko@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
        "jgroves@...ron.com" <jgroves@...ron.com>
Subject: Re: [RFC PATCH v4 0/3] memcg weighted interleave mempolicy control

tj@...nel.org wrote:
> Hello,
> 
> On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@...nel.org wrote:
> ...
> > I've been considering this as well, but there's more context here being
> > lost.  It's not just about being able to toggle the policy of a single
> > task, or of related tasks, but about supporting a more global data
> > interleaving strategy that makes use of bandwidth more effectively as
> > memory expansion and bandwidth expansion begin to occur on the
> > PCIe/CXL bus.
> > 
> > If the memory landscape of a system changes, for example due to a
> > hotplug event, you actually want to change the behavior of *every* task
> > that is using interleaving.  The fundamental bandwidth distribution of
> > the entire system changed, so the behavior of every task using that
> > memory should change with it.
> > 
> > We've explored adding weights to: mempolicy, memory tiers, nodes, memcg,
> > and now additionally cpusets. In the last email, I'd asked whether it
> > might actually be worth adding a new mpol component of cgroups to
> > aggregate these issues, rather than jam them into any one of the
> > existing components.
> > I would love your thoughts on that.
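
(As an aside, the per-task nature of the current interface is what makes
the global rebalancing problem above hard. A task opts into interleave
today with roughly the following user-space sketch -- existing
set_mempolicy()/MPOL_INTERLEAVE interface, made-up node numbers, link
with -lnuma -- and the mask is fixed for that task at call time:

  #include <numaif.h>     /* set_mempolicy(), MPOL_INTERLEAVE */
  #include <stdio.h>

  int main(void)
  {
          /* interleave this task's future allocations across nodes 0 and 1 */
          unsigned long nodemask = (1UL << 0) | (1UL << 1);

          if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                            8 * sizeof(nodemask))) {
                  perror("set_mempolicy");
                  return 1;
          }

          /* allocations after this point round-robin across the nodes in
           * the mask; nothing revisits this choice if the system's
           * bandwidth topology changes later */
          return 0;
  }

There is no obvious system-wide knob to rebalance all such tasks when the
bandwidth picture changes, which is the gap these proposals are trying to
fill.)
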
> 
> As for CXL and the changing memory landscape, I think some caution is
> necessary, as with any expected "future" technology change. The recent
> example with non-volatile memory isn't too far from CXL either. Note that
> this is not to say that we shouldn't change anything until the hardware is
> wildly popular, but rather that we need to be cognizant of the speculative
> nature and the possibility of overbuilding for it.
> 
> I don't have a golden answer, but here are general suggestions: Build
> something which is small and/or useful even outside the context of the
> expected hardware landscape changes. Enable the core feature which is
> absolutely required in a minimal manner. Avoid being maximalist in feature
> and convenience coverage.

If I had to state the golden rule of kernel enabling, this paragraph
comes close to being it.

> Here, even if CXL actually becomes popular, how many are going to use memory
> hotplug and need to dynamically rebalance memory in actively running
> workloads? What's the scenario? Is there going to be an army of data center
> technicians going around plugging and unplugging CXL devices depending on
> system memory usage?

While I have personal skepticism that all of the infrastructure in the
CXL specification is going to become popular, one mechanism that seems
poised to cross that threshold is "dynamic capacity". So it is not the
case that techs are running around hot-adjusting physical memory. A host
will have a cable hop to a shared memory pool in the rack, from which
capacity can be dynamically provisioned across hosts.

However, even then, the bounds of what is dynamic are going to be
constrained to a fixed address space with likely predictable performance
characteristics for that address range. That potentially makes a
system-wide memory interleave policy viable. That might be the place to
start; it mirrors, at a coarser granularity, what hardware interleaving
can do.
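
To make "coarser granularity" concrete, the weighting itself is simple
arithmetic: weights proportional to per-node bandwidth, reduced to small
integers. A toy sketch, with bandwidth numbers that are made up rather
than taken from any real platform:

  #include <stdio.h>

  /* reduce the bandwidth figures to the smallest equivalent weights */
  static unsigned int gcd(unsigned int a, unsigned int b)
  {
          while (b) {
                  unsigned int t = a % b;
                  a = b;
                  b = t;
          }
          return a;
  }

  int main(void)
  {
          /* hypothetical sustained bandwidth per node in GB/s:
           * two DRAM nodes and one CXL expander */
          unsigned int bw[] = { 256, 256, 64 };
          unsigned int n = sizeof(bw) / sizeof(bw[0]);
          unsigned int g = bw[0];

          for (unsigned int i = 1; i < n; i++)
                  g = gcd(g, bw[i]);

          /* prints weights 4:4:1 -> of every 9 interleaved pages, 4 land
           * on each DRAM node and 1 on the CXL node, matching the
           * bandwidth ratio */
          for (unsigned int i = 0; i < n; i++)
                  printf("node %u: weight %u\n", i, bw[i] / g);

          return 0;
  }

Whether such a weight table lives system-wide, in a cgroup, or in the
mempolicy itself is the open question in this thread.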

[..]
