Date:   Wed, 8 Jun 2022 15:14:06 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Tim Chen <tim.c.chen@...ux.intel.com>
Cc:     linux-mm@...ck.org, Hao Wang <haowang3@...com>,
        Abhishek Dhanotia <abhishekd@...com>,
        "Huang, Ying" <ying.huang@...el.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Adam Manzanares <a.manzanares@...sung.com>,
        linux-kernel@...r.kernel.org, kernel-team@...com,
        Hasan Al Maruf <hasanalmaruf@...com>
Subject: Re: [PATCH] mm: mempolicy: N:M interleave policy for tiered memory nodes

Hi Tim,

On Wed, Jun 08, 2022 at 11:15:27AM -0700, Tim Chen wrote:
> On Tue, 2022-06-07 at 13:19 -0400, Johannes Weiner wrote:
> > 
> >  /* Do dynamic interleaving for a process */
> >  static unsigned interleave_nodes(struct mempolicy *policy)
> >  {
> >  	unsigned next;
> >  	struct task_struct *me = current;
> >  
> > -	next = next_node_in(me->il_prev, policy->nodes);
> > +	if (numa_tier_interleave[0] > 1 || numa_tier_interleave[1] > 1) {
> 
> When we have three memory tiers, do we expect an N:M:K policy?
> Like interleaving between DDR5, DDR4 and PMEM memory.
> Or we expect an N:M policy still by interleaving between two specific tiers?

In the context of the proposed 'explicit tiers' interface, I think it
would make sense to have a per-tier 'interleave_ratio' knob. Because
the ratio is configured based on hardware properties, it can be
configured meaningfully for the entire tier hierarchy, even if
individual tasks or vmas interleave over only a subset of nodes.
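Concretely, that could look something like this under the tiers sysfs
directory (paths and values invented for illustration; the interface
is still under discussion):

	/sys/devices/system/memtier/memtier0/interleave_weight    4
	/sys/devices/system/memtier/memtier1/interleave_weight    1

i.e. spread pages 4:1 between the DRAM and PMEM tiers to match their
relative bandwidth.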

> The other question is whether we will need multiple interleave policies depending
> on cgroup?
> One policy could be interleave between tier1, tier2, tier3.
> Another could be interleave between tier1 and tier2.

This is a good question.

One thing that has defined cgroup development in recent years is the
concept of "work conservation". Moving away from fixed limits and hard
partitioning, cgroups are increasingly configured with weights,
priorities, and guarantees (cpu.weight, io.latency/io.cost.qos,
memory.low). These weights and priorities are enforced when cgroups
are directly competing over a resource; but if there is no contention,
any active cgroup, regardless of priority, has full access to the
surplus (which could be the entire host if the main load is idle).

With that background, yes, we likely want some way of prioritizing
tier access when multiple cgroups are competing. But we ALSO want the
ability to say that if resources are NOT contended, a cgroup should
interleave memory over all tiers according to optimal bandwidth.

That means that regardless of what the competitive cgroup rules for
tier access end up looking like, it makes sense to have global
interleaving weights based on hardware properties as proposed here.
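To sketch how the allocator side might consume such weights (purely
hypothetical: node_to_tier(), tier_interleave_weight() and the
task's il_count field don't exist, they stand in for whatever the
explicit tiers interface ends up exporting):

	/* Run each node for its tier's weight before advancing */
	static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
	{
		struct task_struct *me = current;
		unsigned int node = me->il_prev;

		if (me->il_count < tier_interleave_weight(node_to_tier(node))) {
			/* Keep allocating from the current node */
			me->il_count++;
			return node;
		}

		/* Weight exhausted, move to the next allowed node */
		node = next_node_in(node, policy->nodes);
		me->il_prev = node;
		me->il_count = 1;
		return node;
	}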

The effective cgroup interleave ratio for each tier could then be
something like cgroup.tier_weight[tier] * tier/interleave_weight.
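For example, with purely illustrative numbers: say the DRAM tier has
interleave_weight=4 and the PMEM tier interleave_weight=1. An
unconstrained cgroup with tier_weight=100 for both tiers would get an
effective 400:100, i.e. the full hardware 4:1 split, while a
deprioritized cgroup with tier_weight=50 on the DRAM tier would land
at 200:100, i.e. 2:1, shifting more of its pages to PMEM.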
