Message-ID: <YqH74WaUzJlb+smt@cmpxchg.org>
Date: Thu, 9 Jun 2022 09:55:45 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
Wei Xu <weixugc@...gle.com>, Huang Ying <ying.huang@...el.com>,
Greg Thelen <gthelen@...gle.com>,
Yang Shi <shy828301@...il.com>,
Davidlohr Bueso <dave@...olabs.net>,
Tim C Chen <tim.c.chen@...el.com>,
Brice Goglin <brice.goglin@...il.com>,
Michal Hocko <mhocko@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Hesham Almatary <hesham.almatary@...wei.com>,
Dave Hansen <dave.hansen@...el.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Alistair Popple <apopple@...dia.com>,
Dan Williams <dan.j.williams@...el.com>,
Feng Tang <feng.tang@...el.com>,
Jagdish Gediya <jvgediya@...ux.ibm.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
David Rientjes <rientjes@...gle.com>
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> On 6/8/22 11:46 PM, Johannes Weiner wrote:
> > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> > > On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > > > Hello,
> > > >
> > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > > > @@ -0,0 +1,20 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > +
> > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > +
> > > > > > +#define MEMORY_TIER_HBM_GPU 0
> > > > > > +#define MEMORY_TIER_DRAM 1
> > > > > > +#define MEMORY_TIER_PMEM 2
> > > > > > +
> > > > > > +#define MEMORY_RANK_HBM_GPU 300
> > > > > > +#define MEMORY_RANK_DRAM 200
> > > > > > +#define MEMORY_RANK_PMEM 100
> > > > > > +
> > > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM
> > > > > > +#define MAX_MEMORY_TIERS 3
> > > > >
> > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > >
> > > > > But starting out with a packed ID space can get quite awkward for
> > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > DRAM, so new tiers seem inevitable already.
> > > > >
> > > > > It could make sense to start with a bigger address space and spread
> > > > > out the list of kernel default tiers a bit within it:
> > > > >
> > > > > MEMORY_TIER_GPU 0
> > > > > MEMORY_TIER_DRAM 10
> > > > > MEMORY_TIER_PMEM 20
> > > >
> > > > Forgive me if I'm asking a question that has been answered. I went
> > > > back to earlier threads and couldn't work it out - maybe there were
> > > > some off-list discussions? Anyway...
> > > >
> > > > Why is there a distinction between tier ID and rank? I understand that
> > > > rank was added because tier IDs were too few. But if rank determines
> > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > based on known hardware, with plenty of headroom to be future proof.
> > > >
> > > > $ ls tiers
> > > > 100 # DEFAULT_TIER
> > > > $ cat tiers/100/nodelist
> > > > 0-1 # conventional numa nodes
> > > >
> > > > <pmem is onlined>
> > > >
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1 # conventional numa
> > > > tiers/200/nodelist:2 # pmem
> > > >
> > > > $ grep . nodes/*/tier
> > > > nodes/0/tier:100
> > > > nodes/1/tier:100
> > > > nodes/2/tier:200
> > > >
> > > > <unknown device is online as node 3, defaults to 100>
> > > >
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1,3
> > > > tiers/200/nodelist:2
> > > >
> > > > $ echo 300 >nodes/3/tier
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1
> > > > tiers/200/nodelist:2
> > > > tiers/300/nodelist:3
> > > >
> > > > $ echo 200 >nodes/3/tier
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1
> > > > tiers/200/nodelist:2-3
> > > >
> > > > etc.
> > >
> > > The tier ID is also used as the device id, memtier.dev.id. It was discussed
> > > that we would need the ability to change the rank value of a memory tier. If
> > > we make the rank value the same as the tier ID or the tier device id, we will
> > > not be able to support that.
> >
> > Is the idea that you could change the rank of a collection of nodes in
> > one go? Rather than moving the nodes one by one into a new tier?
> >
> > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> > patches in RFC4 already had the struct device { .id = tier }
> > logic. Could you point me to it? In general it would be really
> > helpful to maintain summarized rationales for such decisions in the
> > coverletter to make sure things don't get lost over many, many
> > threads, conferences, and video calls. ]
>
> Most of the discussion did not happen in the patch review email threads.
>
> RFC: Memory Tiering Kernel Interfaces (v2)
> https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
>
> RFC: Memory Tiering Kernel Interfaces (v4)
> https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

I read the RFCs, the discussions, and your code. It's still not clear
to me why the tier/device ID and the rank need to be two separate,
user-visible things. There is only one tier of a given rank, so why
can't the rank itself be the unique device id? dev->id = 100. One
number. Or use a unique device id allocator if large numbers are
causing problems internally. But I don't see an explanation for why
they need to be two different things, let alone two different things
in the user ABI.
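
To make this concrete, here is a rough sketch of what I have in mind.
Purely illustrative: struct memory_tier, memtier_subsys and
memtier_release() are made-up placeholder names, not anything from
this series, and the bus is assumed to be registered elsewhere (e.g.
via subsys_system_register()):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/nodemask.h>
#include <linux/slab.h>

/* Sketch only: none of these names are from the series. */
struct memory_tier {
	struct device dev;
	int rank;			/* doubles as the device id */
	nodemask_t nodelist;
};

static void memtier_release(struct device *dev)
{
	kfree(container_of(dev, struct memory_tier, dev));
}

static struct bus_type memtier_subsys = {
	.name		= "memtier",
	.dev_name	= "memtier",	/* devices appear as memtier<id> */
};

static struct memory_tier *memtier_create(int rank)
{
	struct memory_tier *tier = kzalloc(sizeof(*tier), GFP_KERNEL);
	int err;

	if (!tier)
		return ERR_PTR(-ENOMEM);

	tier->rank = rank;
	tier->dev.id = rank;	/* one number: ordering key == device id */
	tier->dev.bus = &memtier_subsys;
	tier->dev.release = memtier_release;

	err = device_register(&tier->dev);
	if (err) {
		put_device(&tier->dev);
		return ERR_PTR(err);
	}
	return tier;
}

If exposing raw rank values as ids is what causes trouble internally,
the same structure works with tier->dev.id coming from ida_alloc()
instead and rank kept as a plain attribute - but at that point the id
doesn't need to be user-visible at all.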