Message-ID: <YqH74WaUzJlb+smt@cmpxchg.org>
Date: Thu, 9 Jun 2022 09:55:45 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Aneesh Kumar K V <aneesh.kumar@...ux.ibm.com>
Cc: linux-mm@...ck.org, akpm@...ux-foundation.org,
Wei Xu <weixugc@...gle.com>, Huang Ying <ying.huang@...el.com>,
Greg Thelen <gthelen@...gle.com>,
Yang Shi <shy828301@...il.com>,
Davidlohr Bueso <dave@...olabs.net>,
Tim C Chen <tim.c.chen@...el.com>,
Brice Goglin <brice.goglin@...il.com>,
Michal Hocko <mhocko@...nel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Hesham Almatary <hesham.almatary@...wei.com>,
Dave Hansen <dave.hansen@...el.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Alistair Popple <apopple@...dia.com>,
Dan Williams <dan.j.williams@...el.com>,
Feng Tang <feng.tang@...el.com>,
Jagdish Gediya <jvgediya@...ux.ibm.com>,
Baolin Wang <baolin.wang@...ux.alibaba.com>,
David Rientjes <rientjes@...gle.com>
Subject: Re: [PATCH v5 1/9] mm/demotion: Add support for explicit memory tiers
On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> On 6/8/22 11:46 PM, Johannes Weiner wrote:
> > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> > > On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > > > Hello,
> > > >
> > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > > > @@ -0,0 +1,20 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > +
> > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > +
> > > > > > +#define MEMORY_TIER_HBM_GPU 0
> > > > > > +#define MEMORY_TIER_DRAM 1
> > > > > > +#define MEMORY_TIER_PMEM 2
> > > > > > +
> > > > > > +#define MEMORY_RANK_HBM_GPU 300
> > > > > > +#define MEMORY_RANK_DRAM 200
> > > > > > +#define MEMORY_RANK_PMEM 100
> > > > > > +
> > > > > > +#define DEFAULT_MEMORY_TIER MEMORY_TIER_DRAM
> > > > > > +#define MAX_MEMORY_TIERS 3
> > > > >
> > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > >
> > > > > But starting out with a packed ID space can get quite awkward for
> > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > DRAM, so new tiers seem inevitable already.
> > > > >
> > > > > It could make sense to start with a bigger address space and spread
> > > > > out the list of kernel default tiers a bit within it:
> > > > >
> > > > > MEMORY_TIER_GPU 0
> > > > > MEMORY_TIER_DRAM 10
> > > > > MEMORY_TIER_PMEM 20
> > > >
> > > > Forgive me if I'm asking a question that has been answered. I went
> > > > back to earlier threads and couldn't work it out - maybe there were
> > > > some off-list discussions? Anyway...
> > > >
> > > > Why is there a distinction between tier ID and rank? I understand that
> > > > rank was added because tier IDs were too few. But if rank determines
> > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > based on known hardware, with plenty of headroom to be future proof.
> > > >
> > > > $ ls tiers
> > > > 100 # DEFAULT_TIER
> > > > $ cat tiers/100/nodelist
> > > > 0-1 # conventional numa nodes
> > > >
> > > > <pmem is onlined>
> > > >
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1 # conventional numa
> > > > tiers/200/nodelist:2 # pmem
> > > >
> > > > $ grep . nodes/*/tier
> > > > nodes/0/tier:100
> > > > nodes/1/tier:100
> > > > nodes/2/tier:200
> > > >
> > > > <unknown device is online as node 3, defaults to 100>
> > > >
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1,3
> > > > tiers/200/nodelist:2
> > > >
> > > > $ echo 300 >nodes/3/tier
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1
> > > > tiers/200/nodelist:2
> > > > tiers/300/nodelist:3
> > > >
> > > > $ echo 200 >nodes/3/tier
> > > > $ grep . tiers/*/nodelist
> > > > tiers/100/nodelist:0-1
> > > > tiers/200/nodelist:2-3
> > > >
> > > > etc.
> > >
> > > The tier ID is also used as the device id, memtier.dev.id. It was discussed
> > > that we would need the ability to change the rank value of a memory tier. If
> > > we make the rank value the same as the tier ID or the tier device id, we will
> > > not be able to support that.
> >
> > Is the idea that you could change the rank of a collection of nodes in
> > one go? Rather than moving the nodes one by one into a new tier?
> >
> > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> > patches in RFC4 already had the struct device { .id = tier }
> > logic. Could you point me to it? In general it would be really
> > helpful to maintain summarized rationales for such decisions in the
> > coverletter to make sure things don't get lost over many, many
> > threads, conferences, and video calls. ]
>
> Most of the discussion did not happen in the patch review email threads.
>
> RFC: Memory Tiering Kernel Interfaces (v2)
> https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
>
> RFC: Memory Tiering Kernel Interfaces (v4)
> https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

I read the RFCs, the discussions, and your code. It's still not clear
to me why the tier/device ID and the rank need to be two separate,
user-visible things. There is only one tier of a given rank, so why
can't the rank itself be the unique device id? dev->id = 100. One
number. Or use a unique device id allocator if large numbers are
causing problems internally. But I don't see an explanation for why
they need to be two different things, let alone two different things
in the user ABI.
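
To make this concrete, here is a rough sketch of what I have in mind.
Purely illustrative: struct memory_tier, memtier_subsys and
memtier_release() are made-up placeholder names, not anything from
this series, and the bus is assumed to be registered elsewhere (e.g.
via subsys_system_register()):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/nodemask.h>
#include <linux/slab.h>

/* Sketch only: none of these names are from the series. */
struct memory_tier {
	struct device dev;
	int rank;			/* doubles as the device id */
	nodemask_t nodelist;
};

static void memtier_release(struct device *dev)
{
	kfree(container_of(dev, struct memory_tier, dev));
}

static struct bus_type memtier_subsys = {
	.name		= "memtier",
	.dev_name	= "memtier",	/* devices appear as memtier<id> */
};

static struct memory_tier *memtier_create(int rank)
{
	struct memory_tier *tier = kzalloc(sizeof(*tier), GFP_KERNEL);
	int err;

	if (!tier)
		return ERR_PTR(-ENOMEM);

	tier->rank = rank;
	tier->dev.id = rank;	/* one number: ordering key == device id */
	tier->dev.bus = &memtier_subsys;
	tier->dev.release = memtier_release;

	err = device_register(&tier->dev);
	if (err) {
		put_device(&tier->dev);
		return ERR_PTR(err);
	}
	return tier;
}

If exposing raw rank values as ids is what causes trouble internally,
the same structure works with tier->dev.id coming from ida_alloc()
instead and rank kept as a plain attribute - but at that point the id
doesn't need to be user-visible at all.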