Message-ID: <CAJj2-QGtvvrhjH_h1wL3FCg4HgZU27rqxSCDZzPws81yPK_DvQ@mail.gmail.com>
Date: Mon, 26 Aug 2024 16:43:01 -0700
From: Yuanchu Xie <yuanchu@...gle.com>
To: gourry@...rry.net
Cc: David Hildenbrand <david@...hat.com>, "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Khalid Aziz <khalid.aziz@...cle.com>, Henry Huang <henry.hj@...group.com>,
Yu Zhao <yuzhao@...gle.com>, Dan Williams <dan.j.williams@...el.com>,
Gregory Price <gregory.price@...verge.com>, Huang Ying <ying.huang@...el.com>,
Andrew Morton <akpm@...ux-foundation.org>, Lance Yang <ioworker0@...il.com>,
Randy Dunlap <rdunlap@...radead.org>, Muhammad Usama Anjum <usama.anjum@...labora.com>,
Kalesh Singh <kaleshsingh@...gle.com>, Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
"Rafael J. Wysocki" <rafael@...nel.org>, Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>, Muchun Song <muchun.song@...ux.dev>,
Shuah Khan <shuah@...nel.org>, Yosry Ahmed <yosryahmed@...gle.com>,
Matthew Wilcox <willy@...radead.org>, Sudarshan Rajagopalan <quic_sudaraja@...cinc.com>,
Kairui Song <kasong@...cent.com>, "Michael S. Tsirkin" <mst@...hat.com>,
Vasily Averin <vasily.averin@...ux.dev>, Nhat Pham <nphamcs@...il.com>,
Miaohe Lin <linmiaohe@...wei.com>, Qi Zheng <zhengqi.arch@...edance.com>,
Abel Wu <wuyun.abel@...edance.com>, "Vishal Moola (Oracle)" <vishal.moola@...il.com>,
Kefeng Wang <wangkefeng.wang@...wei.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, cgroups@...r.kernel.org, linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v3 0/7] mm: workingset reporting
On Tue, Aug 20, 2024 at 6:00 AM Gregory Price <gourry@...rry.net> wrote:
>
> On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace. IMO, the
> > kernel should provide a set of workingset interfaces that should be
> > generic enough to accommodate the various use cases, and be extensible
> > to potential future use cases. The current proposed interfaces are not
> > sufficient in that regard, but I would like to start somewhere, solicit
> > feedback, and iterate.
> >
> ... snip ...
> > Use cases
> > ==========
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, suppose the promotion hot page threshold is defined as a
> > reaccess distance of N seconds (promote pages accessed more often than
> > every N seconds). The threshold N should be set so that ~80% (e.g.) of
> > pages on the fast memory node pass the threshold. This calculation can
> > be done with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> >
>
> Just as a note on this work, this is really a testing interface. The
> end-goal is not to merge such an interface that is user-facing like
> move_phys_pages, but instead to have something like a triggered kernel
> task that has a directive of "Promote X pages from Device A".
>
> This work is more of an open collaboration for prototyping such that we
> don't have to plumb it through the kernel from the start and assess the
> usefulness of the hardware hotness collection mechanism.
Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.
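
For reference, the threshold calculation from the cover letter can be
sketched against the report format quoted above. This is only a rough
illustration, not code from the series. It assumes the first column is
the upper bound of an idle-age bin in ms and that anon and file can
simply be summed; parse_report and threshold_for_fraction are
hypothetical names.

```python
import re

# Sample report copied from the cover letter.
REPORT = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

LINE_RE = re.compile(r"^(\d+)\s+anon=(\d+)\s+file=(\d+)$")


def parse_report(text):
    """Return a list of (upper_bound_ms, anon + file) tuples, in file order."""
    bins = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognized report line: {line!r}")
        upper_ms, anon, file_ = (int(g) for g in match.groups())
        bins.append((upper_ms, anon + file_))
    return bins


def threshold_for_fraction(bins, fraction):
    """Smallest bin boundary (ms) below which at least `fraction` of the
    reported memory falls. Assumes bins are sorted by boundary."""
    total = sum(size for _, size in bins)
    if total == 0:
        return None
    running = 0
    for upper_ms, size in bins:
        running += size
        if running >= fraction * total:
            return upper_ms
    return bins[-1][0]


if __name__ == "__main__":
    bins = parse_report(REPORT)
    print(threshold_for_fraction(bins, 0.8))
```

With the sample numbers, only about 20% of the reported memory falls
below the 40000 ms boundary, so an 80% target is first met in the
open-ended last bin. A real policy would likely need finer-grained
intervals to pick a meaningful N.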
>
> ---
>
> More generally on promotion, I have been considering recently a problem
> with promoting unmapped pagecache pages - since they are not subject to
> NUMA hint faults. I started looking at PG_accessed and PG_workingset as
> a potential mechanism to trigger promotion - but I'm starting to see a
> pattern of competing priorities between reclaim (LRU/MGLRU) logic and
> promotion logic.
In this case, IMO hardware support would be good as it could provide
the kernel with exactly which pages are hot, and it would not care
whether a page is mapped or not. I recall there being some CXL
proposal on this, but I'm not sure whether it has settled into a
standard yet.
>
> Reclaim is triggered largely under memory pressure - which means co-opting
> reclaim logic for promotion is at best logically confusing, and at worst
> likely to introduce regressions. The LRU/MGLRU logic is written largely
> for reclaim, not promotion. This makes hacking promotion in after the
> fact rather dubious - the design choices don't match.
>
> One example: if a page moves from inactive->active (or old->young), we
> could treat this as a page "becoming hot" and mark it for promotion, but
> this potentially punishes pages on the "active/younger" lists which are
> themselves hotter.
To avoid punishing pages on the "young" list, one could insert the
page into a "less young" generation, but it would be difficult to have
a fixed policy for this in the kernel, so it may be best for this to
be configurable via BPF. One could insert the page in the middle of
the active/inactive list, but that would in effect create multiple
generations.
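
As a toy model of what I mean (plain Python, not MGLRU code and not
any kernel API), generations can be pictured as an ordered list of
queues, with a pluggable policy choosing where a promoted page lands.
ToyGenerations and less_young are hypothetical names, and the policy
callable merely stands in for something configurable such as a BPF
program.

```python
from collections import deque


class ToyGenerations:
    """Generations indexed 0 (oldest) to n_gens - 1 (youngest)."""

    def __init__(self, n_gens):
        if n_gens < 1:
            raise ValueError("need at least one generation")
        self.gens = [deque() for _ in range(n_gens)]

    def insert(self, page, gen):
        # Clamp so a policy cannot index outside the existing generations.
        gen = max(0, min(gen, len(self.gens) - 1))
        self.gens[gen].append(page)
        return gen

    def insert_promoted(self, page, policy):
        # policy(n_gens) -> generation index for a freshly promoted page.
        return self.insert(page, policy(len(self.gens)))

    def eviction_order(self):
        # Oldest generation first, i.e. the order reclaim would consider pages.
        return [page for gen in self.gens for page in gen]


def less_young(n_gens):
    """Example policy: one generation older than the youngest."""
    return max(0, n_gens - 2)


if __name__ == "__main__":
    lru = ToyGenerations(4)
    lru.insert("already-hot", 3)
    lru.insert("cold", 0)
    lru.insert_promoted("just-promoted", less_young)
    print(lru.eviction_order())
```

The promoted page ends up ahead of cold pages but behind pages that
were already in the youngest generation, so the existing hot pages are
not pushed toward reclaim.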
>
> I'm starting to think separate demotion/reclaim and promotion components
> are warranted. This could take the form of a separate kernel worker that
> occasionally gets scheduled to manage a promotion list, or even the
> addition of a PG_promote flag to decouple reclaim and promotion logic
> completely. Separating the structures entirely would be good to allow
> both demotion/reclaim and promotion to occur concurrently (although this
> seems problematic under memory pressure).
>
> Would like to know your thoughts here. If we can decide to segregate
> promotion and demotion logic, it might go a long way to simplify the
> existing interfaces and formalize transactions between the two.
The two systems still have to interact, so separating the two would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?
>
> (also if you're going to LPC, might be worth a chat in person)
I cannot make it to LPC. :( Sadness
Yuanchu