Message-ID: <YgI06D8MbEpchooF@google.com>
Date: Tue, 8 Feb 2022 02:16:24 -0700
From: Yu Zhao <yuzhao@...gle.com>
To: Barry Song <21cnbao@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andi Kleen <ak@...ux.intel.com>,
Catalin Marinas <catalin.marinas@....com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
Jesse Barnes <jsbarnes@...gle.com>,
Johannes Weiner <hannes@...xchg.org>,
Jonathan Corbet <corbet@....net>,
Matthew Wilcox <willy@...radead.org>,
Mel Gorman <mgorman@...e.de>,
Michael Larabel <Michael@...haellarabel.com>,
Michal Hocko <mhocko@...nel.org>,
Rik van Riel <riel@...riel.com>,
Vlastimil Babka <vbabka@...e.cz>,
Will Deacon <will@...nel.org>,
Ying Huang <ying.huang@...el.com>,
LAK <linux-arm-kernel@...ts.infradead.org>,
Linux Doc Mailing List <linux-doc@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>, page-reclaim@...gle.com,
x86 <x86@...nel.org>
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework
On Fri, Jan 28, 2022 at 09:54:09PM +1300, Barry Song wrote:
> On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > <snipped>
> >
> > > > Large-scale deployments
> > > > -----------------------
> > > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > > about a million Android users. Google's fleetwide profiling [13] shows
> > > > an overall 40% decrease in kswapd CPU usage, in addition to
> > >
> > > Hi Yu,
> > >
> > > Was the overall 40% decrease of kswapd CPU usage seen on x86 or arm64?
> > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > > Does it help a lot in decreasing the cpu usage?
> >
> > Hi Barry,
> >
> > The fleet-wide profiling data I shared was from x86. For arm64, I only
> > have data from synthetic benchmarks at the moment, and it also shows
> > similar improvements.
> >
> > For Chrome OS (individual users), walk_pte_range(), the function that
> > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> > that helpful.
>
> Hi Yu,
> Thanks!
>
> In the current kernel, which depends on reverse mapping, the CPU usage of
> kswapd can be very high under memory pressure, especially when a lot of
> pages have a large mapcount and thus a huge reverse mapping cost.
Agreed. I've posted v7, which includes kswapd profiles collected from an
arm64 v8.2 laptop under memory pressure.
> Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?
No, it's from Snapdragon 7c. Please see the kswapd profiles in v7.
> In this case, we can skip many PTE scans when the PMD has no accessed bit
> set. But for a machine without NONLEAF, will the CPU usage figure be much
> larger?
So NONLEAF_PMD_YOUNG can save at most 4% of kswapd CPU time there. But
this can definitely vary, depending on the workload.
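
To make the skip discussed above concrete, here is a toy model in C. It is
only a sketch under stated assumptions, not the kernel's walk_pte_range():
the names toy_pmd, toy_touch, toy_walk and the 512-entry table size are
hypothetical and chosen for illustration. With a non-leaf accessed bit, the
walker can skip a whole PTE table whose PMD entry was not touched since the
previous pass; without it, every PTE has to be examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PTRS_PER_PTE 512 /* assumption: 4 KiB pages, 8-byte entries */

/* Hypothetical toy model of one PMD entry and the PTE table below it. */
struct toy_pmd {
	bool young;                   /* accessed bit on the non-leaf entry */
	bool pte_young[PTRS_PER_PTE]; /* accessed bits on the leaf entries */
};

/* Model a memory access: the leaf and the non-leaf bits both get set. */
static void toy_touch(struct toy_pmd *pmd, size_t idx)
{
	assert(idx < PTRS_PER_PTE);
	pmd->pte_young[idx] = true;
	pmd->young = true;
}

/*
 * Walk nr PMD entries, clearing accessed bits along the way.
 * Returns the number of young PTEs found; *scanned receives the number
 * of PTEs that had to be examined.
 */
static size_t toy_walk(struct toy_pmd *pmds, size_t nr, bool nonleaf_young,
		       size_t *scanned)
{
	size_t young = 0;

	*scanned = 0;
	for (size_t i = 0; i < nr; i++) {
		/* Nothing below this entry was accessed: skip its PTEs. */
		if (nonleaf_young && !pmds[i].young)
			continue;

		pmds[i].young = false;
		for (size_t j = 0; j < PTRS_PER_PTE; j++) {
			(*scanned)++;
			if (pmds[i].pte_young[j]) {
				pmds[i].pte_young[j] = false;
				young++;
			}
		}
	}
	return young;
}
```

In this model, a walk over four PMD entries of which two were touched
examines 2 * 512 PTEs with nonleaf_young set and 4 * 512 without it. How
much that matters in practice depends on how much of the total scanning
time is spent in the PTE walk in the first place.
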
> > > If so, this might be
> > > a good proof that arm64 also needs this hardware feature?
> > > In short, I am curious how much the improvement in this patchset depends
> > > on the hardware ability of NONLEAF_PMD_YOUNG.
> >
> > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> > In addition to cold/hot memory scanning, there are other use cases like
> > dirty tracking, which can benefit from the accessed bit on non-leaf
> > entries. I know some proprietary software uses this capability on x86
> > for different purposes than this patchset does. And AFAIK, x86 is the
> > only arch that supports this capability, e.g., risc-v and ppc can only
> > set the accessed bit in PTEs.
>
> Yep. NONLEAF is a nice feature.
>
> btw, page table should have a separate DIRTY bit, right?
Yes.
> wouldn't dirty page
> tracking depend on the DIRTY bit rather than the accessed bit?
It depends on the goal.
> so x86 also has
> NONLEAF dirty bit?
No.
> Or they are scanning accessed bit of PMD before
> scanning DIRTY bits of PTEs?
A mandatory sync to disk must use the dirty bit to ensure data
integrity. A voluntary sync to disk, however, can use the accessed
bit to narrow the search for dirty pages.

A mandatory sync is used to free specific dirty pages. A voluntary sync
is used to keep the number of dirty pages low in general, and it doesn't
target any specific dirty pages.
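
One possible reading of the distinction above, sketched as a toy model in
C. Everything here is an assumption made for illustration (toy_region,
toy_write, toy_sync_page, toy_sync_voluntary, and the 8-page region size
are hypothetical names and values); it is not how any existing kernel or
proprietary implementation works. The voluntary pass only visits regions
whose non-leaf accessed bit was set since the last pass, so it is cheap but
may leave older dirty pages behind. The mandatory path checks the dirty bit
of the specific page it was asked about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES_PER_REGION 8 /* assumption: kept small for readability */

/* Hypothetical toy model: one non-leaf entry covering a few pages. */
struct toy_region {
	bool accessed;                    /* non-leaf accessed bit */
	bool dirty[TOY_PAGES_PER_REGION]; /* per-page dirty bits */
};

/* Model a write: sets the page's dirty bit and the covering accessed bit. */
static void toy_write(struct toy_region *r, size_t page)
{
	assert(page < TOY_PAGES_PER_REGION);
	r->dirty[page] = true;
	r->accessed = true;
}

/*
 * Mandatory sync of one specific page: decided by the dirty bit alone.
 * Returns true if the page was dirty and has now been "written back".
 */
static bool toy_sync_page(struct toy_region *r, size_t page)
{
	assert(page < TOY_PAGES_PER_REGION);
	bool was_dirty = r->dirty[page];

	r->dirty[page] = false;
	return was_dirty;
}

/*
 * Voluntary sync: visit only regions accessed since the previous pass.
 * Returns the number of pages cleaned; *visited receives the number of
 * regions whose per-page dirty bits were actually scanned.
 */
static size_t toy_sync_voluntary(struct toy_region *regions, size_t nr,
				 size_t *visited)
{
	size_t cleaned = 0;

	*visited = 0;
	for (size_t i = 0; i < nr; i++) {
		if (!regions[i].accessed)
			continue; /* not touched recently: skip the region */

		regions[i].accessed = false;
		(*visited)++;
		for (size_t j = 0; j < TOY_PAGES_PER_REGION; j++) {
			if (regions[i].dirty[j]) {
				regions[i].dirty[j] = false;
				cleaned++;
			}
		}
	}
	return cleaned;
}
```

In this model, a dirty page inside a region whose accessed bit is already
clear is skipped by toy_sync_voluntary() but still caught by
toy_sync_page(), which is why the accessed bit alone cannot be relied on
for data integrity.
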
> > In fact, I've discussed this with one of the arm maintainers, Will. So
> > please check with him too if you are interested in moving forward with
> > the idea. I might be able to provide you with additional data if you
> > need it to make a decision.
>
> I am interested in running it and getting some data without NONLEAF,
> especially when free memory is very limited and the system is
> thrashing.
v7 has a switch to disable this feature on x86. If you can run your
workloads on x86, that switch might help you measure the difference.