linux-kernel - Re: [PATCH v6 0/9] Multigenerational LRU Framework

Message-ID: <YgI06D8MbEpchooF@google.com>
Date:   Tue, 8 Feb 2022 02:16:24 -0700
From:   Yu Zhao <yuzhao@...gle.com>
To:     Barry Song <21cnbao@...il.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Jesse Barnes <jsbarnes@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Michael Larabel <Michael@...haellarabel.com>,
        Michal Hocko <mhocko@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Ying Huang <ying.huang@...el.com>,
        LAK <linux-arm-kernel@...ts.infradead.org>,
        Linux Doc Mailing List <linux-doc@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>, page-reclaim@...gle.com,
        x86 <x86@...nel.org>
Subject: Re: [PATCH v6 0/9] Multigenerational LRU Framework

On Fri, Jan 28, 2022 at 09:54:09PM +1300, Barry Song wrote:
> On Tue, Jan 25, 2022 at 7:48 PM Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > On Sun, Jan 23, 2022 at 06:43:06PM +1300, Barry Song wrote:
> > > On Wed, Jan 5, 2022 at 7:17 PM Yu Zhao <yuzhao@...gle.com> wrote:
> >
> > <snipped>
> >
> > > > Large-scale deployments
> > > > -----------------------
> > > > We've rolled out MGLRU to tens of millions of Chrome OS users and
> > > > about a million Android users. Google's fleetwide profiling [13] shows
> > > > an overall 40% decrease in kswapd CPU usage, in addition to
> > >
> > > Hi Yu,
> > >
> > > Was the overall 40% decrease of kswapd CPU usage seen on x86 or arm64?
> > > And I am curious how much we are taking advantage of NONLEAF_PMD_YOUNG.
> > > Does it help a lot in decreasing the cpu usage?
> >
> > Hi Barry,
> >
> > The fleet-wide profiling data I shared was from x86. For arm64, I only
> > have data from synthetic benchmarks at the moment, and it also shows
> > similar improvements.
> >
> > For Chrome OS (individual users), walk_pte_range(), the function that
> > would benefit from ARCH_HAS_NONLEAF_PMD_YOUNG, only uses a small
> > portion (<4%) of kswapd CPU time. So ARCH_HAS_NONLEAF_PMD_YOUNG isn't
> > that helpful.
> 
> Hi Yu,
> Thanks!
> 
> In the current kernel, which depends on reverse mapping, the CPU usage of
> kswapd can be very high under memory pressure, especially when a lot of
> pages have a large mapcount and thus a huge reverse mapping cost.

Agreed. I've posted v7 which includes kswapd profiles collected from an
arm64 v8.2 laptop under memory pressure.

> Regarding <4%, I guess the figure came from machines with NONLEAF_PMD_YOUNG?

No, it's from Snapdragon 7c. Please see the kswapd profiles in v7.

> In this case, we can skip many PTE scans while PMD has no accessed bit
> set. But for a machine without NONLEAF, will the figure of cpu usage be
> much larger?

So NONLEAF_PMD_YOUNG can save at most 4% of kswapd CPU usage. But
this can definitely vary, depending on the workload.
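
To make the bound concrete, here is a rough back-of-the-envelope sketch.
It assumes the saving is simply the share of kswapd time spent in
walk_pte_range() multiplied by the share of PTE scans that the non-leaf
accessed bit lets us skip. The function and parameter names are made up
for illustration and do not exist in the patchset.

```python
def kswapd_time_saved(walk_fraction: float, skipped_fraction: float) -> float:
    """Estimate the fraction of kswapd CPU time saved by skipping PTE scans.

    walk_fraction:    share of kswapd CPU time spent in walk_pte_range()
                      (reported above as <4%, i.e. below 0.04).
    skipped_fraction: hypothetical share of PTE scans that can be skipped
                      because the non-leaf (PMD) accessed bit is clear.
    """
    for name, value in (("walk_fraction", walk_fraction),
                        ("skipped_fraction", skipped_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Only time spent in the PTE walk can be saved, so the result never
    # exceeds walk_fraction, even if every PTE scan is skipped.
    return walk_fraction * skipped_fraction


if __name__ == "__main__":
    # Best case: every PTE scan is skipped, so the saving equals the 4% cap.
    print(kswapd_time_saved(0.04, 1.0))
    # A partial skip rate (an arbitrary example value) saves less.
    print(kswapd_time_saved(0.04, 0.25))
```

Even with every PTE scan skipped, the saving in this model cannot
exceed the walk_pte_range() share itself.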

> > > If so, this might be
> > > a good proof that arm64 also needs this hardware feature?
> > > In short, I am curious how much the improvement in this patchset depends
> > > on the hardware ability of NONLEAF_PMD_YOUNG.
> >
> > For data centers, I do think ARCH_HAS_NONLEAF_PMD_YOUNG has some value.
> > In addition to cold/hot memory scanning, there are other use cases like
> > dirty tracking, which can benefit from the accessed bit on non-leaf
> > entries. I know some proprietary software uses this capability on x86
> > for different purposes than this patchset does. And AFAIK, x86 is the
> > only arch that supports this capability, e.g., risc-v and ppc can only
> > set the accessed bit in PTEs.
> 
> Yep. NONLEAF is a nice feature.
> 
> btw, page table should have a separate DIRTY bit, right?

Yes.

> wouldn't dirty page
> tracking depend on the DIRTY bit rather than the accessed bit?

It depends on the goal.

> so x86 also has
> NONLEAF dirty bit?

No.

> Or they are scanning accessed bit of PMD before
> scanning DIRTY bits of PTEs?

A mandatory sync to disk must use the dirty bit to ensure data
integrity. But for a voluntary sync to disk, it can use the accessed
bit to narrow the search of dirty pages.

A mandatory sync is used to free specific dirty pages. A voluntary sync
is used to keep the number of dirty pages low in general and it doesn't
target any specific dirty pages.
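
The difference between the two kinds of sync can be sketched with a toy
model. This is not kernel code: Pte, Pmd, write(), mandatory_scan() and
voluntary_scan() are hypothetical names, and the model assumes that a
write sets the dirty bit on the PTE and the accessed bit on the PMD
above it. The mandatory scan checks every dirty bit; the voluntary scan
uses the PMD accessed bit as a hint and skips ranges where it is clear.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Pte:
    dirty: bool = False


@dataclass
class Pmd:
    accessed: bool = False
    ptes: List[Pte] = field(default_factory=list)


def make_table(n_pmds: int, ptes_per_pmd: int) -> List[Pmd]:
    """Build a toy two-level table with all bits clear."""
    return [Pmd(ptes=[Pte() for _ in range(ptes_per_pmd)])
            for _ in range(n_pmds)]


def write(table: List[Pmd], pmd_idx: int, pte_idx: int) -> None:
    """Model a store: dirty the PTE and mark the PMD above it accessed."""
    table[pmd_idx].accessed = True
    table[pmd_idx].ptes[pte_idx].dirty = True


def mandatory_scan(table: List[Pmd]) -> Tuple[List[Tuple[int, int]], int]:
    """Check every PTE's dirty bit; return (dirty pages, PTEs visited)."""
    found, visited = [], 0
    for i, pmd in enumerate(table):
        for j, pte in enumerate(pmd.ptes):
            visited += 1
            if pte.dirty:
                found.append((i, j))
    return found, visited


def voluntary_scan(table: List[Pmd]) -> Tuple[List[Tuple[int, int]], int]:
    """Visit PTEs only under PMDs whose accessed bit is set, then clear it."""
    found, visited = [], 0
    for i, pmd in enumerate(table):
        if not pmd.accessed:
            continue  # hint says nothing under this PMD was touched
        pmd.accessed = False
        for j, pte in enumerate(pmd.ptes):
            visited += 1
            if pte.dirty:
                found.append((i, j))
    return found, visited


if __name__ == "__main__":
    table = make_table(n_pmds=4, ptes_per_pmd=8)
    write(table, 1, 3)
    write(table, 2, 0)
    print("mandatory:", mandatory_scan(table))
    print("voluntary:", voluntary_scan(table))
    # The accessed bits are now clear but the pages are still dirty in this
    # model, so a second voluntary pass finds nothing.
    print("voluntary again:", voluntary_scan(table))
```

In this model the voluntary scan visits far fewer PTEs, but once the
accessed bits have been cleared it no longer sees pages that are still
dirty. That is one way to picture why the accessed bit can only narrow
the search, while a mandatory sync has to rely on the dirty bit.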

> > In fact, I've discussed this with one of the arm maintainers Will. So
> > please check with him too if you are interested in moving forward with
> > the idea. I might be able to provide you with additional data if you need
> > it to make a decision.
> 
> I am interested in running it and have some data without NONLEAF
> especially while free memory is very limited and the system has memory
> thrashing.

v7 has a switch to disable this feature on x86. If you can run your
workloads on x86, it might help you measure the difference.