linux-kernel - Re: [PATCH] mm: clear 1G pages with streaming stores on x86

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200311005447.jkpsaghrpk3c4rwu@box>
Date:   Wed, 11 Mar 2020 03:54:47 +0300
From:   "Kirill A. Shutemov" <kirill@...temov.name>
To:     Cannon Matthews <cannonmatthews@...gle.com>
Cc:     Matthew Wilcox <willy@...radead.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Michal Hocko <mhocko@...nel.org>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Rientjes <rientjes@...gle.com>,
        Greg Thelen <gthelen@...gle.com>,
        Salman Qazi <sqazi@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, x86@...nel.org
Subject: Re: [PATCH] mm: clear 1G pages with streaming stores on x86

On Tue, Mar 10, 2020 at 05:21:30PM -0700, Cannon Matthews wrote:
> On Mon, Mar 9, 2020 at 11:37 AM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > On Mon, Mar 09, 2020 at 08:38:31AM -0700, Andi Kleen wrote:
> > > > Gigantic huge pages are a bit different. They are much less dynamic from
> > > > the usage POV in my experience. Micro-optimizations for the first access
> > > > tends to not matter at all as it is usually pre-allocation scenario. On
> > > > the other hand, speeding up the initialization sounds like a good thing
> > > > in general. It will be a single time benefit but if the additional code
> > > > is not hard to maintain then I would be inclined to take it even with
> > > > "artificial" numbers state above. There really shouldn't be other downsides
> > > > except for the code maintenance, right?
> > >
> > > There's a cautious tale of the old crappy RAID5 XOR assembler functions which
> > > were optimized a long time ago for the Pentium1, and stayed around,
> > > even though the compiler could actually do a better job.
> > >
> > > String instructions are constantly improving in performance (Broadwell is
> > > very old at this point) Most likely over time (and maybe even today
> > > on newer CPUs) you would need much more sophisticated unrolled MOVNTI variants
> > > (or maybe even AVX-*) to be competitive.
> >
> > Presumably you have access to current and maybe even some unreleased
> > CPUs ... I mean, he's posted the patches, so you can test this hypothesis.
> 
> I don't have the data at hand, but could reproduce it if strongly
> desired, but I've also tested this on skylake and  cascade lake, and
> we've had success running with this for a while now.
> 
> When developing this originally, I tested all of this compared with
> AVX-* instructions as well as the string ops, they all seemed to be
> functionally equivalent, and all were beat out by this MOVNTI thing for
> large regions of 1G pages.
> 
> There is probably room to further optimize the MOVNTI stuff with better
> loop unrolling or optimizations, if anyone has specific suggestions I'm
> happy to try to incorporate them, but this has shown to be effective as
> written so far, and I think I lack that assembly expertise to micro
> optimize further on my own.

Andi's point is that string instructions might be a better bet in a long
run. You may win something with MOVNTI on current CPUs, but it may become
a burden on newer microarchitectures when string instructions improves.
Nobody realistically would re-validate if MOVNTI microoptimazation still
make sense for every new microarchitecture.

> 
> But just in general, while there are probably some ways this could be
> made better, it does a good job so far for the workloads that are more
> specific to 1G pages.
> 
> Making it work for 2MiB in a convincing general purpose way is a harder
> problem and feels out of scope, and further optimizations can always be
> added later on for some other things.
> 
> I'm working on a v2 of this patch addressing some of the nits mentioned
> by Andrew, should have that hopefully soon.

Have you got any data for a macrobenchmark?

-- 
 Kirill A. Shutemov