linux-kernel - Re: [PATCH 0/5] Speed up mremap on large regions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201002053547.7roe7b4mpamw4uk2@black.fi.intel.com>
Date:   Fri, 2 Oct 2020 08:35:47 +0300
From:   "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
To:     Lokesh Gidra <lokeshgidra@...gle.com>
Cc:     Kalesh Singh <kaleshsingh@...gle.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Minchan Kim <minchan@...gle.com>,
        Joel Fernandes <joelaf@...gle.com>,
        "Cc: Android Kernel" <kernel-team@...roid.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        the arch/x86 maintainers <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Shuah Khan <shuah@...nel.org>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        Kees Cook <keescook@...omium.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Sami Tolvanen <samitolvanen@...gle.com>,
        Masahiro Yamada <masahiroy@...nel.org>,
        Arnd Bergmann <arnd@...db.de>,
        Frederic Weisbecker <frederic@...nel.org>,
        Krzysztof Kozlowski <krzk@...nel.org>,
        Hassan Naveed <hnaveed@...ecomp.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Mark Rutland <mark.rutland@....com>,
        Mike Rapoport <rppt@...nel.org>, Gavin Shan <gshan@...hat.com>,
        Zhenyu Ye <yezhenyu2@...wei.com>, Jia He <justin.he@....com>,
        John Hubbard <jhubbard@...dia.com>,
        William Kucharski <william.kucharski@...cle.com>,
        Sandipan Das <sandipan@...ux.ibm.com>,
        Ralph Campbell <rcampbell@...dia.com>,
        Mina Almasry <almasrymina@...gle.com>,
        Ram Pai <linuxram@...ibm.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Kamalesh Babulal <kamalesh@...ux.vnet.ibm.com>,
        Masami Hiramatsu <mhiramat@...nel.org>,
        Brian Geffon <bgeffon@...gle.com>,
        SeongJae Park <sjpark@...zon.de>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        "moderated list:ARM64 PORT (AARCH64 ARCHITECTURE)" 
        <linux-arm-kernel@...ts.infradead.org>,
        "open list:MEMORY MANAGEMENT" <linux-mm@...ck.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH 0/5] Speed up mremap on large regions

On Thu, Oct 01, 2020 at 05:09:02PM -0700, Lokesh Gidra wrote:
> On Thu, Oct 1, 2020 at 9:00 AM Kalesh Singh <kaleshsingh@...gle.com> wrote:
> >
> > On Thu, Oct 1, 2020 at 8:27 AM Kirill A. Shutemov
> > <kirill.shutemov@...ux.intel.com> wrote:
> > >
> > > On Wed, Sep 30, 2020 at 03:42:17PM -0700, Lokesh Gidra wrote:
> > > > On Wed, Sep 30, 2020 at 3:32 PM Kirill A. Shutemov
> > > > <kirill.shutemov@...ux.intel.com> wrote:
> > > > >
> > > > > On Wed, Sep 30, 2020 at 10:21:17PM +0000, Kalesh Singh wrote:
> > > > > > mremap time can be optimized by moving entries at the PMD/PUD level if
> > > > > > the source and destination addresses are PMD/PUD-aligned and
> > > > > > PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and
> > > > > > x86. Other architectures where this type of move is supported and known to
> > > > > > be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD
> > > > > > and HAVE_MOVE_PUD.
> > > > > >
> > > > > > Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
> > > > > > region on x86 and arm64:
> > > > > >
> > > > > >     - HAVE_MOVE_PMD is already enabled on x86 : N/A
> > > > > >     - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
> > > > > >
> > > > > >     - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
> > > > > >     - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
> > > > > >
> > > > > >           Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
> > > > > >           give a total of ~150x speed up on arm64.
> > > > >
> > > > > Is there a *real* workload that benefit from HAVE_MOVE_PUD?
> > > > >
> > > > We have a Java garbage collector under development which requires
> > > > moving physical pages of multi-gigabyte heap using mremap. During this
> > > > move, the application threads have to be paused for correctness. It is
> > > > critical to keep this pause as short as possible to avoid jitters
> > > > during user interaction. This is where HAVE_MOVE_PUD will greatly
> > > > help.
> > >
> > > Any chance to quantify the effect of mremap() with and without
> > > HAVE_MOVE_PUD?
> > >
> > > I doubt it's a major contributor to the GC pause. I expect you need to
> > > move tens of gigs to get sizable effect. And if your GC routinely moves
> > > tens of gigs, maybe problem somewhere else?
> > >
> > > I'm asking for numbers, because increase in complexity comes with cost.
> > > If it doesn't provide an substantial benefit to a real workload
> > > maintaining the code forever doesn't make sense.
> >
> mremap is indeed the biggest contributor to the GC pause. It has to
> take place in what is typically known as a 'stop-the-world' pause,
> wherein all application threads are paused. During this pause the GC
> thread flips the GC roots (threads' stacks, globals etc.), and then
> resumes threads along with concurrent compaction of the heap.This
> GC-root flip differs depending on which compaction algorithm is being
> used.
> 
> In our case it involves updating object references in threads' stacks
> and remapping java heap to a different location. The threads' stacks
> can be handled in parallel with the mremap. Therefore, the dominant
> factor is indeed the cost of mremap. From patches 2 and 4, it is clear
> that remapping 1GB without this optimization will take ~9ms on arm64.
> 
> Although this mremap has to happen only once every GC cycle, and the
> typical size is also not going to be more than a GB or 2, pausing
> application threads for ~9ms is guaranteed to cause jitters. OTOH,
> with this optimization, mremap is reduced to ~60us, which is a totally
> acceptable pause time.
> 
> Unfortunately, implementation of the new GC algorithm hasn't yet
> reached the point where I can quantify the effect of this
> optimization. But I can confirm that without this optimization the new
> GC will not be approved.

IIUC, the 9ms -> 90us improvement attributed to combination HAVE_MOVE_PMD
and HAVE_MOVE_PUD, right? I expect HAVE_MOVE_PMD to be reasonable for some
workloads, but marginal benefit of HAVE_MOVE_PUD is in doubt. Do you see
it's useful for your workload?

-- 
 Kirill A. Shutemov