Message-ID: <alpine.DEB.2.21.1811281504030.231719@chino.kir.corp.google.com>
Date:   Wed, 28 Nov 2018 15:10:04 -0800 (PST)
From:   David Rientjes <rientjes@...gle.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
cc:     ying.huang@...el.com, Andrea Arcangeli <aarcange@...hat.com>,
        Michal Hocko <mhocko@...e.com>, s.priebe@...fihost.ag,
        mgorman@...hsingularity.net,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        alex.williamson@...hat.com, lkp@...org, kirill@...temov.name,
        Andrew Morton <akpm@...ux-foundation.org>,
        zi.yan@...rutgers.edu, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3%
 regression

On Wed, 28 Nov 2018, Linus Torvalds wrote:

> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying <ying.huang@...el.com> wrote:
> >
> > From the above data, for the parent commit 3 processes exited within
> > 14s, another 3 exited within 100s.  For this commit, the first process
> > exited at 203s.  That is, this commit makes memory allocation fairer
> > among processes, so that processes proceed at more similar speeds.  But
> > this also raises the system memory footprint, which triggers much more
> > swapping and thus a lower benchmark score.
> >
> > In general, memory allocation fairness among processes should be a good
> > thing.  So I think the report should have been a "performance
> > improvement" instead of "performance regression".
> 
> Hey, when you put it that way...
> 
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.
> 

Well, I originally complained[*] when the change was first proposed and 
when the stable backports were proposed[**].  On a fragmented host, the 
change itself showed a 13.9% access latency regression on Haswell and up 
to a 40% allocation latency regression; both are more substantial on 
Naples and Rome.  I have also measured similar numbers for Haswell 
myself.

We are particularly hit hard by this because we have libraries that remap 
the text segment of binaries to hugepages; hugetlbfs is not widely used, 
so this normally falls back to transparent hugepages.  We mmap(), 
madvise(MADV_HUGEPAGE), memcpy(), and mremap().  We fully accept the 
latency of doing this when the binary starts because the access latency 
at runtime is so much better.
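
Roughly, the sequence looks like the sketch below.  This is a 
simplification, not our actual library code: the function and constant 
names are mine, ELF segment lookup, hugepage alignment of the staging 
mapping, and running the copy from code that is not itself being remapped 
are all omitted.

	#define _GNU_SOURCE
	#include <string.h>
	#include <sys/mman.h>

	#define HPAGE_SIZE	(2UL << 20)	/* assume 2MB hugepages */

	/*
	 * Simplified sketch: replace [text, text + len) with a copy that
	 * is (hopefully) backed by transparent hugepages.
	 */
	static int remap_text_to_thp(void *text, size_t len)
	{
		size_t sz = (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
		void *tmp;

		tmp = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (tmp == MAP_FAILED)
			return -1;

		/* The madvise() whose hugepage allocation just regressed. */
		if (madvise(tmp, sz, MADV_HUGEPAGE))
			goto out_unmap;

		/* Copy the text, faulting in (ideally local) hugepages. */
		memcpy(tmp, text, len);

		if (mprotect(tmp, sz, PROT_READ | PROT_EXEC))
			goto out_unmap;

		/* Replace the original text mapping in one step. */
		if (mremap(tmp, sz, sz, MREMAP_MAYMOVE | MREMAP_FIXED,
			   text) == MAP_FAILED)
			goto out_unmap;

		return 0;

	out_unmap:
		munmap(tmp, sz);
		return -1;
	}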

With this change, however, we have no userspace workaround other than 
mbind() to prefer the local node.  On all of our platforms, native-sized 
pages are always a win over remote hugepages, and they leave open the 
opportunity for khugepaged to collapse the memory into hugepages later if 
fragmentation is the issue.  mbind() is not viable, though, if the local 
node is saturated: we are ok with falling back to remote pages of the 
native page size when the local node is oom, whereas using mbind() to 
retain the old behavior would instead result in an oom kill.
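
To make that concrete, the only knob available is something like the 
following.  This is purely illustrative, not code we ship: 
numa_node_of_cpu() from libnuma and a hard MPOL_BIND policy are my 
shorthand for "bind the madvised region to the local node", which is 
exactly the behavior that oom kills us when the node is saturated.

	#define _GNU_SOURCE
	#include <numa.h>	/* numa_node_of_cpu(), link with -lnuma */
	#include <numaif.h>	/* mbind(), MPOL_BIND */
	#include <sched.h>	/* sched_getcpu() */

	/*
	 * Illustrative only: hard-binding the region to the local node
	 * retains the old local-hugepage behavior, but once that node is
	 * saturated the allocation cannot fall back to remote native-sized
	 * pages, so we get an oom kill instead of a working mapping.
	 * Assumes the local node number is below 64.
	 */
	static long bind_to_local_node(void *addr, unsigned long len)
	{
		int node = numa_node_of_cpu(sched_getcpu());
		unsigned long nodemask = 1UL << node;

		return mbind(addr, len, MPOL_BIND, &nodemask,
			     sizeof(nodemask) * 8, 0);
	}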

Given this severe access and allocation latency regression, we must revert 
this patch in our own kernel; there is simply no path forward without 
doing so.

[*] https://marc.info/?l=linux-kernel&m=153868420126775
[**] https://marc.info/?l=linux-kernel&m=154269994800842
