Message-ID: <alpine.DEB.2.21.1812091420350.95551@chino.kir.corp.google.com>
Date: Sun, 9 Dec 2018 14:44:23 -0800 (PST)
From: David Rientjes <rientjes@...gle.com>
To: Andrea Arcangeli <aarcange@...hat.com>
cc: Michal Hocko <mhocko@...nel.org>, Vlastimil Babka <vbabka@...e.cz>,
Linus Torvalds <torvalds@...ux-foundation.org>,
ying.huang@...el.com, s.priebe@...fihost.ag,
mgorman@...hsingularity.net,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
alex.williamson@...hat.com, lkp@...org, kirill@...temov.name,
Andrew Morton <akpm@...ux-foundation.org>,
zi.yan@...rutgers.edu
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation
regressions
On Wed, 5 Dec 2018, Andrea Arcangeli wrote:
> > I must have said this at least six or seven times: fault latency is
>
> In your original regression report in this thread to Linus:
>
> https://lkml.kernel.org/r/alpine.DEB.2.21.1811281504030.231719@chino.kir.corp.google.com
>
> you said "On a fragmented host, the change itself showed a 13.9%
> access latency regression on Haswell and up to 40% allocation latency
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> regression. This is more substantial on Naples and Rome. I also
> ^^^^^^^^^^
> measured similar numbers to this for Haswell."
>
> > secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
> > users to do synchronous compaction and try to make a hugepage available.
>
> I'm glad you said it six or seven times now, because you forgot to
> mention in the above email that the "40% allocation/fault latency
> regression" you reported above is actually a secondary concern, because
> those must be long-lived allocations and we can't yet generate
> compound pages for free after all...
>
I've been referring to the long history of this discussion, namely my
explicit Nacked-by in https://marc.info/?l=linux-kernel&m=153868420126775
two months ago citing the 13.9% access latency regression. The patch was
nonetheless merged; I proposed the revert for the same chief complaint,
and it was reverted.
I brought up the access latency issue three months ago in
https://marc.info/?l=linux-kernel&m=153661012118046 and said allocation
latency was a secondary concern, specifically that our users of
MADV_HUGEPAGE are willing to accept the increased allocation latency for
local hugepages.
> BTW, I never bothered to ask yet, but, did you enable NUMA balancing
> in your benchmarks? NUMA balancing would fix the access latency very
> easily too, so that 13.9% access latency must quickly disappear if you
> correctly have NUMA balancing enabled in a NUMA system.
>
No, we do not have CONFIG_NUMA_BALANCING enabled. The __GFP_THISNODE
behavior for hugepages was added in 4.0 for the PPC use case, not by me.
That had nothing to do with the madvise mode: the initial documentation
referred to the mode as a way to prevent an increase in rss for configs
where "enabled" was set to madvise. The allocation policy was never about
MADV_HUGEPAGE in any 4.x kernel; it was only an indication for certain
defrag settings of how much work should be done to allocate *local*
hugepages at fault.
If you are saying that the change in allocator policy came in a patch from
Aneesh almost four years ago and went unreported by anybody until a few
months ago, I can understand the frustration. I do, however, support the
__GFP_THISNODE change he made because his data shows the same results as
mine.
I've suggested a very simple extension, specifically a prctl() mode that
is inherited across fork, that would allow a workload to specify that it
prefers remote allocations over local compaction/reclaim because it is too
large to fit on a single node. I'd value your feedback on that suggestion
to fix your use case.