Message-ID: <alpine.DEB.2.21.1909081328220.178796@chino.kir.corp.google.com>
Date:   Sun, 8 Sep 2019 13:45:13 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Vlastimil Babka <vbabka@...e.cz>
cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Michal Hocko <mhocko@...e.com>, Mel Gorman <mgorman@...e.de>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote
 hugepages

On Sun, 8 Sep 2019, Vlastimil Babka wrote:

> > On Sat, 7 Sep 2019, Linus Torvalds wrote:
> > 
> >>> Andrea acknowledges the swap storm that he reported would be fixed with
> >>> the last two patches in this series
> >>
> >> The problem is that even you aren't arguing that those patches should
> >> go into 5.3.
> >>
> > 
> > For three reasons: (a) we lack a test result from Andrea,
> 
> That's argument against the rfc patches 3+4s, no? But not for including
> the reverts of reverts of reverts (patches 1+2).
> 

Yes, thanks: I would strongly prefer not to propose rfc patches 3-4 
without a test result from Andrea and collaboration on fixing the 
underlying issue.  My suggestion to Linus is to merge patches 1-2 so we 
don't have different semantics for MADV_HUGEPAGE or thp enabled=always 
configs depending on kernel version, especially since they are already 
conflated.

> > (b) there's 
> > on-going discussion, particularly based on Vlastimil's feedback, and 
> 
> I doubt this will be finished and tested with reasonable confidence even
> for the 5.4 merge window.
> 

Depends, but I suspect the same.  If the reverts are not applied to 5.3, 
then I'm not at all confident that forward progress will be made on this 
issue: my suggestion about changes to the page allocator when the patches 
were initially proposed went unanswered, as did the ping on those 
suggestions, and now we have a simplistic "this will fix the swap storms" 
but no active involvement from Andrea to improve this; he is likely quite 
content to lump NUMA policy onto an already overloaded madvise mode.

 [ NOTE! The rest of this email and my responses are about how to address
   the default page allocation behavior which we can continue to discuss
   but I'd prefer it separated from the discussion of reverts for 5.3
   which needs to be done to not conflate madvise modes with mempolicies
   for a subset of kernel versions. ]

> > It indicates that progress has been made to address the actual bug without 
> > introducing long-lived access latency regressions for others, particularly 
> > those who use MADV_HUGEPAGE.  In the worst case, some systems running 
> > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but 
> > on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a
> 
> It's been said before, but such sensitive code generally relies on
> mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation
> details. Or if you know there's enough free memory and just needs to be
> compacted, you could do it once via sysfs before starting up your workload.
> 

This entire discussion is based on the long-standing default behavior of 
page allocation for transparent hugepages.  Your suggestions are not 
possible for two reasons: (1) I cannot enforce a mempolicy of MPOL_BIND 
because it does not allow fallback at all and would oom kill if the local 
node is oom, and (2) node reclaim mode is a system-wide setting, so all 
workloads are affected for every page allocation, not only users of 
MADV_HUGEPAGE who specifically opt in to expensive allocation.
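
To illustrate the MPOL_BIND point, an untested sketch (node 0 stands in 
for the local node; needs libnuma's numaif.h and -lnuma):

#include <numaif.h>	/* set_mempolicy(), MPOL_BIND */
#include <stdlib.h>

static void bind_to_local_node(void)
{
	unsigned long nodemask = 1UL << 0;	/* node 0 only */

	/*
	 * MPOL_BIND restricts allocations to the nodemask with no
	 * fallback: if node 0 is out of memory, the allocation fails or
	 * the task is oom killed rather than spilling to a remote node.
	 */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
		abort();
}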

We could make the argument that Andrea's qemu use case could simply use 
MPOL_PREFERRED for memory that should be faulted remotely; that would 
provide more control and would work for all versions of Linux regardless 
of whether MADV_HUGEPAGE is used.  That's a much simpler workaround than 
conflating MADV_HUGEPAGE with NUMA locality, asking users who are adversely 
affected by 5.3 to create new mempolicies to work around something that 
has always worked fine, or asking users to tune page allocator policies 
with sysctls.
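
Something along these lines is what I have in mind for the qemu case; an 
untested sketch only, the preferred node and the region size are 
placeholders (needs numaif.h and -lnuma):

#include <stddef.h>
#include <numaif.h>	/* mbind(), MPOL_PREFERRED */
#include <sys/mman.h>	/* mmap(), madvise(), MADV_HUGEPAGE */

static void *alloc_guest_region(size_t len, int preferred_node)
{
	unsigned long nodemask = 1UL << preferred_node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	/* Prefer the given node but allow fallback instead of oom. */
	mbind(p, len, MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8, 0);

	/* Still opt in to hugepages for this mapping. */
	madvise(p, len, MADV_HUGEPAGE);
	return p;
}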

> > I'm arguing to revert 5.3 back to the behavior that we have had for years 
> > and actually fix the bug that everybody else seems to be ignoring and then 
> > *backport* those fixes to 5.3 stable and every other stable tree that can 
> > use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that
> 
> I think it's rather removing the problematic implicit mempolicy of
> __GFP_THISNODE.
> 

I'm referring to a solution that is backwards compatible for existing 
users, which 5.3 certainly is not.

> I might have missed something, but you were asked for a reproducer of
> your use case so others can develop patches with it in mind? Mel did
> provide a simple example that shows the swap storms very easily.
> 

Are you asking for a synthetic kernel module that you can inject to induce 
fragmentation on a local node where memory compaction would be possible, 
and then a userspace program that uses MADV_HUGEPAGE and fits within that 
node?  The regression I'm reporting is for workloads that fit within a 
socket; it takes local fragmentation to show it.

For the qemu case, it's quite easy to fill a local node and require 
additional hugepage allocations with MADV_HUGEPAGE in a test case, but 
without synthetically inducing fragmentation I cannot provide a testcase 
that will show the performance regression, because memory is quickly 
faulted remotely rather than compacted locally.
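
The userspace half of such a testcase would be trivial, something like the 
untested sketch below; the hard part is fragmenting the local node first, 
which is not shown:

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static int touch_hugepage_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;

	/*
	 * Opt in to hugepages; with the old __GFP_THISNODE behavior this
	 * compacts and allocates on the local node.
	 */
	madvise(p, len, MADV_HUGEPAGE);

	/* Fault the region so the fault path has to find hugepages. */
	memset(p, 0, len);
	return 0;
}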
