Message-ID: <alpine.DEB.2.21.1812051126250.240991@chino.kir.corp.google.com>
Date: Wed, 5 Dec 2018 11:41:31 -0800 (PST)
From: David Rientjes <rientjes@...gle.com>
To: Mel Gorman <mgorman@...hsingularity.net>
cc: Michal Hocko <mhocko@...nel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrea Arcangeli <aarcange@...hat.com>, ying.huang@...el.com,
s.priebe@...fihost.ag,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
alex.williamson@...hat.com, lkp@...org, kirill@...temov.name,
Andrew Morton <akpm@...ux-foundation.org>,
zi.yan@...rutgers.edu, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation
regressions
On Wed, 5 Dec 2018, Mel Gorman wrote:
> > This is a single MADV_HUGEPAGE usecase, there is nothing special about it.
> > It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
> > faulted the memory with a fragmented local node and then measured the
> > remote access latency to the remote hugepage that occurs without setting
> > __GFP_THISNODE. You can also measure the remote allocation latency by
> > fragmenting the entire system and then faulting.
> >
>
> I'll make the same point as before: the form the fragmentation takes
> matters, as well as the types of pages that are resident and whether
> they are active or not. It affects the level of work the system does
> as well as the overall success rate of operations (be it reclaim, THP
> allocation, compaction, whatever). This is why a reproduction case that
> is representative of the problem you're facing on the real workload
> would have been helpful, because then any alternative proposal could
> have taken your workload into account during testing.
>
We know from Andrea's report that compaction is failing, and failing
repeatedly; otherwise we would not need excessive swapping to make it
work. That can mean one of two things: (1) a general low-on-memory
situation that repeatedly leaves us under the watermarks that deem
compaction suitable (isolate_freepages() would be too painful), or (2)
compaction has the memory that it needs but fails to make a hugepage
available because not all pages in a pageblock can be migrated.
If (1), perhaps in the presence of an antagonist that quickly allocates
the memory before compaction can pass its watermark checks, further
reclaim is not beneficial: the allocation becomes too expensive and there
is no guarantee that compaction will find the reclaimed memory in
isolate_freepages().
I chose to duplicate (2) by synthetically introducing fragmentation
locally (allocate high-order slab, free every other one) to test the
patch that does not set __GFP_THISNODE. The result is a remote
transparent hugepage, but we do not even need to reach the point of local
compaction for that fallback to happen. And this is where I measure the
13.9% access latency regression for the lifetime of the binary as a
result of this patch.
If local compaction works the first time, great! But that is not what is
happening in Andrea's report, and as a result of not setting
__GFP_THISNODE we are *guaranteed* worse access latency and may see even
worse allocation latency if the remote memory is fragmented as well.
So while I'm only testing the functional behavior of the patch itself, I
cannot speak to the nature of the local fragmentation on Andrea's systems.