linux-kernel - Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

Message-ID: <alpine.DEB.2.21.1812051142040.240991@chino.kir.corp.google.com>
Date:   Wed, 5 Dec 2018 11:49:26 -0800 (PST)
From:   David Rientjes <rientjes@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
cc:     Vlastimil Babka <vbabka@...e.cz>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>, ying.huang@...el.com,
        s.priebe@...fihost.ag, mgorman@...hsingularity.net,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        alex.williamson@...hat.com, lkp@...org, kirill@...temov.name,
        Andrew Morton <akpm@...ux-foundation.org>,
        zi.yan@...rutgers.edu
Subject: Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation
 regressions

On Wed, 5 Dec 2018, Michal Hocko wrote:

> > The revert is certainly needed to prevent the regression, yes, but I 
> > anticipate that Andrea will report back that patch 2 at least improves the 
> > situation for the problem that he was addressing, specifically that it is 
> > pointless to thrash any node or reclaim unnecessarily when compaction has 
> > already failed.  This is what setting __GFP_NORETRY for all thp fault 
> > allocations fixes.
> 
> Yes, but earlier numbers from Mel, repeated again in [1], simply show
> that the swap storms are only addressed at the cost of an absolute drop
> in THP success rate.
>  

As we've been over countless times, this is the desired effect for 
workloads that fit on a single node.  We want local pages of the native 
page size because they (1) are accessed faster than remote hugepages and 
(2) are candidates for collapse by khugepaged.

For applications that do not fit in a single node, we have discussed 
possible ways to extend the API to allow remote faulting of hugepages 
when remote memory is not fragmented either.  That way the long-standing 
behavior is preserved, and large applications can use the API to increase 
their thp success rate.

> Yes, this is understood. So we are getting the worst of both. We have a
> numa locality side effect of MADV_HUGEPAGE and we have poor THP
> utilization. So how is this an improvement? Especially when the
> reported regression hasn't been demonstrated on a real or repeatable
> workload, but rather on a very vague, presumably worst-case behavior
> where the access penalty absolutely prevails.
> 

High thp utilization is not always better, especially when those hugepages 
are accessed remotely and introduce the regressions that I've reported.  
Seeking high thp utilization at all costs is not the goal if it causes 
workloads to regress.