linux-kernel - Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.21.1812031226510.192288@chino.kir.corp.google.com>
Date:   Mon, 3 Dec 2018 12:36:09 -0800 (PST)
From:   David Rientjes <rientjes@...gle.com>
To:     Andrea Arcangeli <aarcange@...hat.com>
cc:     Linus Torvalds <torvalds@...ux-foundation.org>, mhocko@...nel.org,
        ying.huang@...el.com, s.priebe@...fihost.ag,
        mgorman@...hsingularity.net,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        alex.williamson@...hat.com, lkp@...org, kirill@...temov.name,
        Andrew Morton <akpm@...ux-foundation.org>,
        zi.yan@...rutgers.edu, Vlastimil Babka <vbabka@...e.cz>
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3%
 regression

On Mon, 3 Dec 2018, Andrea Arcangeli wrote:

> In my earlier review of David's patch, it looked runtime equivalent to
> the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a
> new gfpflag until we're sure we need it but it's the worst solution
> available for the long term in my view. It'd be ok to apply it as
> stop-gap measure though.
> 
> The "order == pageblock_order" hardcoding inside the allocator to
> workaround the __GFP_THISNODE flag passed from outside the allocator
> in the THP MADV_HUGEPAGE case, didn't look very attractive because
> it's not just THP allocating order >0 pages.
> 

We have two different things to consider: NUMA locality and the order of 
the allocation.  THP is preferred locally and we know the order.  For the 
other high-order pages you're referring to, I don't know if they are using 
__GFP_THISNODE or not (likely not).  I see them as two separate issues.

For thp on all platforms I have measured it on specifically for this patch 
(Broadwell, Haswell, Rome) there is a clear advantage to faulting local 
pages of the native page size over remote hugepages.  It also has the 
added effect of allowing khugepaged to collapse it into a hugepage later 
if fragmentation allows (the reason why khugepaged cares about NUMA 
locality, the same reason I do).  This is the rationale for __GFP_THISNODE 
for thp allocations.

For order == pageblock_order (or more correctly order >= pageblock_order), 
this is not based on NUMA whatsoever but is rather based on the 
implementation of memory compaction.  If it has already failed (or was 
deferred for order-HPAGE_PMD_ORDER), reclaim cannot be shown to help if 
memory compaction cannot utilize the freed memory in isolate_freepages(), 
so that reclaim has been pointless.  If compaction fails for other reasons 
(any unmovable page preventing a pageblock from becoming free), *all* 
reclaim activity has been pointless.  

> It'd be nicer if whatever compaction latency optimization that applies
> to THP could also apply to all other allocation orders too and the
> hardcoding of the THP order prevents that.
> 
> On the same lines if __GFP_THISNODE is so badly needed by
> MADV_HUGEPAGE, all other larger order allocations should also be able
> to take advantage of __GFP_THISNODE without ending in the same VM
> corner cases that required the "order == pageblock_order" hardcoding
> inside the allocator.
> 
> If you prefer David's patch I would suggest pageblock_order to be
> replaced with HPAGE_PMD_ORDER so it's more likely to match the THP
> order in all archs.
> 

That sounds fine and I will do that in my v2.