linux-kernel - Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct


Message-ID: <alpine.DEB.2.21.1812071450270.173448@chino.kir.corp.google.com>
Date:   Fri, 7 Dec 2018 15:05:28 -0800 (PST)
From:   David Rientjes <rientjes@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
cc:     Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        mgorman@...hsingularity.net, Vlastimil Babka <vbabka@...e.cz>,
        ying.huang@...el.com, s.priebe@...fihost.ag,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        alex.williamson@...hat.com, lkp@...org, kirill@...temov.name,
        Andrew Morton <akpm@...ux-foundation.org>,
        zi.yan@...rutgers.edu
Subject: Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling
 into alloc_hugepage_direct_gfpmask"

On Fri, 7 Dec 2018, Michal Hocko wrote:

> > This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.
> > 
> > There are a couple of issues with 89c83fb539f9 independent of its partial 
> > revert in 2f0799a0ffc0 ("mm, thp: restore node-local hugepage 
> > allocations"):
> > 
> > Firstly, the interaction between alloc_hugepage_direct_gfpmask() and 
> > alloc_pages_vma() is racy wrt __GFP_THISNODE and MPOL_BIND.  
> > alloc_hugepage_direct_gfpmask() makes sure not to set __GFP_THISNODE for 
> > an MPOL_BIND policy but the policy used in alloc_pages_vma() may not be 
> > the same for shared vma policies, triggering the WARN_ON_ONCE() in 
> > policy_node().
> 
> Could you share a test case?
> 

Sorry, as Vlastimil pointed out, this race no longer exists: commit 
2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations") removed 
the restructuring of alloc_hugepage_direct_gfpmask().  Before that commit, 
the race existed for shared vma policies, where the policy could be 
switched to MPOL_BIND in the window between 
alloc_hugepage_direct_gfpmask() and alloc_pages_vma().
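
To make that window concrete, here is a minimal user-space sketch of the 
interleaving, not kernel code.  Every sketch_* name and constant is a 
hypothetical stand-in, and the policy change by another task is simulated 
with a plain assignment between the two reads:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's policy modes and gfp flag. */
enum sketch_mode { SKETCH_MPOL_DEFAULT, SKETCH_MPOL_PREFERRED, SKETCH_MPOL_BIND };
#define SKETCH_GFP_THISNODE 0x1u

/* First read of the policy: never ask for THISNODE under a BIND policy. */
static unsigned int sketch_direct_gfpmask(enum sketch_mode mode)
{
	return mode == SKETCH_MPOL_BIND ? 0 : SKETCH_GFP_THISNODE;
}

/* Second read of the policy: the combination the warning guards against. */
static bool sketch_policy_node_would_warn(enum sketch_mode mode, unsigned int gfp)
{
	return mode == SKETCH_MPOL_BIND && (gfp & SKETCH_GFP_THISNODE) != 0;
}

/* Simulates a shared policy switching to BIND between the two reads. */
static bool sketch_race_hits_warning(void)
{
	enum sketch_mode shared = SKETCH_MPOL_DEFAULT;
	unsigned int gfp = sketch_direct_gfpmask(shared);

	/* The first read saw a non-BIND mode, so THISNODE was requested. */
	assert(gfp & SKETCH_GFP_THISNODE);

	shared = SKETCH_MPOL_BIND;	/* assumed: changed by another task */
	return sketch_policy_node_would_warn(shared, gfp);
}
```

If the mode is the same at both reads, the second check cannot fire in 
this model; only the change in between produces the bad combination.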

> > Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat 
> > different policy for hugepage allocations, which were allocated through 
> > alloc_hugepage_vma().  For hugepage allocations, if the allocating 
> > process's node is in the set of allowed nodes, allocate with 
> > __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with 
> > __GFP_THISNODE instead).
> 
> Why is it wrong to fallback to an explicitly configured mbind mask?
> 

The new_page() case is similar to the shmem_alloc_hugepage() case.  Prior 
to 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into 
alloc_hugepage_direct_gfpmask"), shmem_alloc_hugepage() called 
alloc_pages_vma() with hugepage == true, which effected a different 
allocation policy: if the node that current is running on is allowed by 
the policy, use __GFP_THISNODE (assuming ac5b2c18911ff is reverted, which 
it is in Linus's tree).
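
As a rough illustration of that rule, here is a user-space sketch rather 
than the kernel implementation.  The sketch_* names, the 64-bit node mask 
and the preferred_node field are assumptions made for the example; the 
MPOL_PREFERRED handling follows the description quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for __GFP_THISNODE. */
#define SKETCH_GFP_THISNODE 0x1u

struct sketch_policy {
	uint64_t allowed_nodes;	/* bit n set: node n is allowed by the policy */
	int preferred_node;	/* node for a preferred policy, or -1 if none */
};

/*
 * Pick the node to try for a hugepage allocation and return the extra
 * gfp flag.  The candidate is the preferred node if the policy has one,
 * else the node the caller is running on.  THISNODE is requested only
 * when the candidate is in the allowed set; otherwise no flag is added
 * and the allocation follows the regular policy.
 */
static unsigned int sketch_hugepage_gfp(const struct sketch_policy *pol,
					int local_node, int *nid)
{
	int node = pol->preferred_node >= 0 ? pol->preferred_node : local_node;

	assert(node >= 0 && node < 64);	/* limit of this sketch's mask */
	*nid = node;
	if (pol->allowed_nodes & (UINT64_C(1) << node))
		return SKETCH_GFP_THISNODE;
	return 0;
}
```

With allowed_nodes = 0x3 and no preferred node, a caller on node 1 gets 
SKETCH_GFP_THISNODE for node 1, while a caller on node 2 gets no extra 
flag.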

After 89c83fb539f9, we lose that and can fall back to remote memory.  Since 
the discussion of the NUMA aspects of hugepage allocations is ongoing, 
it's better to have a stable 4.20 tree while that is being worked out; the 
change likely deserves separate patches for both new_page() and 
shmem_alloc_hugepage().  For the latter specifically, I assume it would be 
nice to get an Acked-by from Kirill, who implemented 
shmem_alloc_hugepage() with hugepage == true back in 4.8, which also had 
the __GFP_THISNODE behavior, before the allocation policy is suddenly 
changed.