Date:   Mon, 25 Nov 2019 12:38:59 -0800 (PST)
From:   David Rientjes <rientjes@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
cc:     Mel Gorman <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote
 hugepages

On Mon, 25 Nov 2019, Michal Hocko wrote:

> > So my question would be: if we know the previous behavior that allowed 
> > excessive swap and recalling into compaction was deemed harmful for the 
> > local node, why do we now believe it cannot be harmful if done for all 
> > system memory?
> 
> I have to say that I got lost in your explanation. I have already
> pointed this out in a previous email you didn't reply to. But the main
> difference to previous __GFP_THISNODE behavior is that it is used along
> with __GFP_NORETRY and that reduces the overall effort of the reclaim
> AFAIU. If that is not the case then please be _explicit_ why.
> 

I'm referring to the second allocation in alloc_pages_vma() after the 
patch:

 			/*
 			 * If hugepage allocations are configured to always
 			 * synchronous compact or the vma has been madvised
 			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * memory with both reclaim and compact as well.
 			 */
 			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
 				page = __alloc_pages_node(hpage_node,
- 						gfp | __GFP_NORETRY, order);
+							gfp, order);

So we now have neither __GFP_NORETRY nor __GFP_THISNODE, so this bypasses 
all of the precautionary logic in the page allocator that avoids excessive 
swap: the allocation is free to continue looping, swapping, and thrashing, 
trying to allocate hugepages, if all memory is fragmented.

Qemu uses MADV_HUGEPAGE, so this allocation *will* be attempted for 
Andrea's workload.  The swap storms were reported for this same 
allocation, but with __GFP_THISNODE, so in the past they occurred only 
under fragmentation and low-on-memory conditions on the local node.  This 
is now opened up to all nodes.

So the question is: what prevents the exact same issue from happening 
again for Andrea's use case if all memory on the system is fragmented?  
I'd assume that if this were tested under such conditions, the swap 
storms would be much worse.
