linux-kernel - Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53c4a6ca-a4d0-0862-8744-f999b17d82d8@suse.cz>
Date:   Wed, 23 Oct 2019 13:03:36 +0200
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Michal Hocko <mhocko@...nel.org>,
        David Rientjes <rientjes@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Andrea Arcangeli <aarcange@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...e.de>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux-MM <linux-mm@...ck.org>
Subject: Re: [patch for-5.3 0/4] revert immediate fallback to remote hugepages

On 10/18/19 4:15 PM, Michal Hocko wrote:
> It's been some time since I've posted these results. The hugetlb issue
> got resolved but I would still like to hear back about these findings
> because they suggest that the current bail out strategy doesn't seem
> to produce very good results. Essentially it doesn't really help THP
> locality (on moderately filled up nodes) and it introduces a strong
> dependency on kswapd which is not a source of the high order pages.
> Also the overral THP success rate is decreased on a pretty standard "RAM
> is used for page cache" workload.
> 
> That makes me think that the only possible workload that might really
> benefit from this heuristic is a THP demanding one on a heavily
> fragmented node with a lot of free memory while other nodes are not
> fragmented and have quite a lot of free memory. If that is the case, is
> this something to optimize for?
> 
> I am keeping all the results for the reference in a condensed form
> 
> On Tue 01-10-19 10:37:43, Michal Hocko wrote:
>> I have split out my kvm machine into two nodes to get at least some
>> idea how these patches behave
>> $ numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 2
>> node 0 size: 475 MB
>> node 0 free: 432 MB
>> node 1 cpus: 1 3
>> node 1 size: 503 MB
>> node 1 free: 458 MB
>>
>> First run with 5.3 and without THP
>> $ echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> root@...t1:~# sh thp_test.sh 
>> 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4
>> 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4
>> 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4
>>
>> So we get around 56-60% pages to the preferred node
>>
>> Now let's enable THPs
>> AnonHugePages:    407552 kB
>> 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4
>> Few more runs
>> AnonHugePages:    407552 kB
>> 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4
>> AnonHugePages:    407552 kB
>> 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4
>>
>> The utilization is again almost 100% and the preferred node usage
>> varied a lot between 47-91%.
>>
>> Now with 5.3 + all 4 patches this time:
>> AnonHugePages:    401408 kB
>> 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4
>> AnonHugePages:    376832 kB
>> 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4
>> AnonHugePages:    372736 kB
>> 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4
>>
>> The THP utilization varies again and the locality is higher in average
>> 76+%.  Which is even higher than the base page case. I was really

I tried to reproduce your setup locally, and got this for THP case
on 5.4-rc4:

AnonHugePages:    395264 kB
7fdc4a2c0000 prefer:1 anon=102400 dirty=102400 N0=48852 N1=53548 kernelpagesize_kB=4
AnonHugePages:    401408 kB
7f27167e2000 prefer:1 anon=102400 dirty=102400 N0=40095 N1=62305 kernelpagesize_kB=4
AnonHugePages:    378880 kB
7ff693ff9000 prefer:1 anon=102400 dirty=102400 N0=58061 N1=44339 kernelpagesize_kB=4

Somewhat better THP utilization and worse node locality than you.

Then I applied a rebased patch that I proposed before (see below):

AnonHugePages:    407552 kB
7f33fa83a000 prefer:1 anon=102400 dirty=102400 N0=28672 N1=73728 kernelpagesize_kB=4
AnonHugePages:    407552 kB
7faac0aa9000 prefer:1 anon=102400 dirty=102400 N0=48869 N1=53531 kernelpagesize_kB=4
AnonHugePages:    407552 kB
7f9f32c57000 prefer:1 anon=102400 dirty=102400 N0=49664 N1=52736 kernelpagesize_kB=4

The THP utilization is now back at 100% as 5.3 (modulo mis-alignment of
the mem_eater area). This is expected, as the second try that's not limited
to __GFP_THISNODE is also not limited by the newly introduced (in 5.4) heuristics 
that checks COMPACT_SKIPPED. Locality seems similar, can't make any
conclusions with such variation and so few tries.
Could you try confirming that as well? Thanks. But I agree the test is
limited and probably depends on timing wrt kswapd making progress.

----8<----
>From 8bd960e4e8e7e99fe13baf0d00b61910b3ae8d23 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@...e.cz>
Date: Tue, 1 Oct 2019 14:20:58 +0200
Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and
 all-node allocations

THP page faults now attempt a __GFP_THISNODE allocation first, which should
only compact existing free memory, followed by another attempt that can
allocate from any node using reclaim/compaction effort specified by global
defrag setting and madvise.

This patch makes the following changes to the scheme:

- before the patch, the first allocation relies on a check for pageblock order
  and __GFP_IO to prevent excessive reclaim. This however affects also the
  second attempt, which is not limited to single node. Instead of that, reuse
  the existing check for costly order __GFP_NORETRY allocations, and make sure
  the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order
  __GFP_NORETRY allocations will bail out if compaction needs reclaim, while
  previously they only bailed out when compaction was deferred due to previous
  failures. This should be still acceptable within the __GFP_NORETRY semantics.

- before the patch, the second allocation attempt (on all nodes) was passing
  __GFP_NORETRY. This is redundant as the check for pageblock order (discussed
  above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means
  some effort to allocate THP is requested. After this patch, the second
  attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY.

To sum up, THP page faults now try the following attempt:

1. local node only THP allocation with no reclaim, just compaction.
2. THP allocation from any node with effort determined by global defrag setting
   and VMA madvise
3. fallback to base pages on any node

Signed-off-by: Vlastimil Babka <vbabka@...e.cz>
---
 mm/mempolicy.c  | 16 +++++++++-------
 mm/page_alloc.c | 24 +++++-------------------
 2 files changed, 14 insertions(+), 26 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..2c48146f3ee2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
+			/*
+			 * First, try to allocate THP only on local node, but
+			 * don't reclaim unnecessarily, just compact.
+			 */
 			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
 
 			/*
-			 * If hugepage allocations are configured to always
-			 * synchronous compact or the vma has been madvised
-			 * to prefer hugepage backing, retry allowing remote
-			 * memory as well.
+			 * If that fails, allow both compaction and reclaim,
+			 * but on all nodes.
 			 */
-			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+			if (!page)
 				page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_NORETRY, order);
+								gfp, order);
 
 			goto out;
 		}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ecc3dbad606b..36d7d852f7b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		 if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
-		     !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes some THP page fault allocations
+		 */
+		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
@@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			if (compact_result == COMPACT_SKIPPED ||
 			    compact_result == COMPACT_DEFERRED)
 				goto nopage;
-		}
-
-		/*
-		 * Checks for costly allocations with __GFP_NORETRY, which
-		 * includes THP page fault allocations
-		 */
-		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
-			/*
-			 * If compaction is deferred for high-order allocations,
-			 * it is because sync compaction recently failed. If
-			 * this is the case and the caller requested a THP
-			 * allocation, we do not want to heavily disrupt the
-			 * system, so we fail the allocation instead of entering
-			 * direct reclaim.
-			 */
-			if (compact_result == COMPACT_DEFERRED)
-				goto nopage;
 
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
-- 
2.23.0