linux-kernel - Re: [PATCH] mm: Do not stall in synchronous compaction for THP allocations

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20111110161331.GG3083@suse.de>
Date:	Thu, 10 Nov 2011 16:13:31 +0000
From:	Mel Gorman <mgorman@...e.de>
To:	Minchan Kim <minchan.kim@...il.com>
Cc:	Andrew Morton <akpm@...ux-foundation.org>, Jan Kara <jack@...e.cz>,
	Andy Isaacson <adi@...apodia.org>,
	Johannes Weiner <jweiner@...hat.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH] mm: Do not stall in synchronous compaction for THP
 allocations

On Fri, Nov 11, 2011 at 12:12:01AM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> You should have Cced with me because __GFP_NORETRY is issued by me.
> 

I don't understand. __GFP_NORETRY has been around a long time ago
and has loads of users. Do you have particular concerns about the
behaviour of __GFP_NORETRY?

> On Thu, Nov 10, 2011 at 11:22 PM, Mel Gorman <mgorman@...e.de> wrote:
> > On Thu, Nov 10, 2011 at 10:06:16AM +0000, Mel Gorman wrote:
> >> than stall. It was suggested that __GFP_NORETRY be used instead of
> >> __GFP_NO_KSWAPD. This would look less like a special case but would
> >> still cause compaction to run at least once with sync compaction.
> >>
> >
> > This comment is bogus - __GFP_NORETRY would have caught THP allocations
> > and would not call sync compaction. The issue was that it would also
> > have caught any hypothetical high-order GFP_THISNODE allocations that
> > end up calling compaction here
> 
> In fact, the I support patch concept so I would like to give
> 
> Acked-by: Minchan Kim <minchan.kim@...il.com>
> But it is still doubt about code.
> 
> __GFP_NORETRY: The VM implementation must not retry indefinitely
> 
> What could people think if they look at above comment?
> At least, I can imagine two
> 
> First, it is related on *latency*.

I never read it to mean that. I always read it to mean "it must
not retry indefinitely". I expected it was still fine to try direct
reclaim which is not a latency-sensitive operation.

> Second, "I can handle if VM fails allocation"
> 

Fully agreed.

> I am biased toward latter.
> Then, __GFP_NO_KSWAPD is okay? It means "let's avoid sync compaction
> or long latency"?

and do not wake kswapd to swap additional pages.

> It's rather awkward name. Already someone started to use
> __GFP_NO_KSWAPD as such purpose.
> See mtd_kmalloc_up_to. He mentioned in comment of function as follows,
> 
>  * the system page size. This attempts to make sure it does not adversely
>  * impact system performance, so when allocating more than one page, we
>  * ask the memory allocator to avoid re-trying, swapping, writing back
>  * or performing I/O.
> 

I was not aware of this user although their reason for using it
seems fine. We are not breaking their expectations with this patch but
it is a slippery slope.

> That thing was what I concerned.
> In future, new users of __GFP_NO_KSWAPD is coming and we can't prevent
> them under our sight.
> So I hope we can change the flag name or fix above code and comment
> out __GFP_NO_KSWAPD
> 

I can't think of a better name but we can at least improve the comment.

> /*
>  * __GFP_NO_KSWAPD is very VM internal flag so Please don't use it
> without allowing mm guys
>  *
> #define __GFP_NO_KSWAPD xxxx
> 
> >
> >                /*
> >                 * High-order allocations do not necessarily loop after
> >                 * direct reclaim and reclaim/compaction depends on
> >                 * compaction being called after reclaim so call directly if
> >                 * necessary
> >                 */
> >                page = __alloc_pages_direct_compact(gfp_mask, order,
> >                                        zonelist, high_zoneidx,
> >                                        nodemask,
> >                                        alloc_flags, preferred_zone,
> >                                        migratetype, &did_some_progress,
> >                                        sync_migration);
> >
> > __GFP_NORETRY is used in a bunch of places and while the most
> > of them are not high-order, some of them potentially are like in
> > sound/core/memalloc.c. Using __GFP_NO_KSWAPD as the flag allows
> > these callers to continue using sync compaction.  It could be argued
> 
> Okay. If I was biased first, I have opposed this comment because they
> might think __GFP_NORETRY is very latency sensitive.

As __GFP_NORETRY uses direct reclaim, I do not think users expect it
to be latency sensitive. If they were latency sensitive, they would
mask out __GFP_WAIT.

> So they wanted allocation is very fast without any writeback/retrial.
> In view point, __GFP_NORETRY isn't bad, I think.
> 
> Having said that, I was biased latter, as I said earlier.
> 
> > that they would prefer __GFP_NORETRY but the potential side-effects
> > should be taken should be taken into account and the comment updated
> 
> Considering side-effect, your patch is okay.
> But I can't understand you mentioned "the comment updated if that
> happens" sentence. :(
> 

I meant the comment above sync_migration = !(gfp_mask & __GFP_NO_KSWAPD) should
be updated if it changes to __GFP_NORETRY.

I updated the commentary a bit. Does this help?

==== CUT HERE ====
mm: Do not stall in synchronous compaction for THP allocations

Changelog since V1
  o Update changelog with more detail
  o Add comment on __GFP_NO_KSWAPD

Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this
have been intermittent and not reliable to reproduce locally but;

Andy Isaacson reported a problem copying to VFAT on SD Card
	https://lkml.org/lkml/2011/11/7/2

	In this case, it was stuck in munmap for betwen 20 and 60
	seconds in compaction. It is also possible that khugepaged
	was holding mmap_sem on this process if CONFIG_NUMA was set.

Johannes Weiner reported stalls on USB
	https://lkml.org/lkml/2011/7/25/378

	In this case, there is no stack trace but it looks like the
	same problem. The USB stick may have been using NTFS as a
	filesystem based on other work done related to writing back
	to USB around the same time.

Internally in SUSE, I received a bug report related to stalls in firefox
	when using Java and Flash heavily while copying from NFS
	to VFAT on USB. It has not been confirmed to be the same problem
	but if it looks like a duck and quacks like a duck.....

In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] forced that sync compaction
would never be used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync
migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
it was beneficial.

While user-visible stalls do not happen for me when writing to USB,
I setup a test running postmark while short-lived processes created
anonymous mapping. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack strace and did some
analysis to aggregate unique "stall events" which revealed

Time stalled in this event:    47369 ms
Event count:                      20
usemem               sleep_on_page          3690 ms
usemem               sleep_on_page          2148 ms
usemem               sleep_on_page          1534 ms
usemem               sleep_on_page          1518 ms
usemem               sleep_on_page          1225 ms
usemem               sleep_on_page          2205 ms
usemem               sleep_on_page          2399 ms
usemem               sleep_on_page          2398 ms
usemem               sleep_on_page          3760 ms
usemem               sleep_on_page          1861 ms
usemem               sleep_on_page          2948 ms
usemem               sleep_on_page          1515 ms
usemem               sleep_on_page          1386 ms
usemem               sleep_on_page          1882 ms
usemem               sleep_on_page          1850 ms
usemem               sleep_on_page          3715 ms
usemem               sleep_on_page          3716 ms
usemem               sleep_on_page          4846 ms
usemem               sleep_on_page          1306 ms
usemem               sleep_on_page          1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30

The stall times are approximate at best but the estimates represent 25%
of the worst stalls and even if the estimates are off by a factor of
10, it's severe.

This patch once again prevents sync migration for transparent
hugepage allocations as it is preferable to fail a THP allocation
than stall.

It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD
to look less like a special case. This would prevent THP allocation
using sync compaction but it would have other side-effects. There are
existing users of __GFP_NORETRY that are doing high-order allocations
and while they can handle allocation failure, it seems reasonable that
they continue to use sync compaction unless there is a deliberate
reason to change that. To help clarify this for the future, this
patch updates the comment for __GFP_NO_KSWAPD.

If accepted, this is a -stable candidate.

Reported-by: Andy Isaacson <adi@...apodia.org>
Reported-and-tested-by: Johannes Weiner <hannes@...xchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@...hat.com>
Signed-off-by: Mel Gorman <mgorman@...e.de>
---
 include/linux/gfp.h |    8 ++++++++
 mm/page_alloc.c     |    9 ++++++++-
 2 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..4470a3c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -83,7 +83,15 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
+/*
+ * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
+ * over excessive disruption of the system. Currently this means
+ * 1. Do not wake kswapd (hence the flag name)
+ * 2. Do not use stall in synchronous compaction for high-order allocations
+ *    as this may cause the caller to stall writing out pages
+ */
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
+
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e8ecb6..a3b5da6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2161,7 +2161,14 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
+	 * the system should not be heavily disrupted. In practice, this is
+	 * to avoid THP callers being stalled in writeback during migration
+	 * as it's preferable for the the allocations to fail than to stall
+	 */
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/