Message-ID: <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>
Date: Fri, 18 Aug 2017 13:23:39 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: "Liang, Kan" <kan.liang@...el.com>, Mel Gorman <mgorman@...e.de>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...e.hu>, Andi Kleen <ak@...ux.intel.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>, Jan Kara <jack@...e.cz>,
linux-mm <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk

On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <kan.liang@...el.com> wrote:
> >
> > Here is the call stack of wait_on_page_bit_common
> > when the queue is long (entries >1000).
> >
> > # Overhead Trace output
> > # ........ ..................
> > #
> > 100.00% (ffffffff931aefca)
> > |
> > ---wait_on_page_bit
> > __migration_entry_wait
> > migration_entry_wait
> > do_swap_page
> > __handle_mm_fault
> > handle_mm_fault
> > __do_page_fault
> > do_page_fault
> > page_fault
>
> Hmm. Ok, so it does seem to very much be related to migration. Your
> wake_up_page_bit() profile made me suspect that, but this one seems to
> pretty much confirm it.
>
> So it looks like that wait_on_page_locked() thing in
> __migration_entry_wait(), and what probably happens is that your load
> ends up triggering a lot of migration (or just migration of a very hot
> page), and then *every* thread ends up waiting for whatever page that
> ended up getting migrated.
>
Agreed.
> And so the wait queue for that page grows hugely long.
>
It's basically only bounded by the maximum number of threads that can exist.
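
To put a number on what that walk costs: everything queued on the page's
hashed waitqueue gets woken in one linear pass under the queue spinlock
with interrupts disabled. A toy userspace model of just the walk follows
(illustrative names only, this is not the kernel's wait_queue_entry
machinery):

/*
 * Toy model: N waiters queued on one page's wait queue, all woken in
 * a single pass, as wake_up_page_bit() must do. The list walk is
 * trivially O(N); the kernel's problem is that the equivalent walk
 * happens under a spinlock that every other waiter and waker on the
 * same hash bucket also contends on.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct waiter {
	struct waiter *next;
	int woken;
};

static long wake_all(struct waiter *head)
{
	struct waiter *w;
	long walked = 0;

	for (w = head; w; w = w->next) {
		w->woken = 1;		/* stand-in for try_to_wake_up() */
		walked++;
	}
	return walked;
}

int main(void)
{
	long n;

	for (n = 1000; n <= 1000000; n *= 10) {
		struct waiter *head = NULL;
		struct timespec t0, t1;
		long i, walked;

		for (i = 0; i < n; i++) {
			struct waiter *w = malloc(sizeof(*w));

			w->woken = 0;
			w->next = head;
			head = w;
		}

		clock_gettime(CLOCK_MONOTONIC, &t0);
		walked = wake_all(head);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%8ld waiters walked in %ld ns\n", walked,
		       (t1.tv_sec - t0.tv_sec) * 1000000000L +
		       (t1.tv_nsec - t0.tv_nsec));

		while (head) {
			struct waiter *next = head->next;

			free(head);
			head = next;
		}
	}
	return 0;
}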
> Looking at the other profile, the thing that is locking the page (that
> everybody then ends up waiting on) would seem to be
> migrate_misplaced_transhuge_page(), so this is _presumably_ due to
> NUMA balancing.
>
Yes, migrate_misplaced_transhuge_page requires NUMA balancing to be part
of the picture.
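
As an aside, NUMA balancing can be flipped off at runtime via the
kernel.numa_balancing sysctl (assuming CONFIG_NUMA_BALANCING=y), so the
experiment does not need a reboot. The sysctl itself is real; the helper
below is only for illustration if a shell one-liner is inconvenient in a
harness:

/*
 * Disable automatic NUMA balancing at runtime; equivalent to
 * echo 0 > /proc/sys/kernel/numa_balancing.
 * Needs root and CONFIG_NUMA_BALANCING=y.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

	if (!f) {
		perror("numa_balancing");
		return 1;
	}
	fputs("0\n", f);
	return fclose(f) != 0;
}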
> Does the problem go away if you disable the NUMA balancing code?
>
> Adding Mel and Kirill to the participants, just to make them aware of
> the issue, and just because their names show up when I look at blame.
>
I can't think of a way of dealing with this that would reliably detect
when there are a large number of waiters without adding a mess. We could
reduce the scanning rate to mitigate the problem, but that would be
difficult to target properly and wouldn't prevent the problem from
occurring, with the added hassle that it would now be intermittent.
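
For completeness, the scan-rate knobs in question are the
numa_balancing_* sysctls; lowering the rate means raising
numa_balancing_scan_period_min_ms, which slows hinting-fault (and hence
migration) generation for every task, so it cannot target the one hot
page. A quick dump of the knobs (the sysctl names are real; the defaults
in the comments are from memory and worth double-checking against
Documentation/sysctl/kernel.txt):

/* Print the current NUMA balancing scan-rate tunables. */
#include <stdio.h>

static const char *knobs[] = {
	"numa_balancing_scan_delay_ms",		/* default 1000 */
	"numa_balancing_scan_period_min_ms",	/* default 1000 */
	"numa_balancing_scan_period_max_ms",	/* default 60000 */
	"numa_balancing_scan_size_mb",		/* default 256 */
};

int main(void)
{
	unsigned int i;

	for (i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
		char path[128], buf[64];
		FILE *f;

		snprintf(path, sizeof(path), "/proc/sys/kernel/%s",
			 knobs[i]);
		f = fopen(path, "r");
		if (!f)
			continue;	/* knob absent on this kernel */
		if (fgets(buf, sizeof(buf), f))
			printf("%s = %s", knobs[i], buf);
		fclose(f);
	}
	return 0;
}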
Assuming the problem goes away when NUMA balancing is disabled, it would
be nice to determine whether the page lock holder is trying to allocate
a page while the queue is huge. That is the part of the operation that
can potentially take a long time, and it may be why so many callers are
stacking up. If so, I would suggest clearing __GFP_DIRECT_RECLAIM from
the GFP flags in migrate_misplaced_transhuge_page and assuming that a
remote hit is always going to be cheaper than compacting memory to
successfully allocate a THP. That may be worth doing unconditionally,
because we'd have to save a *lot* of remote misses to offset the
compaction cost.

Nothing fancy is needed other than a comment, if it works.

diff --git a/mm/migrate.c b/mm/migrate.c
index 627671551873..87b0275ddcdb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1926,7 +1926,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
+		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
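
If it's tried, the effect should be visible in /proc/vmstat: with direct
reclaim masked off, a THP allocation that cannot be satisfied from free
pages on the target node fails immediately and is counted as
pgmigrate_fail instead of showing up as a long stall on the page lock. A
minimal watcher follows (the counter names are real vmstat fields;
reading pgmigrate_fail as "gave up without reclaim" is specific to this
patch):

/* Dump the migration and NUMA counters from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "pgmigrate_", 10) ||
		    !strncmp(line, "numa_", 5))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}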
--
Mel Gorman
SUSE Labs