[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131211010224.GN11295@suse.de>
Date: Wed, 11 Dec 2013 01:02:24 +0000
From: Mel Gorman <mgorman@...e.de>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Dave Jones <davej@...hat.com>,
Darren Hart <dvhart@...ux.intel.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: process 'stuck' at exit.
On Tue, Dec 10, 2013 at 08:18:29PM +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
>
> > Hmm. Looks like the futex code is somehow stuck in a loop, calling
> > get_user_pages_fast().
> >
> > The futex code itself is apparently so low-overhead that it doesn't
> > show up in your 'perf top' report (which is dominated by all the
> > expensive debug things that get_user_pages_fast() etc ends up doing),
> > but that's the only looping I can see. Perhaps the "goto again" case
> > for transparent huge pages in get_futex_key()? Or the
>
> Cc'ng more folks on that.
>
I just saw this before heading to bed and have not read the thread. I'll
read it in the morning but in the meantime the following might ring a bell
for someone elses investigation or someone more familiar with how futexs
work from end to end.
Was NUMA balancing enabled and was this a NUMA machine?
I ask because of these two patches that are currently in flight
mm: numa: Serialise parallel get_user_page against THP migration mm
fix TLB flush race between migration, and change_protection_range
There are related patches but these two are the most important for what
I have in mind. The two in combination address a problem whereby a write
from one thread can be lost due to a THP migration but it's specific to
automatic NUMA balancing. If the lost update was for a page containing a
futex then the lost write could confuse waiters. The downside is that this
is a bad fit for the problem description in the first mail. A lost update
might result in processes waiting forever on a value that never changes
but offhand it's less clear why it might result in a loop. Unless of
course there is a combination of events that allows for a busy wait on a
value that will never change due to the lost write.
--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists