linux-kernel - Re: process 'stuck' at exit.

Message-ID: <20131211010224.GN11295@suse.de>
Date:	Wed, 11 Dec 2013 01:02:24 +0000
From:	Mel Gorman <mgorman@...e.de>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Dave Jones <davej@...hat.com>,
	Darren Hart <dvhart@...ux.intel.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: process 'stuck' at exit.

On Tue, Dec 10, 2013 at 08:18:29PM +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
> 
> > Hmm. Looks like the futex code is somehow stuck in a loop, calling
> > get_user_pages_fast().
> > 
> > The futex code itself is apparently so low-overhead that it doesn't
> > show up in your 'perf top' report (which is dominated by all the
> > expensive debug things that get_user_pages_fast() etc ends up doing),
> > but that's the only looping I can see. Perhaps the "goto again" case
> > for transparent huge pages in get_futex_key()? Or the
> 
> Cc'ng more folks on that.
> 

I just saw this before heading to bed and have not read the thread. I'll
read it in the morning, but in the meantime the following might ring a bell
for someone else's investigation or for someone more familiar with how
futexes work from end to end.

Was NUMA balancing enabled and was this a NUMA machine?
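
For anyone wanting to answer that on a running system, here is a minimal sketch. It assumes the kernel exposes the `kernel.numa_balancing` sysctl under `/proc/sys/kernel/numa_balancing` (only present on kernels built with automatic NUMA balancing) and lists NUMA nodes as `nodeN` directories under `/sys/devices/system/node`. The function names are illustrative, not part of any existing tool.

```python
import glob
import os


def numa_node_count(node_dir="/sys/devices/system/node"):
    """Count nodeN directories; returns 0 if the directory is absent."""
    pattern = os.path.join(node_dir, "node[0-9]*")
    return len([p for p in glob.glob(pattern) if os.path.isdir(p)])


def numa_balancing_enabled(sysctl_path="/proc/sys/kernel/numa_balancing"):
    """Return True/False, or None if the kernel does not expose the knob."""
    try:
        with open(sysctl_path) as f:
            # Any non-zero value is treated as "enabled" in this sketch.
            return f.read().strip() != "0"
    except OSError:
        return None


if __name__ == "__main__":
    print("NUMA nodes:", numa_node_count())
    print("automatic NUMA balancing:", numa_balancing_enabled())
```

More than one node together with balancing enabled would make the scenario described below at least possible; a single node or a missing sysctl would rule it out.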

I ask because of these two patches that are currently in flight:

  mm: numa: Serialise parallel get_user_page against THP migration
  mm: fix TLB flush race between migration, and change_protection_range

There are related patches, but these two are the most important for what
I have in mind. In combination they address a problem whereby a write
from one thread can be lost due to a THP migration, but it is specific to
automatic NUMA balancing. If the lost update was to a page containing a
futex, then the lost write could confuse waiters. The downside is that this
is a bad fit for the problem description in the first mail. A lost update
might result in processes waiting forever on a value that never changes,
but offhand it's less clear why it might result in a loop. Unless, of
course, there is a combination of events that allows for a busy wait on a
value that will never change due to the lost write.
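
To make the distinction between "waits forever" and "loops" concrete, here is a toy model in Python. It makes no real futex calls and is not a description of the kernel code. The one fact it relies on is that FUTEX_WAIT only sleeps if the futex word still equals the expected value, and otherwise returns EAGAIN. Everything else is an assumption for illustration, in particular the idea that the waiter and the kernel could end up seeing different values of the word after a lost write. That divergence is purely hypothetical here.

```python
LOCKED, UNLOCKED = 1, 0


def futex_wait(kernel_value, expected):
    """Toy FUTEX_WAIT: sleep only if the kernel still sees `expected`."""
    return "sleep" if kernel_value == expected else "EAGAIN"


def waiter_outcome(user_value, kernel_value, max_retries=1000):
    """Classify what a simple lock waiter does given the two views.

    user_value:   what the waiter reads from the futex word (hypothetical)
    kernel_value: what the kernel compares against (hypothetical)
    Neither view changes in this model, i.e. the unlock write is
    assumed to be gone for good.
    """
    for _ in range(max_retries):
        if user_value == UNLOCKED:
            return "acquired"
        if futex_wait(kernel_value, expected=user_value) == "sleep":
            # Nobody will write the word again, so no wakeup helps.
            return "blocked"
        # EAGAIN: re-read the word and retry.
    return "busy loop"


print(waiter_outcome(UNLOCKED, UNLOCKED))  # unlock write was not lost
print(waiter_outcome(LOCKED, LOCKED))      # lost write, consistent views
print(waiter_outcome(LOCKED, UNLOCKED))    # lost write, divergent views
```

In this model a lost write with consistent views leaves the waiter asleep, which matches the "waiting forever" case. Only the divergent case spins, and whether anything like it can happen in practice is exactly the open question.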

-- 
Mel Gorman
SUSE Labs