Message-ID: <alpine.DEB.2.02.1312102157230.28330@ionos.tec.linutronix.de>
Date: Tue, 10 Dec 2013 22:17:22 +0100 (CET)
From: Thomas Gleixner <tglx@...utronix.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
cc: Dave Jones <davej@...hat.com>,
Darren Hart <dvhart@...ux.intel.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Mel Gorman <mgorman@...e.de>, Oleg Nesterov <oleg@...hat.com>
Subject: Re: process 'stuck' at exit.
On Tue, 10 Dec 2013, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> >
> > The -EAGAIN is when the user value changed, simplified:
>
> No it's not.
>
> Thomas, stop this crap already. Look at the f*cking code carefully
> instead of just dismissing cases.
>
> The worrisome EAGAIN case is
>
>     futex_requeue
>       futex_proxy_trylock_atomic
>         futex_lock_pi_atomic
>           lookup_pi_state:
>             ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
>
> and now futex_requeue() will do "goto repeat" for that EAGAIN case.
>
> So, Christ, Thomas, you have now *twice* dismissed a real concern with
> totally bogus "that can never happen" by explaining some totally
> unrelated *simple* case rather than the much more complex case.
>
> So please. Really. Truly look at the code and think about it, or shut
> the f*ck up. No more of this shit where you glance at the code, find
> some simple case, and say "that can't happen", and dismiss the
> bug-report.

Well, I spent a fricking long time working on that code and finding the
most absurd bugs in it, and I'm well aware of the exit issue.

> So far Dave's bug-reports have generally pretty much universally shown
> real bugs. Being dismissive about it is not helpful, quite the
> reverse.

I know, and I have used that information more than once to carefully
track down the real cause. I never dismissed a single report.

> Maybe the loop I'm pointing at cannot happen, but *your* explanation
> for why it couldn't happen was pure and utter garbage, and was clearly
> because you hadn't even bothered to look at all the cases.

I may have been sloppy in not really explaining why I think that the
requeue_pi exit case is unlikely.

For that loop to happen, the following is required:

1) An actual user of requeue_pi, which can only be the fuzzer, as glibc
   does not support it. But Dave's last fuzzed syscall was something
   else.

2) The report said that the last fuzzing test was already done and
   the fuzzer app was about to exit.

   That involves futex_requeue from deep inside the glibc thread
   library. And that is NOT using REQUEUE_PI.

3) So even IF it were using requeue_pi, this would require that the
   outer PI lock is already held by some other task which is
   already in the process of exiting. IOW, this lock would have to be
   held by something which did not release it before exit, already set
   PF_EXITING in exit_signals(), and then got stuck forever before
   reaching: tsk->flags |= PF_EXITPIDONE;
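
To make the condition in 3) concrete, here is a rough userspace sketch, not the kernel code. The flag values, struct task, owner_exit_state() and requeue_retry() are all made up for illustration; only the -ESRCH/-EAGAIN expression is taken from the quoted lookup_pi_state() line. The retry is bounded here so the sketch terminates, and the check on PF_EXITING is an assumption based on the description above.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag values only; the kernel's real values differ. */
#define PF_EXITING    0x1u
#define PF_EXITPIDONE 0x2u

/* Hypothetical stand-in for the lock owner's task state. */
struct task {
    unsigned int flags;
};

/* Mirrors the quoted expression for an owner that has started exiting. */
static int owner_exit_state(const struct task *p)
{
    if (!(p->flags & PF_EXITING))
        return 0; /* owner is not exiting, nothing to retry */
    return (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
}

/*
 * Hypothetical bounded version of the "goto repeat" retry.
 * exit_done_at is the attempt index at which the owner finally sets
 * PF_EXITPIDONE; pass -1 to model an owner that never gets there.
 * The number of attempts made is stored in *tries.
 */
static int requeue_retry(struct task *owner, int max_tries,
                         int exit_done_at, int *tries)
{
    int ret;

    assert(max_tries > 0);
    *tries = 0;
    do {
        if (*tries == exit_done_at)
            owner->flags |= PF_EXITPIDONE;
        ret = owner_exit_state(owner);
        (*tries)++;
    } while (ret == -EAGAIN && *tries < max_tries);

    return ret;
}
```

With an owner that sets PF_EXITING and never reaches PF_EXITPIDONE, every attempt in this sketch returns -EAGAIN until the bound is hit; once PF_EXITPIDONE is set, the next attempt returns -ESRCH and the retry ends.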

I still think that it is highly unlikely, but to make sure, I had
already asked Dave, before reading your rant, to fire up the tracer, so
we know for sure where the hell the thing is looping.

Thanks,
tglx