[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1190668846.13955.15.camel@dyn9047017100.beaverton.ibm.com>
Date: Mon, 24 Sep 2007 14:20:45 -0700
From: Badari Pulavarty <pbadari@...il.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Andy Whitcroft <apw@...dowen.org>, sct@...hat.com,
Andrew Morton <akpm@...ux-foundation.org>,
adilger@...sterfs.com, ext4 <linux-ext4@...r.kernel.org>,
lkml <linux-kernel@...r.kernel.org>, mel@....ul.ie
Subject: Re: 2.6.23-rc6: hanging ext3 dbench tests
On Mon, 2007-09-24 at 13:04 -0700, Linus Torvalds wrote:
>
> On Mon, 24 Sep 2007, Badari Pulavarty wrote:
> >
> > Whats happening on my machine is ..
> >
> > dbench forks of 4 children and sends them a signal to start the work.
> > 3 out of 4 children gets the signal and does the work. One of the child
> > never gets the signal so, it waits forever in pause(). So, parent waits
> > for a longtime to kill it.
>
> Since this *seems* to have nothing to do with the filesystem, and since it
> *seems* to have been introduced between -rc3 and -rc4, I did
>
> gitk v2.6.23-rc3..v2.6.23-rc4 -- kernel/
I was wrong. I managed to reproduce on 2.6.23-rc3, but it took a long
time. But I never reproduced it on 2.6.22. Ran test for a day.
>
> to see what has changed. One of the commits was signal-related, and that
> one doesn't look like it could possibly matter.
>
> The rest were scheduler-related, which doesn't surprise me. In fact, even
> before I looked, my reaction to your bug report was "That sounds like an
> application race condition".
>
> Applications shouldn't use "pause()" for waiting for a signal. It's a
> fundamentally racy interface - the signal could have happened just
> *before* calling pause. So it's almost always a bug to use pause(), and
> any users should be fixed to use "sigsuspend()" instead, which can
> atomically (and correctly) pause for a signal while the process has masked
> it outside of the system call.
>
> Now, I took a look at the dbench sources, and I have to say that the race
> looks *very* unlikely (there's quite a small window in which it does
>
> children[i].status = getpid();
> ** race window here **
> pause();
>
> and it would require *just* the right timing so that the parent doesn't
> end up doing the "sleep(1)" (which would make the window even less likely
> to be hit), but there does seem to be a race condition there. And it
> *could* be that you just happen to hit it on your hw setup.
>
> So before you do anything else, does this patch (TOTALLY UNTESTED! DONE
> ENTIRELY LOOKING AT THE SOURCE! IT MAY RAPE ALL YOUR PETS, AND CALL YOU
> BAD NAMES!) make any difference?
>
> (patch against unmodified dbench-2.0)
I am testing the updated version of dbench now. Normally, it takes
30min-1hour to reproduce the problem (when I do infinite "dbench 4").
I will post the results soon.
Thanks,
Badari
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists