Date:   Fri, 21 Oct 2016 18:15:51 -0400
From:   Theodore Ts'o <tytso@....edu>
To:     Jens Axboe <axboe@...com>
Cc:     Dave Chinner <david@...morbit.com>, linux-ext4@...r.kernel.org,
        fstests@...r.kernel.org, tarasov@...ily.name
Subject: Re: Test generic/299 stalling forever

On Thu, Oct 20, 2016 at 08:22:00AM -0600, Jens Axboe wrote:
> > So what's happening is that generic/299 is looping in the
> > fallocate/truncate loop until fio exits, but since fio never exits, so
> > it ends up looping forever.
> 
> I'm setting up the GCE now, I've had the tests running for about 24h now
> on another test box and haven't been able to trigger any hangs. I'll
> match your setup as closely as I can, hopefully that'll work.

Any luck reproducing the problem?

On Wed, Oct 19, 2016 at 08:06:44AM -0600, Jens Axboe wrote:
>
> I'll take a look today. I agree, this definitely looks like a fio
> bug. But not related to the mutex issue for the stat part, all verifier
> threads are waiting to be woken up, but the main thread is done.
>

I was taking a closer look at this, and it does look like it's related
to the stat_mutex.  The main thread (according to gdb) seems to be
stuck in this loop in backend.c line 1738 (in thread_main):

		do {
			check_update_rusage(td);
			if (!fio_mutex_down_trylock(stat_mutex))
				break;
			usleep(1000);   <----- line 1738
		} while (1);

So it looks like it's not able to grab the stat_mutex.  But I can't
figure out how the stat_mutex could be down.  None of the stack
traces seem to show that, and I've looked at all of the places where
stat_mutex is taken, and it doesn't look like stat_mutex should ever
be down for more than, say, a second?

So as a temporary workaround, I'm considering adding a check to see if
we stay stuck in this loop for more than a thousand iterations, and if
so, print an error to stderr and then call _exit(1), or maybe just
break out two levels by jumping to line 1778 at
"td_set_runstate(td, TD_FINISHING)" and just give up on the usage
statistics (since for xfstests we really don't care about the usage
stats).
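
For concreteness, here is a rough, untested sketch of the bounded
version of that loop.  It is not fio code: trylock_bounded(),
fake_trylock() and STAT_MUTEX_MAX_TRIES are names made up for
illustration, and the stub only mimics the convention the loop above
relies on, namely that the trylock returns 0 once the mutex has been
taken.  What the caller does on failure (_exit(1), or skip the usage
stats) is left open.

```c
#define _DEFAULT_SOURCE		/* for usleep() under strict -std= modes */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical limit: about one second at 1ms per attempt. */
#define STAT_MUTEX_MAX_TRIES 1000

/* Same convention as the loop above: 0 means "got the lock". */
typedef int (*trylock_fn)(void *lock);

/*
 * Try to take the lock at most max_tries times, sleeping delay_us
 * between attempts.  Returns true if the lock was taken, false if we
 * gave up; the caller decides how to bail out.
 */
static bool trylock_bounded(trylock_fn trylock, void *lock,
			    int max_tries, unsigned int delay_us)
{
	for (int tries = 0; tries < max_tries; tries++) {
		if (!trylock(lock))
			return true;
		usleep(delay_us);
	}
	fprintf(stderr, "giving up on lock after %d tries\n", max_tries);
	return false;
}

/*
 * Demo stub standing in for the real trylock: the "lock" is an int
 * counting how many more attempts will fail before it succeeds.
 */
static int fake_trylock(void *lock)
{
	int *busy_left = lock;

	if (*busy_left > 0) {
		(*busy_left)--;
		return 1;	/* still held by someone else */
	}
	return 0;		/* acquired */
}
```

In thread_main the call would look roughly like
trylock_bounded(..., STAT_MUTEX_MAX_TRIES, 1000), with a false return
taking one of the two bail-out paths described above.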

					- Ted

P.S.  I can't see any way this could be happening other than perhaps a
pointer error that corrupted stat_mutex.  I can't see any way a thread
could leave stat_mutex down.  WDYT?
