linux-ext4 - Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1 kernel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160621101156.GA6299@quack2.suse.cz>
Date:	Tue, 21 Jun 2016 12:11:56 +0200
From:	Jan Kara <jack@...e.cz>
To:	Nikola Pajkovsky <nikola.pajkovsky@...ddata.com>
Cc:	Jan Kara <jack@...e.cz>,
	Holger Hoffstätte 
	<holger@...lied-asynchrony.com>, linux-ext4@...r.kernel.org,
	Jan Kara <jack@...e.com>
Subject: Re: xfstests generic/130 hang with non-4k block size ext4 on 4.7-rc1
 kernel

On Mon 20-06-16 14:59:57, Nikola Pajkovsky wrote:
> Jan Kara <jack@...e.cz> writes:
> > On Thu 16-06-16 16:42:58, Nikola Pajkovsky wrote:
> >> Jan Kara <jack@...e.cz> writes:
> >> 
> >> > On Fri 10-06-16 07:52:56, Nikola Pajkovsky wrote:
> >> >> Jan Kara <jack@...e.cz> writes:
> >> >> > On Thu 09-06-16 09:23:29, Nikola Pajkovsky wrote:
> >> >> >> Holger Hoffstätte <holger@...lied-asynchrony.com> writes:
> >> >> >> 
> >> >> >> > On Wed, 08 Jun 2016 14:56:31 +0200, Jan Kara wrote:
> >> >> >> > (snip)
> >> >> >> >> Attached patch fixes the issue for me. I'll submit it once a full xfstests
> >> >> >> >> run finishes for it (which may take a while as our server room is currently
> >> >> >> >> moving to a different place).
> >> >> >> >> 
> >> >> >> >> 								Honza
> >> >> >> >> -- 
> >> >> >> >> Jan Kara <jack@...e.com>
> >> >> >> >> SUSE Labs, CR
> >> >> >> >> From 3a120841a5d9a6c42bf196389467e9e663cf1cf8 Mon Sep 17 00:00:00 2001
> >> >> >> >> From: Jan Kara <jack@...e.cz>
> >> >> >> >> Date: Wed, 8 Jun 2016 10:01:45 +0200
> >> >> >> >> Subject: [PATCH] ext4: Fix deadlock during page writeback
> >> >> >> >> 
> >> >> >> >> Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
> >> >> >> >> deadlock in ext4_writepages() which was previously much harder to hit.
> >> >> >> >> After this commit xfstest generic/130 reproduces the deadlock on small
> >> >> >> >> filesystems.
> >> >> >> >
> >> >> >> > Since you marked this for -stable, just a heads-up that the previous patch
> >> >> >> > for the data exposure was rejected from -stable (see [1]) because it
> >> >> >> > has the mismatching "!IS_NOQUOTA(inode) &&" line, which didn't exist
> >> >> >> > until 4.6. I removed it locally but Greg probably wants an official patch.
> >> >> >> >
> >> >> >> > So both this and the previous patch need to be submitted.
> >> >> >> >
> >> >> >> > [1] http://permalink.gmane.org/gmane.linux.kernel.stable/18074{4,5,6}
> >> >> >> 
> >> >> >> I'm just wondering if the Jan's patch is not related to blocked
> >> >> >> processes in following trace. It very hard to hit it and I don't have
> >> >> >> any reproducer.
> >> >> >
> >> >> > This looks like a different issue. Does the machine recover itself or is it
> >> >> > a hard hang and you have to press a reset button?
> >> >> 
> >> >> The machine is bit bigger than I have pretend. It's 18 vcpu with 160 GB
> >> >> ram and machine has dedicated mount point only for PostgreSQL data.
> >> >> 
> >> >> Nevertheless, I was able always to ssh to the machine, so machine itself
> >> >> was not in hard hang and ext4 mostly gets recover by itself (it took
> >> >> 30min). But I have seen situation, were every process who 'touch' the ext4
> >> >> goes immediately to D state and does not recover even after hour.
> >> >
> >> > If such situation happens, can you run 'echo w >/proc/sysrq-trigger' to
> >> > dump stuck processes and also run 'iostat -x 1' for a while to see how much
> >> > IO is happening in the system? That should tell us more.
> >> 
> >> 
> >> Link to 'echo w >/proc/sysrq-trigger' is here, because it's bit bigger
> >> to mail it.
> >> 
> >>    http://expirebox.com/download/68c26e396feb8c9abb0485f857ccea3a.html
> >
> > Can you upload it again please? I've got to looking at the file only today
> > and it is already deleted. Thanks!
> 
> http://expirebox.com/download/c010e712e55938435c446cdc01a0b523.html

OK, I had a look into the traces and JBD2 thread just waits for the buffers
is has submitted for IO to complete. The rest is just blocked on that. From
the message "INFO: task jbd2/vdc-8:4710 blocked for more than 120 seconds.
severity=err" we can see that the JBD2 process has been waiting for a
significant amount of time. Now the question is why it takes so long for
the IO to complete - likely not a fs problem but somewhere below - block
layer or the storage itself.

What is the underlying storage? And what IO scheduler do you use? Seeing
that the device is 'vdc' - that suggests you are running in a guest - is
there anything interesting happening on the host at that moment? Is IO from
other guests / the host stalled at that moment as well?

> >> I was running iotop and there was traffic roughly ~20 KB/s write.
> >> 
> >> What was bit more interesting, was looking at
> >> 
> >>    cat /proc/vmstat | egrep "nr_dirty|nr_writeback"
> >> 
> >> nr_drity had around 240 and was slowly counting up, but nr_writeback had
> >> ~8800 and was stuck for 120s.
> >
> > Hum, interesting. This would suggest like IO completion got stuck for some
> > reason. We'll see more from the stacktraces hopefully.
> 
> I have monitor /sys/kernel/debug/bdi/253:32/stats for 10 mins per 1 sec.
> Values are all same as follows:
> 
>     --[ Sun Jun 19 06:11:08 CEST 2016
>     BdiWriteback:            15840 kB
>     BdiReclaimable:          32320 kB
>     BdiDirtyThresh:              0 kB
>     DirtyThresh:           1048576 kB
>     BackgroundThresh:       131072 kB
>     BdiDirtied:         6131163680 kB
>     BdiWritten:         6130214880 kB
>     BdiWriteBandwidth:      324948 kBps
>     b_dirty:                     2
>     b_io:                        3
>     b_more_io:                   0
>     bdi_list:                    1
>     state:                       c

OK, so all the IO looks stalled for that period of it.

> Maybe those values can cause issue and kicks in writeback to often and
> block everyone else.
> 
>    $ sysctl -a | grep dirty | grep -v ratio
>    vm.dirty_background_bytes = 134217728
>    vm.dirty_bytes = 1073741824
>    vm.dirty_expire_centisecs = 1500
>    vm.dirty_writeback_centisecs = 500

This looks healthy.

> I even have output of command, if you're interested.
> 
>   $ trace-cmd record -e ext4 -e jbd2 -e writeback -e block sleep 600

Traces from block layer may be interesting but you'd need the trace started
before the hang starts so that you see what happened with the IO that
jbd2/vdc-8:4710 is waiting for.

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html