linux-ext4 - Re: Ext4 jbd2 state lock race condition

Message-ID: <20160326152129.GC4539@thunk.org>
Date:	Sat, 26 Mar 2016 11:21:29 -0400
From:	Theodore Ts'o <tytso@....edu>
To:	Da-Chang Guan <dcguan@...il.com>
Cc:	linux-ext4@...r.kernel.org
Subject: Re: Ext4 jbd2 state lock race condition

On Fri, Mar 25, 2016 at 06:32:47PM +0800, Da-Chang Guan wrote:
> Hi, all,
> 
>   We have a 4-core Android device that has a system hang issue. The
> stack trace shows the hang may be caused by a race on the jbd2 state lock.

So this is an ancient kernel (3.7.2).  It's not even a stable kernel,
and in fact starting this year I stopped caring about 3.10 kernels:
while it is disgraceful that we are shipping phones in 2016 using
kernels dating from 2013, there are no mobile devices I care about
that will be using anything older than 3.18 going forward.

So just to set your expectations, as upstream developers we generally
only support the latest upstream kernels.  Because I've been doing
some work to add ext4 encryption support into Android, for a while I
suffered having to support 3.10 based device kernels.  At this point,
though, I personally have little or no interest in kernels older than
3.18.

In terms of trying to debug this, if you can reproduce the bug, you'll
be in much better shape.  Also, if you have a serial console and
CONFIG_MAGIC_SYSRQ is enabled, I'd suggest getting stack traces of all
the CPUs so you can see who else might be holding the lock.  If you
can't reproduce the problem, and you can't get the stack traces for
all the CPUs using the magic sysrq, I doubt there's much that can be
done to debug the problem.
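
As a rough illustration of the sysrq suggestion above: if a root shell
is still responsive when the hang occurs, the same dumps can be
requested by writing a command letter to /proc/sysrq-trigger, and the
output then appears in the kernel log or on the serial console.  The
sketch below is a minimal one under that assumption; the helper names
send_sysrq and dump_lock_holders are hypothetical, and on a fully hung
system you would instead send a break followed by the letter over the
serial console.

```python
# Minimal sketch: request kernel stack dumps through the magic SysRq
# interface. Requires root and a kernel built with CONFIG_MAGIC_SYSRQ.
# The function names here are illustrative, not part of any library.

SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# A few single-letter SysRq commands relevant to lock debugging.
SYSRQ_COMMANDS = {
    "l": "backtrace of all active CPUs",
    "w": "dump tasks in uninterruptible (blocked) state",
    "t": "dump all tasks",
    "d": "show all locks held (needs lockdep)",
}


def send_sysrq(key, trigger_path=SYSRQ_TRIGGER):
    """Write one SysRq command letter to the trigger file.

    The kernel prints the result to its log, not to this process.
    trigger_path is a parameter only so the helper can be exercised
    against an ordinary file.
    """
    if key not in SYSRQ_COMMANDS:
        raise ValueError("unsupported SysRq key: %r" % (key,))
    with open(trigger_path, "w") as f:
        f.write(key)
    return SYSRQ_COMMANDS[key]


def dump_lock_holders(trigger_path=SYSRQ_TRIGGER):
    """Ask for per-CPU backtraces, then the list of blocked tasks."""
    return [send_sysrq(key, trigger_path) for key in ("l", "w")]
```

Calling dump_lock_holders() as root and then reading the kernel log
(for example with dmesg) should show which CPU is sitting inside the
locked region and which tasks are blocked waiting on it.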

May I suggest upgrading to at least 3.18, preferably the latest stable
kernel, which as of this writing is 3.18.29?  I am running regression
tests on 3.18, and making sure that critical bug fixes are getting
backported to 4.4 and 3.18.  (With 3.14 and 3.10 happening if I have
time and if it's not too difficult, but starting this year, those two
kernels are much lower priority for me.)

Best regards,

>    We want to know who acquires the lock at that time so we can fix
> it.  But we don't even know how to start debugging.

If you can reproduce the problem, using CONFIG_LOCKDEP will be very
helpful.  Also perhaps useful would be to build 3.7.2 on x86, and then
use xfstests to try to flush out bugs.  I'm sure you will find them ---
when I first started testing a 3.10-based msm kernel, I was able to
trivially trigger kernel crashes using kvm-xfstests.  I think you'll
find it is much easier to find the bugs on x86, and then fix up the
kernel so it's not crashing there, and then see if that addresses your
problem on arm, because there is a much more powerful testing
infrastructure you can use for x86.  See:

	       http://thunk.org/gce-xfstests

If you can upgrade to a non-antique kernel, though, I think you'll
save yourself much more time.  It may be that using kvm-xfstests or
gce-xfstests to demonstrate how unstable 3.7.2 is might be helpful in
persuading your management to let you upgrade to something a bit more
recent.

Cheers,

						- Ted