linux-ext4 - Re: [BUG] fstests generic/388 is failing with fsck errors

Message-ID: <20180807163302.GA28888@thunk.org>
Date:   Tue, 7 Aug 2018 12:33:02 -0400
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Josef Bacik <josef@...icpanda.com>
Cc:     linux-ext4@...r.kernel.org, kernel-team@...com
Subject: Re: [BUG] fstests generic/388 is failing with fsck errors

On Tue, Aug 07, 2018 at 09:57:45AM -0400, Josef Bacik wrote:
> Hello,
> 
> Our automated test infrastructure has been throwing errors while running
> generic/388 on upstream kernels for a little bit.  We run the same tests on our
> internal kernels and it doesn't fail on any of our variations, the most recent
> of which is 4.16 based, so it's relatively new.  I reproduced locally to make
> sure it wasn't a fluke, and it took 3 runs but I hit it as well.

It's been around ever since we added FS_IOC_SHUTDOWN support.
You might not have noticed it because it's a race which very much
depends on the speed of the device.  generic/388 runs a loop (N
times) that starts fsstress, forces a shutdown, and then runs fsck.
Sometimes the file system ends up corrupted.
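
As a rough illustration of the loop described above (not the actual
generic/388 script, which is a shell test in xfstests), here is a
minimal Python sketch.  The function name and the three callables are
hypothetical placeholders for "start the stress load", "force the
shutdown", and "run fsck"; the point is only that each iteration is an
independent chance to hit the race, so a run may pass several times
before it fails.

```python
from typing import Callable, List


def run_shutdown_loop(
    iterations: int,
    stress: Callable[[], None],
    shutdown: Callable[[], None],
    fsck: Callable[[], bool],
) -> List[int]:
    """Return the iteration indices at which fsck reported errors.

    All three callables are hypothetical stand-ins supplied by the
    caller; nothing here talks to a real file system.
    """
    failures = []
    for i in range(iterations):
        stress()        # put the file system under load
        shutdown()      # force it down while I/O is still in flight
        if not fsck():  # fsck() returns True when the image is clean
            failures.append(i)
    return failures
```

Because the outcome depends on timing, the list of failing iterations
will typically differ from run to run and from device to device.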

It's on my todo list to fix, but the original use case was for scratch
file systems that are mounted over a remote block device like iSCSI, and
if for some reason the iSCSI server stops responding, we are using the
shutdown ioctl to take down the mount more quickly.  Since it's for a
scratch file system where the iSCSI device is ephemeral (and by the
time we shut it down, it's toast), the question of whether the file
system will be consistent afterwards really doesn't matter.

Also, the obvious fixes would destroy ext4's scalability, and I'm not
aware of anyone except for us at Google using the shutdown ioctl in
production (at least not for ext4, and I doubt it's commonly used for
most file systems), so it's been low priority for me to really set
aside time to tackle.

> I'm not sure where it got introduced, I'm running a bisect now to try and figure
> out where it happened but I wanted to let you know ASAP.  Thanks,

I just want to save you some time when I say --- don't bother.  The
failure was known when the shutdown code was first added, and I have
records of it failing going back to 4.10.  The race doesn't always
trigger, so trying to bisect it will probably lead to a lot of
frustration.

Cheers,

						- Ted