linux-kernel - Bug in fua code

Message-ID: <20161018045250.fco7bhcrapt2ai2f@kmo-pixel>
Date:   Mon, 17 Oct 2016 20:52:50 -0800
From:   Kent Overstreet <kent.overstreet@...il.com>
To:     Ming Lei <ming.lei@...onical.com>, Jens Axboe <axboe@...com>
Cc:     linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Bug in fua code

Ming,

I recently discovered a bug in the FUA code - a recent bcachefs change exposed
it - and my best guess is it's related to your recent changes to blk-flush.c.

What I'm seeing is that if all writes are issued as FUA writes, within a short
period of time the request queue gets stuck - writes are on the queue but they
aren't being issued or completed. This is with an AHCI device - so no blk-mq, and
it's emulating FUA with flushes.

You ought to be able to reproduce this yourself by changing
generic_make_request() to make all writes FUA, and then just doing O_DIRECT
writes with dd or something. I suspect that if there are non-FUA flushes being
issued they'll end up kicking the queue and keeping things from getting stuck.
In my testing I'm only seeing things get completely stuck when testing bcachefs
in multi-device mode, with no metadata or journal IO to the device in question,
just FUA data writes.
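
For the userspace half of that reproduction, a rough sketch of an O_DIRECT
write loop (roughly what dd with oflag=direct would do) might look like the
following. The helper name write_blocks, its parameters, and the 4096-byte
buffer alignment are assumptions made for illustration, not anything from the
kernel tree. It does not make writes FUA by itself; that still depends on the
generic_make_request() change described above.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original mail): write `count` blocks of
 * `block_size` bytes sequentially to `path`. When `use_direct` is non-zero the
 * file is opened with O_DIRECT so the writes bypass the page cache.
 * Returns the number of blocks fully written, or -1 on a setup failure.
 */
static long write_blocks(const char *path, int use_direct,
                         size_t block_size, long count)
{
    int flags = O_WRONLY | O_CREAT;
    void *buf = NULL;
    long done = 0;
    int fd;

    /* Keep sizes sector-multiple so the same call also works with O_DIRECT. */
    if (block_size == 0 || block_size % 512 != 0 || count < 0)
        return -1;
    if (use_direct)
        flags |= O_DIRECT;

    /* O_DIRECT needs an aligned buffer; 4096 is an assumed safe alignment. */
    if (posix_memalign(&buf, 4096, block_size) != 0)
        return -1;
    assert((uintptr_t)buf % 4096 == 0);
    memset(buf, 0xA5, block_size);

    fd = open(path, flags, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    while (done < count) {
        /* Each block goes to its own block-aligned offset. */
        ssize_t n = pwrite(fd, buf, block_size,
                           (off_t)done * (off_t)block_size);
        if (n != (ssize_t)block_size)
            break;              /* short write or error: stop and report */
        done++;
    }

    close(fd);
    free(buf);
    return done;
}
```

Calling write_blocks() with use_direct set to 1 against a scratch device or
file on the AHCI disk (it overwrites whatever is there) should generate a
steady stream of direct data writes. If the hang reproduces, the loop would
be expected to block inside pwrite() rather than return.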

After things get stuck, with kgdb I'm seeing a request on the request queue that
has flush_data_end_io for its endio function. I'm still trying to figure out
how the flush machinery is supposed to work, so I don't know what else you'd
want to know.

Much appreciated if you could take a look.