linux-ext4 - Re: [ext4 io hang] buffered write io hang in balance_dirty

Message-ID: <ZEyjY0W+8zafPAoh@mit.edu>
Date:   Sat, 29 Apr 2023 00:56:03 -0400
From:   "Theodore Ts'o" <tytso@....edu>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     Baokun Li <libaokun1@...wei.com>,
        Matthew Wilcox <willy@...radead.org>,
        linux-ext4@...r.kernel.org,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        linux-block@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        Dave Chinner <dchinner@...hat.com>,
        Eric Sandeen <sandeen@...hat.com>,
        Christoph Hellwig <hch@....de>, Zhang Yi <yi.zhang@...hat.com>,
        yangerkun <yangerkun@...wei.com>
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Sat, Apr 29, 2023 at 11:16:14AM +0800, Ming Lei wrote:
> 
> bdi_unregister() is called in del_gendisk(), since bdi_register() has
> to be called in add_disk() where major/minor is figured out.
> 
> > problem is that the block device shouldn't just *vanish*, with the
> 
> That looks not realistic, removable disk can be gone any time, and device
> driver error handler often deletes disk as the last straw, and it shouldn't
> be hard to observe such error.

It's not realistic to think that the file system can write back any
dirty pages, sure.  At this point, the user has already yanked out the
thumb drive, and the physical device is gone.  However, various fields
like bdi->dev shouldn't get deinitialized until after the
s_ops->shutdown() function has returned.

We need to give the file system a chance to shut down any pending
writebacks; otherwise, we could be racing with writeback happening in
some other kernel thread, and while the I/O is certainly not going to
succeed, it would be nice if attempts to write to the block device
returned an error, instead of potentially causing the kernel to crash.

The shutdown function might need to sleep while it waits for
workqueues or kernel threads to exit, or while it iterates over all
inodes and clears all of the dirty bits and/or drop all of the pages
associated with the file system on the disconnected block device.  So
while this happens, I/O should just fail, and not result in a kernel
BUG or oops.

Once s_ops->shutdown() has returned, del_gendisk can shut down
and/or deallocate anything it wants, and if the file system tries to
use the bdi after s_ops->shutdown() has returned, well, it deserves
anything it gets.
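
To make the ordering concrete, here is a toy userspace model of it.
This is only a sketch: the toy_* types and functions are invented for
illustration and are not kernel structures or the real del_gendisk()
path.  It assumes a single thread, so the "wait for writeback to
finish" part of shutdown is reduced to setting a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-ins; none of these are the real kernel structures. */
struct toy_bdi {
    bool dev_valid;     /* models bdi->dev still being initialized */
    bool hw_present;    /* models the physical device being plugged in */
};

struct toy_sb {
    struct toy_bdi *bdi;
    bool shut_down;     /* set by the (hypothetical) shutdown hook */
    int dirty_pages;    /* pages still waiting for writeback */
};

/* Writeback path: returns an error rather than touching dead state. */
static int toy_writeback_one(struct toy_sb *sb)
{
    if (sb->shut_down)
        return -EIO;                /* fs already told the device is gone */
    assert(sb->bdi->dev_valid);     /* holds because teardown runs after shutdown */
    if (!sb->bdi->hw_present)
        return -EIO;                /* device yanked: the I/O just fails */
    sb->dirty_pages--;
    return 0;
}

/* Hypothetical shutdown hook: stop writeback and drop dirty state. */
static void toy_fs_shutdown(struct toy_sb *sb)
{
    sb->shut_down = true;
    sb->dirty_pages = 0;            /* the data cannot be written back */
}

/* Teardown in the order argued for: shutdown first, deinitialize after. */
static void toy_del_gendisk(struct toy_sb *sb)
{
    toy_fs_shutdown(sb);
    sb->bdi->dev_valid = false;
}
```

In this model a write attempted between the device disappearing and
toy_del_gendisk() finishing gets -EIO, and nothing dereferences the
bdi once it has been deinitialized.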

(Well, it would be nice if things didn't bug/oops in fs/buffer.c if
there is no s_ops->shutdown() function.  There are a lot of legacy
file systems that use the buffer cache, and until we can add some
kind of generic shutdown function to fs/libfs.c and make sure that
all of the legacy file systems that are likely to be used on a USB
thumb drive are fixed, it would be nice if they were protected.  At
the very least, we should make sure that things are no worse than
they currently are.)

                                        - Ted

P.S.  Note that the semantics I've described here for
s_ops->shutdown() are slightly different than what the FS_IOC_SHUTDOWN
ioctl currently does.  For example, after FS_IOC_SHUTDOWN, writes to
files will fail, but reads from already open files will succeed.  I know
this because the original ext4 shutdown implementation did actually
prevent reads from going through, but we got objections from those
that wanted ext4's FS_IOC_SHUTDOWN to work the same way as xfs's.

So we have an out-of-tree patch for ext4's FS_IOC_SHUTDOWN
implementation in our kernels at $WORK, because we were using it when
we knew that the back-end server providing the iSCSI or remote block
had died, and we wanted to make sure our borg (think Kubernetes) jobs
would fast fail when they tried reading from the dead file system, as
opposed to failing only after some timeout had elapsed.

To avoid confusion, we should probably either use a different name
than s_ops->shutdown(), or add a new mode to FS_IOC_SHUTDOWN which
corresponds to "the block device is gone, shut *everything* down:
reads, writes, everything."  My preference would be the latter, since
it would mean we could stop carrying that out-of-tree patch in our
data center kernels...
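
As a rough illustration of the difference between the two behaviors,
here is a small userspace sketch.  The toy_* names and mode values are
made up for this example; they are not existing FS_IOC_SHUTDOWN flags,
and the sketch only models reads and writes on already open files.

```c
#include <errno.h>

/* Hypothetical modes; these are NOT real FS_IOC_SHUTDOWN flag values. */
enum toy_shutdown_mode {
    TOY_FS_ACTIVE = 0,
    TOY_FS_SHUTDOWN_WRITES,   /* current behavior: writes fail, reads still work */
    TOY_FS_SHUTDOWN_ALL,      /* "block device is gone": reads fail too */
};

struct toy_fs {
    enum toy_shutdown_mode mode;
};

/* Writes fail in either shutdown mode. */
static int toy_write(const struct toy_fs *fs)
{
    return fs->mode == TOY_FS_ACTIVE ? 0 : -EIO;
}

/* Reads fail only in the "shut everything down" mode. */
static int toy_read(const struct toy_fs *fs)
{
    return fs->mode == TOY_FS_SHUTDOWN_ALL ? -EIO : 0;
}
```

With TOY_FS_SHUTDOWN_ALL, a job reading from the dead file system gets
an error right away instead of waiting for a timeout.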