linux-ext4 - Re: [ext4 io hang] buffered write io hang in balance_dirty

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZEyL/sjVeW88XpIn@ovpn-8-24.pek2.redhat.com>
Date:   Sat, 29 Apr 2023 11:16:14 +0800
From:   Ming Lei <ming.lei@...hat.com>
To:     Theodore Ts'o <tytso@....edu>
Cc:     Baokun Li <libaokun1@...wei.com>,
        Matthew Wilcox <willy@...radead.org>,
        linux-ext4@...r.kernel.org,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        linux-block@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
        Dave Chinner <dchinner@...hat.com>,
        Eric Sandeen <sandeen@...hat.com>,
        Christoph Hellwig <hch@....de>, Zhang Yi <yi.zhang@...hat.com>,
        yangerkun <yangerkun@...wei.com>, ming.lei@...hat.com
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages

On Fri, Apr 28, 2023 at 01:47:22AM -0400, Theodore Ts'o wrote:
> On Fri, Apr 28, 2023 at 11:47:26AM +0800, Baokun Li wrote:
> > Ext4 just detects I/O Error and remounts it as read-only, it doesn't know
> > if the current disk is dead or not.
> > 
> > I asked Yu Kuai and he said that disk_live() can be used to determine
> > whether
> > a disk has been removed based on the status of the inode corresponding to
> > the block device, but this is generally not done in file systems.
> 
> What really needs to happen is that del_gendisk() needs to inform file
> systems that the disk is gone, so that the file system can shutdown
> the file system and tear everything down.

OK, looks both Dave and you have same suggestion, and IMO, it isn't hard to
add one interface for notifying FS, and it can be either one s_ops->shutdown()
or shutdown_filesystem(struct super_block *sb).

But the main job should be how this interface is implemented in FS/VFS side,
so it looks one more FS job, and block layer can call shutdown_filesystem()
from del_gendisk() simply.

> 
> disk_live() is relatively new; it was added in August 2021.  Back in

IO failure plus checking disk_live() could be one way for handling the
failure, but this kind of interface isn't friendly.

> 2015, I had added the following in fs/ext4/super.c:
> 
> /*
>  * The del_gendisk() function uninitializes the disk-specific data
>  * structures, including the bdi structure, without telling anyone
>  * else.  Once this happens, any attempt to call mark_buffer_dirty()
>  * (for example, by ext4_commit_super), will cause a kernel OOPS.
>  * This is a kludge to prevent these oops until we can put in a proper
>  * hook in del_gendisk() to inform the VFS and file system layers.
>  */
> static int block_device_ejected(struct super_block *sb)
> {
> 	struct inode *bd_inode = sb->s_bdev->bd_inode;
> 	struct backing_dev_info *bdi = inode_to_bdi(bd_inode);
> 
> 	return bdi->dev == NULL;
> }
> 
> As the comment states, it's rather awkward to have the file system
> check to see if the block device is dead in various places; the real

I can understand the awkward, :-(

bdi_unregister() is called in del_gendisk(), since bdi_register() has
to be called in add_disk() where major/minor is figured out.

> problem is that the block device shouldn't just *vanish*, with the

That looks not realistic, removable disk can be gone any time, and device
driver error handler often deletes disk as the last straw, and it shouldn't
be hard to observe such error.

Also it is not realistic to wait until all openers closes the bdev, given
it may wait forever.

> block device structures egetting partially de-initialized, without the
> block layer being polite enough to let the file system know.

Block device & gendisk instance won't be gone if the bdev is opened, and
I guess it is just few fields deinitialized, such as bdi->dev, bdi could be
the only one used by FS code. 

> 
> > Those dirty pages that are already there are piling up and can't be
> > written back, which I think is a real problem. Can the block layer
> > clear those dirty pages when it detects that the disk is deleted?
> 
> Well, the dirty pages belong to the file system, and so it needs to be
> up to the file system to clear out the dirty pages.  But I'll also
> what the right thing to do when a disk gets removed is not necessarily
> obvious.

Yeah, clearing dirty pages doesn't belong to block layer.

> 
> For example, suppose some process has a file mmap'ed into its address
> space, and that file is on the disk which the user has rudely yanked
> out from their laptop; what is the right thing to do?  Do we kill the
> process?  Do we let the process write to the mmap'ed region, and
> silently let the modified data go *poof* when the process exits?  What
> if there is an executable file on the removable disk, and there are
> one or more processes running that executable when the device
> disappears?  Do we kill the process?  Do we let the process run unti
> it tries to access a page which hasn't been paged in and then kill the
> process?
> 
> We should design a proper solution for What Should Happen when a
> removable disk gets removed unceremoniously without unmounting the
> file system first.  It's not just a matter of making some tests go
> green....

Agree, the trouble is actually in how FS to handle the disk removal.


Thanks,
Ming