Message-ID: <eb7c3b1c-b5c0-4078-9a88-327f1220cae8@gmx.com>
Date: Wed, 9 Jul 2025 10:25:08 +0930
From: Qu Wenruo <quwenruo.btrfs@....com>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: Dave Chinner <david@...morbit.com>, Christian Brauner
<brauner@...nel.org>, "Darrick J. Wong" <djwong@...nel.org>,
Qu Wenruo <wqu@...e.com>, linux-btrfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, viro@...iv.linux.org.uk, jack@...e.cz,
linux-ext4@...r.kernel.org, linux-f2fs-devel@...ts.sourceforge.net,
ntfs3@...ts.linux.dev, linux-xfs@...r.kernel.org
Subject: Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to
remove_bdev()

On 2025/7/9 10:05, Kent Overstreet wrote:
> On Wed, Jul 09, 2025 at 08:37:05AM +0930, Qu Wenruo wrote:
>> On 2025/7/9 08:29, Dave Chinner wrote:
>>> On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
>>>> On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
>>>>> On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2025/7/8 08:32, Dave Chinner wrote:
>>>>>>> On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
>>>>>>>> Currently, all the filesystems implementing the
>>>>>>>> super_operations::shutdown() callback cannot afford to lose a device.
>>>>>>>>
>>>>>>>> Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
>>>>>>>> involved filesystem.
>>>>>>>>
>>>>>>>> But this will no longer be the case: with multi-device filesystems like
>>>>>>>> btrfs and bcachefs, the filesystem can handle certain device loss without
>>>>>>>> shutting down the whole filesystem.
>>>>>>>>
>>>>>>>> To allow those multi-device filesystems to be integrated with
>>>>>>>> fs_holder_ops:
>>>>>>>>
>>>>>>>> - Replace super_operations::shutdown() with
>>>>>>>>   super_operations::remove_bdev(), to better describe when the
>>>>>>>>   callback is called.
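
For context: the existing hook in include/linux/fs.h is just
void (*shutdown)(struct super_block *sb);. The renamed callback is
presumably also told which device went away, roughly like the sketch
below (the exact prototype is of course whatever this patch defines):

	/* sketch only -- the real prototype is whatever this patch adds to
	 * struct super_operations in include/linux/fs.h */
	void (*remove_bdev)(struct super_block *sb, struct block_device *bdev);
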
>>>>>>>
>>>>>>> This conflates cause with action.
>>>>>>>
>>>>>>> The shutdown callout is an action that the filesystem must execute,
>>>>>>> whilst "remove bdev" is a cause notification that might require an
>>>>>>> action to be taken.
>>>>>>>
>>>>>>> Yes, the cause could be someone doing hot-unplug of the block
>>>>>>> device, but it could also be something going wrong in software
>>>>>>> layers below the filesystem. e.g. dm-thinp having an unrecoverable
>>>>>>> corruption or ENOSPC errors.
>>>>>>>
>>>>>>> We already have a "cause" notification: blk_holder_ops->mark_dead().
>>>>>>>
>>>>>>> The generic fs action that is taken by this notification is
>>>>>>> fs_bdev_mark_dead(). That action is to invalidate caches and shut
>>>>>>> down the filesystem.
>>>>>>>
>>>>>>> btrfs needs to do something different in response to a
>>>>>>> blk_holder_ops->mark_dead notification, i.e. it needs an action
>>>>>>> that is different to fs_bdev_mark_dead().
>>>>>>>
>>>>>>> Indeed, this is how bcachefs already handles "single device
>>>>>>> died" events for multi-device filesystems - see
>>>>>>> bch2_fs_bdev_mark_dead().
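
For reference, registering per-fs holder ops is already possible today; a
minimal sketch (the my_fs_* names are made up; struct blk_holder_ops,
.mark_dead() and bdev_file_open_by_path() are the real interfaces):

	/* degrade just this member device instead of shutting down the fs */
	static void my_fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
	{
		struct my_fs_info *fs = bdev->bd_holder;	/* holder passed at open time */

		my_fs_handle_device_loss(fs, bdev, surprise);	/* hypothetical helper */
	}

	static const struct blk_holder_ops my_fs_holder_ops = {
		.mark_dead	= my_fs_bdev_mark_dead,
	};

	/* when opening each member device: */
	bdev_file = bdev_file_open_by_path(path, BLK_OPEN_READ | BLK_OPEN_WRITE,
					   fs, &my_fs_holder_ops);

which is essentially the bch2_fs_bdev_mark_dead() approach mentioned
above, just open-coded per filesystem.
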
>>>>>>
>>>>>> I do not think it's the correct way to go, especially when there is already
>>>>>> fs_holder_ops.
>>>>>>
>>>>>> We're always going towards a more generic solution, rather than letting
>>>>>> each individual fs do the same thing slightly differently.
>>>>>
>>>>> On second thought -- it's weird that you'd flush the filesystem and
>>>>> shrink the inode/dentry caches in a "your device went away" handler.
>>>>> Fancy filesystems like bcachefs and btrfs would likely just shift IO to
>>>>> a different bdev, right? And there's no good reason to run shrinkers on
>>>>> either of those fses, right?
>>>>>
>>>>>> Yes, the naming is not perfect and mixes cause with action, but the end
>>>>>> result is still a more generic and less duplicated code base.
>>>>>
>>>>> I think dchinner makes a good point that if your filesystem can do
>>>>> something clever on device removal, it should provide its own block
>>>>> device holder ops instead of using fs_holder_ops. I don't understand
>>>>> why you need a "generic" solution for btrfs when it's not going to do
>>>>> what the others do anyway.
>>>>
>>>> I think letting filesystems implement their own holder ops should be
>>>> avoided if we can. Christoph may chime in here. I have no appetite for
>>>> exporting stuff like get_bdev_super() unless absolutely necessary. We
>>>> tried to move all that handling into the VFS to eliminate a slew of
>>>> deadlocks we detected and fixed. I have no appetite to repeat that
>>>> cycle.
>>>
>>> Except it isn't actually necessary.
>>>
>>> Everyone here seems to be assuming that the filesystem *must* take
>>> an active superblock reference to process a device removal event,
>>> and that is *simply not true*.
>>>
>>> bcachefs does not use get_bdev_super() or an active superblock
>>> reference to process ->mark_dead events.
>>
>> Yes, bcachefs doesn't go down this path, but the reason why is more important.
>>
>> Is it just because there was no such callback, so Kent had to come up with
>> his own solution, or is it something else?
>>
>> If it's the former, all your arguments here make no sense.
>>
>> Adding Kent here to see if he wants a more generic s_op->remove_bdev()
>> solution, or the current approach of each fs doing its own mark_dead() callback.
>
> Consider that the thing that has a block device open might not even be a
> filesystem, or at least not a VFS filesystem.
>
> It could be a stacking block device driver - md or dm - and those
> absolutely should be implementing .mark_dead() (likely passing it
> through on up the stack), or something else entirely.
>
> In bcachefs's case, we might not even have created the VFS super_block
> yet: we don't do that until after recovery finishes, and indeed we can't,
> because creating a super_block and leaving it in !SB_BORN will cause
> such fun as sync calls hanging for the entire system...
>
Not related to the series, but IIRC if s_flags doesn't have SB_ACTIVE
set, things like bdev_super_lock() won't pick that superblock, and thus
won't call the ->sync() callback through the bdev callbacks.

btrfs also follows the same scheme, only setting SB_ACTIVE after
everything is done (including replaying the log, etc.), and so far we
haven't hit such a sync during mount.
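
Roughly the gate I mean (an illustration from memory only, not the actual
fs/super.c code; the real check may well be on SB_BORN/SB_DYING rather
than SB_ACTIVE):

	/* illustration only -- not the real bdev_super_lock() */
	struct super_block *sb = bdev->bd_holder;

	if (!sb || !(sb->s_flags & SB_ACTIVE))
		return NULL;	/* not fully set up; holder callbacks like ->sync are skipped */
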
Thanks,
Qu