Message-ID: <20230309160000.GC1637786@frogsfrogsfrogs>
Date: Thu, 9 Mar 2023 08:00:00 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Dave Chinner <david@...morbit.com>
Cc: Jan Kara <jack@...e.cz>, lsf-pc@...ts.linux-foundation.org,
linux-fsdevel@...r.kernel.org, xfs <linux-xfs@...r.kernel.org>,
linux-ext4 <linux-ext4@...r.kernel.org>,
linux-btrfs <linux-btrfs@...r.kernel.org>
Subject: Re: [LSF TOPIC] online repair of filesystems: what next?

On Thu, Mar 09, 2023 at 08:54:39AM +1100, Dave Chinner wrote:
> On Wed, Mar 08, 2023 at 06:12:06PM +0100, Jan Kara wrote:
> > Hi!
> >
> > I'm interested in this topic. Some comments below.
> >
> > On Tue 28-02-23 12:49:03, Darrick J. Wong wrote:
> > > Five years ago[0], we started a conversation about cross-filesystem
> > > userspace tooling for online fsck. I think enough time has passed for
> > > us to have another one, since a few things have happened since then:
> > >
> > > 1. ext4 has gained the ability to send corruption reports to a userspace
> > >    monitoring program via fsnotify. Thanks, Collabora!
> > >
> > > 2. XFS now tracks successful scrubs and corruptions seen during runtime
> > >    and during scrubs. Userspace can query this information.
> > >
> > > 3. Directory parent pointers, which enable online repair of the
> > >    directory tree, are nearing completion.
> > >
> > > 4. Dave and I are working on merging online repair of space metadata for
> > >    XFS. Online repair of directory trees is feature complete, but we
> > >    still have one or two unresolved questions in the parent pointer
> > >    code.
> > >
> > > 5. I've gotten a bit better[1] at writing systemd service descriptions
> > >    for scheduling and performing background online fsck.
> > >
> > > Now that fsnotify_sb_error exists as a result of (1), I think we
> > > should figure out how to plumb calls into the readahead and writeback
> > > code so that IO failures can be reported to the fsnotify monitor. I
> > > suspect there may be a few difficulties here since fsnotify (iirc)
> > > allocates memory and takes locks.
> >
> > Well, if you want to generate fsnotify events from an interrupt handler,
> > you're going to have a hard time; I don't have a good answer for that.
>
> I don't think we ever do that, or need to do that. IO completions
> that can throw corruption errors are already running in workqueue
> contexts in XFS.
>
> Worst case, we throw all bios that have IO errors flagged to the
> same IO completion workqueues, and the problem of memory allocation,
> locks, etc. in interrupt context goes away entirely.

Indeed. For XFS I think the only time we might need to fsnotify about
errors from interrupt context is writeback completions for a pure
overwrite? We could punt those to a workqueue as Dave says. Or figure
out a way for whoever's initiating writeback to send it for us?

I think this is a general issue for the pagecache, not XFS. I'll
brainstorm with willy the next time I encounter him.
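
In the meantime, here's roughly the shape of punt I have in mind.  This
is a sketch only -- fsnotify_sb_error() is the real hook, but the
wb_error_work structure and both helpers below are invented for
illustration:

/*
 * Hypothetical sketch: defer fsnotify error reporting from bio
 * completion (possibly interrupt) context to a workqueue, where it is
 * safe to allocate memory and take locks.
 */
#include <linux/workqueue.h>
#include <linux/fsnotify.h>
#include <linux/slab.h>
#include <linux/fs.h>

struct wb_error_work {
	struct work_struct	work;
	struct super_block	*sb;
	struct inode		*inode;	/* counted reference */
	int			error;
};

static void wb_error_workfn(struct work_struct *work)
{
	struct wb_error_work *we =
			container_of(work, struct wb_error_work, work);

	fsnotify_sb_error(we->sb, we->inode, we->error);
	iput(we->inode);
	kfree(we);
}

/* Called from ->bi_end_io; writeback in flight keeps the inode alive. */
static void wb_punt_error(struct inode *inode, int error)
{
	struct wb_error_work *we = kzalloc(sizeof(*we), GFP_ATOMIC);

	if (!we)
		return;	/* best effort; fsnotify has its own mempool */
	INIT_WORK(&we->work, wb_error_workfn);
	we->sb = inode->i_sb;
	ihold(inode);	/* atomic bump; safe in irq context */
	we->inode = inode;
	we->error = error;
	queue_work(system_unbound_wq, &we->work);
}
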
> > But
> > offloading of error event generation to a workqueue should be doable (and
> > event delivery is async anyway so from userspace POV there's no
> > difference).
>
> Unless I'm misunderstanding you (possible!), that requires a memory
> allocation to offload the error information to the work queue to
> allow the fsnotify error message to be generated in an async manner.
> That doesn't seem to solve anything.
>
> > Otherwise locking shouldn't be a problem AFAICT. WRT memory
> > allocation, we currently preallocate the error events to avoid losing
> > events due to ENOMEM. With current use cases (catastrophic filesystem
> > error reporting) we have settled on a mempool with 32 preallocated
> > events (note that a preallocated event gets used only if the normal
> > kmalloc fails) for simplicity. If the error reporting mechanism is
> > going to be used significantly more, we may need to reconsider this,
> > but it should be doable. And frankly, if you have a storm of fs errors
> > *and* the system is going ENOMEM at the same time, I have my doubts
> > that losing some error reports is going to do any more harm ;).
>
> Once the filesystem is shut down, it will need to turn off
> individual sickness notifications because everything is sick at this
> point.

I was thinking that the existing fsnotify error set should adopt a 'YOUR
FS IS DEAD' notification. Then when the fs goes down due to errors or
the shutdown ioctl, we can broadcast that as the filesystem's last gasp.
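
As a strawman, the shutdown path could do something like the sketch
below.  fsnotify_sb_error() is real, and passing a NULL inode already
scopes the event to the whole superblock; the helper name is invented,
and a distinct 'fs is dead' mask bit would be new fanotify ABI:

/*
 * Hypothetical sketch, living somewhere in fs/xfs/: broadcast one
 * final superblock-scoped event when the filesystem goes down, then
 * suppress further per-object sickness events.  Today this would be
 * indistinguishable from any other FAN_FS_ERROR event.
 */
#include <linux/fsnotify.h>

static void xfs_fs_notify_dead(struct xfs_mount *mp, int error)
{
	/* NULL inode == event applies to the entire filesystem */
	fsnotify_sb_error(mp->m_super, NULL, error);
}
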
> > > As a result of (2), XFS now retains quite a bit of incore state about
> > > its own health. The structure that fsnotify gives to userspace is very
> > > generic (superblock, inode, errno, errno count). How might XFS export
> > > a greater amount of information via this interface? We can provide
> > > details at finer granularity -- for example, a specific data structure
> > > under an allocation group or an inode, or specific quota records.
> >
> > The fsnotify (fanotify, in fact) interface is fairly flexible in what
> > can be passed through it. So if you need to pass some (reasonably
> > short) binary blob to userspace which knows how to decode it, fanotify
> > can handle that (with some wrapping). Obviously there's a tradeoff in
> > how much of the event is generic (as that is then easier to process by
> > tools common to all filesystems) and how much is fs-specific (which
> > allows passing more detailed information). But I guess we need
> > concrete examples of events to discuss this.
>
> Fine-grained health information will always be filesystem specific -
> IMO it's not worth trying to make it generic when there is only one
> filesystem tracking and exporting fine-grained health information.
> Once (if) we get multiple filesystems tracking fine-grained health
> information, then we'll have the information we need
> to implement a useful generic set of notifications, but until then I
> don't think we should try.

Same here. XFS might want to send the generic notifications and follow
them up with more specific information?
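
For example (pure strawman -- only struct fanotify_event_info_header is
real; the info type and the record below are invented), the fs-specific
followup could ride along as an extra fanotify info record:

/* Hypothetical XFS-specific payload appended to a FAN_FS_ERROR event. */
struct fanotify_event_info_xfs_health {
	struct fanotify_event_info_header hdr;	/* invented info_type */
	__u32	xh_magic;	/* XFS_SB_MAGIC; tells userspace how to decode */
	__u32	xh_type;	/* which metadata structure is sick */
	__u32	xh_agno;	/* allocation group, if applicable */
	__u32	xh_pad;
	__u64	xh_ino;		/* inode number, if applicable */
};
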
> We should just export the notifications the filesystem utilities
> need to do their work for the moment. When management applications
> (e.g. Stratis) get to the point where they can report/manage
> filesystem health and need that information from multiple
> filesystems types, then we can work out a useful common subset of
> fine grained events across those filesystems that the applications
> can listen for.

If someone wants to write xfs_scrubd that listens for events and issues
XFS_IOC_SCRUB_METADATA calls, I'd be all ears. :)
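
Something like this, maybe?  (Totally untested sketch; real code would
decode the fanotify error/FID info records to scrub just the sick
object instead of blindly probing, and would check every syscall.)

/* xfs_scrubd sketch: wait for FAN_FS_ERROR on an XFS mount, then ask
 * the kernel to scrub.  XFS_IOC_SCRUB_METADATA and friends come from
 * the xfsprogs headers. */
#include <sys/fanotify.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
	struct fanotify_event_metadata evbuf[64], *ev;
	int notify_fd, mount_fd;
	ssize_t len;

	if (argc != 2)
		return 1;

	/* FAN_FS_ERROR delivery requires a FID-reporting group. */
	notify_fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
	fanotify_mark(notify_fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
		      FAN_FS_ERROR, AT_FDCWD, argv[1]);
	mount_fd = open(argv[1], O_RDONLY | O_DIRECTORY);

	while ((len = read(notify_fd, evbuf, sizeof(evbuf))) > 0) {
		for (ev = evbuf; FAN_EVENT_OK(ev, len);
		     ev = FAN_EVENT_NEXT(ev, len)) {
			struct xfs_scrub_metadata sm;

			/* Cheapest possible reaction: probe the fs. */
			memset(&sm, 0, sizeof(sm));
			sm.sm_type = XFS_SCRUB_TYPE_PROBE;
			if (ioctl(mount_fd, XFS_IOC_SCRUB_METADATA, &sm))
				perror("XFS_IOC_SCRUB_METADATA");
		}
	}
	return 0;
}
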
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@...morbit.com