[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180414023814.GB997@bombadil.infradead.org>
Date: Fri, 13 Apr 2018 19:38:14 -0700
From: Matthew Wilcox <willy@...radead.org>
To: Dave Chinner <david@...morbit.com>
Cc: Jeff Layton <jlayton@...hat.com>,
lsf-pc <lsf-pc@...ts.linuxfoundation.org>,
Andres Freund <andres@...razel.de>,
Andreas Dilger <adilger@...ger.ca>,
"Theodore Y. Ts'o" <tytso@....edu>,
Ext4 Developers List <linux-ext4@...r.kernel.org>,
Linux FS Devel <linux-fsdevel@...r.kernel.org>,
"Joshua D. Drake" <jd@...mandprompt.com>
Subject: Re: fsync() errors is unsafe and risks data loss
On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:
> On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:
> > 1. If we get an error while wbc->for_background is true, we should not clear
> > uptodate on the page, rather SetPageError and SetPageDirty.
>
> So you're saying we should treat it as a transient error rather than
> a permanent error.
Yes, I'm proposing leaving the data in memory in case the user wants to
try writing it somewhere else.
> > 2. Background writebacks should skip pages which are PageError.
>
> That seems decidedly dodgy in the case where there is a transient
> error - it requires a user to specifically run sync to get the data
> to disk after the transient error has occurred. Say they don't
> notice the problem because it's fleeting and doesn't cause any
> obvious problems?
That's fair. What I want to avoid is triggering the same error every
30 seconds (or whatever the periodic writeback threshold is set to).
> e.g. XFS gets to enospc, runs out of reserve pool blocks so can't
> allocate space to write back the page, then space is freed up a few
> seconds later and so the next write will work just fine.
>
> This is a recipe for "I lost data that I wrote /days/ before the
> system crashed" bug reports.
So ... exponential backoff on retries?
> > 3. for_sync writebacks should attempt one last write. Maybe it'll
> > succeed this time. If it does, just ClearPageError. If not, we have
> > somebody to report this writeback error to, and ClearPageUptodate.
>
> Which may well be unmount. Are we really going to wait until unmount
> to report fatal errors?
Goodness, no. The errors would be immediately reportable using the wb_err
mechanism, as soon as the first error was encountered.
Powered by blists - more mailing lists