Date:   Wed, 11 Apr 2018 15:52:44 -0600
From:   Andreas Dilger <>
Cc:     "Theodore Y. Ts'o" <>,
        Ext4 Developers List <>,
        Linux FS Devel <>,
        Jeff Layton <>,
        "Joshua D. Drake" <>,
Subject: Re: fsync() errors is unsafe and risks data loss

On Apr 10, 2018, at 4:07 PM, Andres Freund <> wrote:
> 2018-04-10 18:43:56 Ted wrote:
>> So for better or for worse, there has not been as much investment in
>> buffered I/O and data robustness in the face of exception handling of
>> storage devices.
> That's a bit of a cop-out. It's not just databases that care. Even more
> basic tools like SCMs, package managers, and editors care whether they get
> proper responses back from fsync() that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
to thousands(?) of files, closing the file descriptors, then expecting
fsync() on a newly-opened fd to return a historical error.  If an editor
tries to write a file, then calls fsync and gets an error, the user will
enter a new pathname and retry the write.  The package manager will assume
the package installation failed, and uninstall the parts of the package
that were already written.
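The careful version of the editor case described above is: write to a temporary file, fsync() it while that fd is still open so any writeback error is reported to the caller, and only then rename over the original. A minimal sketch (the helper name and paths are hypothetical, not any editor's actual code):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write buf to path via a temporary file, and
 * report any write()/fsync() failure to the caller so it can retry
 * with a new pathname.  Returns 0 on success, -1 with errno set. */
int save_file(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len ||
        fsync(fd) != 0) {            /* error is seen on THIS open fd */
        int saved = errno;
        close(fd);
        unlink(tmp);                 /* don't leave a bad copy behind */
        errno = saved;
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    return rename(tmp, path);        /* atomically replace the original */
}
```

Because fsync() runs before the fd is closed, the error (if any) reaches the application while it can still do something sensible about it.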

There is no way the filesystem can handle the package manager failure case,
and keeping the pages dirty and retrying indefinitely may never work (e.g.
disk is dead or disconnected, is a sparse volume without any free space,
etc).  This (IMHO) implies that the higher layer (which knows more about
what the write failure implies) needs to deal with this.

> 2018-04-10 18:43:56 Ted wrote:
>> So this is the explanation for why Linux handles I/O errors by
>> clearing the dirty bit after reporting the error up to user space.
>> And why there is not eagerness to solve the problem simply by "don't
>> clear the dirty bit".  For every one Postgres installation that might
>> have better recovery after an I/O error, there are probably a thousand
>> clueless Fedora and Ubuntu users who will have a much worse user
>> experience after a USB stick pull happens.
> I don't think these necessarily are as contradictory goals as you paint
> them.  At least in postgres' case we can deal with the fact that an
> fsync retry isn't going to fix the problem by reentering crash recovery
> or just shutting down - therefore we don't need to keep all the dirty
> buffers around.  A per-inode or per-superblock bit that causes further
> fsyncs to fail would be entirely sufficient for that.

> While there are some differing opinions on the referenced postgres thread,
> the fundamental problem isn't so much that a retry won't fix the
> problem, it's that we might NEVER see the failure.  If writeback happens
> in the background, encounters an error, undirties the buffer, we will
> happily carry on because we've never seen that. That's when we're
> majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened"
and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback *without an open file descriptor*,
there is no mechanism for the kernel to return an error to any application
which may exist, or may not ever come back.

Consider if there was a per-inode "there was once an error writing this
inode" flag.  Then fsync() would return an error on the inode forever,
since there is no way in POSIX to clear this state, since it would need
to be kept in case some new fd is opened on the inode and does an fsync()
and wants the error to be returned.

IMHO, the only alternative would be to keep the dirty pages in memory
until they are written to disk.  If that was not possible, what then?
It would need a reboot to clear the dirty pages, or truncate the file
(discarding all data)?

> Both in postgres, *and* a lot of other applications, it's not at all
> guaranteed to consistently have one FD open for every file
> written. Therefore even the more recent per-fd errseq logic doesn't
> guarantee that the failure will ever be seen by an application
> diligently fsync()ing.

... only if the application closes all fds for the file before calling
fsync.  If any fd is kept open from the time of the failure, it will
return the original error on fsync() (and then no longer return it).

It's not that you need to keep every fd open forever.  You could put them
into a shared pool, and re-use them if the file is "re-opened", and call
fsync on each fd before it is closed (because the pool is getting too big
or because you want to flush the data for that file, or shut down the DB).
That wouldn't require a huge re-architecture of PG, just a small library
to handle the shared fd pool.

That might even improve performance, because opening and closing files
is itself not free, especially if you are working with remote filesystems.
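Such a pool could look roughly like the following toy sketch (names, sizes, and the no-eviction policy are invented for illustration, not PG's actual code).  The point is that "re-opening" a file reuses the long-lived fd, so a deferred writeback error is still reported to a later fsync() on it:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define POOL_MAX 64

struct fd_pool_entry { char path[256]; int fd; };
static struct fd_pool_entry pool[POOL_MAX];
static int pool_used;

/* Return a cached fd for path, opening (and caching) it on first use.
 * A real pool would also evict old entries; this sketch does not. */
int pool_open(const char *path, int flags)
{
    for (int i = 0; i < pool_used; i++)
        if (strcmp(pool[i].path, path) == 0)
            return pool[i].fd;       /* "re-open" reuses the old fd */

    int fd = open(path, flags, 0644);
    if (fd < 0)
        return -1;
    if (pool_used < POOL_MAX) {
        strncpy(pool[pool_used].path, path, sizeof(pool[0].path) - 1);
        pool[pool_used].fd = fd;
        pool_used++;
    }
    return fd;
}

/* Drain the pool: fsync each fd first, so any deferred writeback error
 * surfaces here, before the last fd on the file is closed. */
int pool_close_all(void)
{
    int ret = 0;
    for (int i = 0; i < pool_used; i++) {
        if (fsync(pool[i].fd) != 0)
            ret = -1;
        close(pool[i].fd);
    }
    pool_used = 0;
    return ret;
}
```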

> You'd not even need to have per inode information or such in the case
> that the block device goes away entirely. As the FS isn't generally
> unmounted in that case, you could trivially keep a per-mount (or
> superblock?) bit that says "I died" and set that instead of keeping per
> inode/whatever information.

The filesystem will definitely return an error in this case, I don't
think this needs any kind of changes:

int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
        ...
        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
                return -EIO;

> 2018-04-10 18:43:56 Ted wrote:
>> If you are aware of a company who is willing to pay to have a new
>> kernel feature implemented to meet your needs, we might be able to
>> refer you to a company or a consultant who might be able to do that
>> work.
> I find that a bit of a disappointing response. I think it's a fair answer
> for advanced features, but we're talking about the basic guarantee that
> fsync actually does something even remotely reasonable.

Linux (as PG) is run by people who develop it for their own needs, or
are paid to develop it for the needs of others.  Everyone already has
too much work to do, so you need to find someone who has an interest
in fixing this (IMHO very peculiar) use case.  If PG developers want
to add a tunable "keep dirty pages in RAM on IO failure", I don't think
that it would be too hard for someone to do.  It might be harder to
convince some of the kernel maintainers to accept it, and I've been on
the losing side of that battle more than once.  However, like everything
you don't pay for, you can't require someone else to do this for you.
It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches,
would be interested to work on something like this.

That said, _even_ if a fix were available for Linux tomorrow, it would
be *years* before a majority of users would have it available on their
systems; that includes even the errseq mechanism that landed a few
months ago.  That implies to me that you'd want something that fixes PG
*now* so that it works around whatever (perceived) breakage exists in
the Linux fsync() implementation.  Since the thread indicates that
non-Linux kernels have the same fsync() behaviour, it makes sense to do
that even if the Linux fix was available.

> 2018-04-10 19:44:48 Andreas wrote:
>> The confusion is whether fsync() is a "level" state (return error
>> forever if there were pages that could not be written), or an "edge"
>> state (return error only for any write failures since the previous
>> fsync() call).
> I don't think that's the full issue. We can deal with the fact that an
> fsync failure is edge-triggered if there's a guarantee that every
> process doing so would get it.  The fact that one needs to have an FD
> open from before any failing writes occurred to get a failure, *THAT'S*
> the big issue.
> Beyond postgres, it's a pretty common approach to do work on a lot of
> files without fsyncing, then iterate over the directory fsync
> everything, and *then* assume you're safe. But unless I severely
> misunderstand something that'd only be safe if you kept an FD for
> every file open, which isn't realistic for pretty obvious reasons.

I can't say how common or uncommon such a workload is, though PG is the
only application that I've heard of doing it, and I've been working on
filesystems for 20 years.  I'm a bit surprised that anyone expects
fsync() on a newly-opened fd to have any state from write() calls that
predate the open.  I can understand fsync() returning an error for any
IO that happens within the context of that fsync(), but how far should
it go back for reporting errors on that file?  Forever?   The only
way to clear the error would be to reboot the system, since I'm not
aware of any existing POSIX code to clear such an error.
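The directory-walk pattern being debated looks roughly like this sketch (error handling abbreviated, function name invented); the comment marks exactly where the newly-opened fd limits what fsync() can report:

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* "Write everything, then fsync by walking the directory."  The caveat
 * under discussion: each fsync() below runs on a freshly-opened fd, so
 * a writeback error that was reported and cleared before the open may
 * not be seen here, and the loop can report success anyway. */
int fsync_dir_entries(const char *dirpath)
{
    DIR *d = opendir(dirpath);
    if (!d)
        return -1;

    struct dirent *de;
    int ret = 0;
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                    /* skip ".", "..", hidden files */
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
        int fd = open(path, O_RDONLY);
        if (fd < 0) { ret = -1; continue; }
        if (fsync(fd) != 0)              /* only sees errors on this fd */
            ret = -1;
        close(fd);
    }
    closedir(d);
    return ret;
}
```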

> 2018-04-10 19:44:48 Andreas wrote:
>> I think Anthony Iliopoulos was pretty clear in his multiple
>> descriptions in that thread of why the current behaviour is needed
>> (OOM of the whole system if dirty pages are kept around forever), but
>> many others were stuck on "I can't believe this is happening??? This
>> is totally unacceptable and every kernel needs to change to match my
>> expectations!!!" without looking at the larger picture of what is
>> practical to change and where the issue should best be fixed.
> Everyone can participate in discussions...
> Greetings,
> Andres Freund

Cheers, Andreas
