lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 11 Apr 2018 19:17:52 -0700
From:   Andres Freund <>
To:     Andreas Dilger <>
        "Theodore Y. Ts'o" <>,
        Ext4 Developers List <>,
        Linux FS Devel <>,
        Jeff Layton <>,
        "Joshua D. Drake" <>
Subject: Re: fsync() errors is unsafe and risks data loss


On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:
> On Apr 10, 2018, at 4:07 PM, Andres Freund <> wrote:
> > 2018-04-10 18:43:56 Ted wrote:
> >> So for better or for worse, there has not been as much investment in
> >> buffered I/O and data robustness in the face of exception handling of
> >> storage devices.
> > 
> > That's a bit of a cop out. It's not just databases that care. Even more
> > basic tools like SCM, package managers and editors care whether they can
> > proper responses back from fsync that imply things actually were synced.
> Sure, but it is mostly PG that is doing (IMHO) crazy things like writing
> to thousands(?) of files, closing the file descriptors, then expecting
> fsync() on a newly-opened fd to return a historical error.

It's not just postgres. dpkg (underlying apt, on debian derived distros)
to take an example I just randomly guessed, does too:
  /* We want to guarantee the extracted files are on the disk, so that the
   * subsequent renames to the info database do not end up with old or zero
   * length files in case of a system crash. As neither dpkg-deb nor tar do
   * explicit fsync()s, we have to do them here.
   * XXX: This could be avoided by switching to an internal tar extractor. */

(a bunch of other places too)

Especially on ext3 but also on newer filesystems it's performancewise
entirely infeasible to fsync() every single file individually - the
performance becomes entirely attrocious if you do that.

I think there's some legitimate arguments that a database should use
direct IO (more on that as a reply to David), but claiming that all
sorts of random utilities need to use DIO with buffering etc is just

> If an editor tries to write a file, then calls fsync and gets an
> error, the user will enter a new pathname and retry the write.  The
> package manager will assume the package installation failed, and
> uninstall the parts of the package that were already written.

Except that they won't notice that they got a failure, at least in the
dpkg case. And happily continue installing corrupted data

> There is no way the filesystem can handle the package manager failure case,
> and keeping the pages dirty and retrying indefinitely may never work (e.g.
> disk is dead or disconnected, is a sparse volume without any free space,
> etc).  This (IMHO) implies that the higher layer (which knows more about
> what the write failure implies) needs to deal with this.

Yea, I agree that'd not be sane. As far as I understand the dpkg code
(all of 10min reading it), that'd also be unnecessary. It can abort the
installation, but only if it detects the error.  Which isn't happening.

> > While there's some differing opinions on the referenced postgres thread,
> > the fundamental problem isn't so much that a retry won't fix the
> > problem, it's that we might NEVER see the failure.  If writeback happens
> > in the background, encounters an error, undirties the buffer, we will
> > happily carry on because we've never seen that. That's when we're
> > majorly screwed.
> I think there are two issues here - "fsync() on an fd that was just opened"
> and "persistent error state (without keeping dirty pages in memory)".
> If there is background data writeback *without an open file descriptor*,
> there is no mechanism for the kernel to return an error to any application
> which may exist, or may not ever come back.

And that's *horrible*. If I cp a file, and writeback fails in the
background, and I then cat that file before restarting, I should be able
to see that that failed. Instead of returning something bogus.

Or even more extreme, you untar/zip/git clone a directory. Then do a
sync. And you don't know whether anything actually succeeded.

> Consider if there was a per-inode "there was once an error writing this
> inode" flag.  Then fsync() would return an error on the inode forever,
> since there is no way in POSIX to clear this state, since it would need
> to be kept in case some new fd is opened on the inode and does an fsync()
> and wants the error to be returned.

The data in the file also is corrupt. Having to unmount or delete the
file to reset the fact that it can't safely be assumed to be on disk
isn't insane.

> > Both in postgres, *and* a lot of other applications, it's not at all
> > guaranteed to consistently have one FD open for every file
> > written. Therefore even the more recent per-fd errseq logic doesn't
> > guarantee that the failure will ever be seen by an application
> > diligently fsync()ing.
> ... only if the application closes all fds for the file before calling
> fsync.  If any fd is kept open from the time of the failure, it will
> return the original error on fsync() (and then no longer return it).
> It's not that you need to keep every fd open forever.  You could put them
> into a shared pool, and re-use them if the file is "re-opened", and call
> fsync on each fd before it is closed (because the pool is getting too big
> or because you want to flush the data for that file, or shut down the DB).
> That wouldn't require a huge re-architecture of PG, just a small library
> to handle the shared fd pool.

Except that postgres uses multiple processes. And works on a lot of
architectures.  If we started to fsync all opened files on process exit
our users would *lynch* us.  We'd need a complicated scheme that sends
processes across sockets between processes, then deduplicate them on the
receiving side, somehow figuring out which is the oldest filedescriptors
(handling clockdrift safely).

Note that it'd be perfectly fine that we've "thrown away" the buffer
contents if we'd get notified that the fsync failed. We could just do
WAL replay, and restore the contents (just was we do after crashes
and/or for replication).

> That might even improve performance, because opening and closing files
> is itself not free, especially if you are working with remote filesystems.

There's already a per-process cache of open files.

> > You'd not even need to have per inode information or such in the case
> > that the block device goes away entirely. As the FS isn't generally
> > unmounted in that case, you could trivially keep a per-mount (or
> > superblock?) bit that says "I died" and set that instead of keeping per
> > inode/whatever information.
> The filesystem will definitely return an error in this case, I don't
> think this needs any kind of changes:
> int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> {
>         if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>                 return -EIO;

Well, I'm making that argument because several people argued that
throwing away buffer contents in this case is the only way to not cause
OOMs, and that that's incompatible with reporting errors. It's clearly

> > 2018-04-10 18:43:56 Ted wrote:
> >> If you are aware of a company who is willing to pay to have a new
> >> kernel feature implemented to meet your needs, we might be able to
> >> refer you to a company or a consultant who might be able to do that
> >> work.
> > 
> > I find it a bit dissapointing response. I think it's fair to say that
> > for advanced features, but we're talking about the basic guarantee that
> > fsync actually does something even remotely reasonable.
> Linux (as PG) is run by people who develop it for their own needs, or
> are paid to develop it for the needs of others.


> Everyone already has too much work to do, so you need to find someone
> who has an interest in fixing this (IMHO very peculiar) use case.  If
> PG developers want to add a tunable "keep dirty pages in RAM on IO
> failure", I don't think that it would be too hard for someone to do.
> It might be harder to convince some of the kernel maintainers to
> accept it, and I've been on the losing side of that battle more than
> once.  However, like everything you don't pay for, you can't require
> someone else to do this for you.  It wouldn't hurt to see if Jeff
> Layton, who wrote the errseq patches, would be interested to work on
> something like this.

I don't think this is that PG specific, as explained above.


Andres Freund

Powered by blists - more mailing lists