Date:   Thu, 12 Apr 2018 13:13:22 -0700
From:   Andres Freund <>
To:     "Theodore Y. Ts'o" <>
Cc:     Dave Chinner <>,
        Jeff Layton <>,
        Matthew Wilcox <>,
        Andreas Dilger <>,
        Ext4 Developers List <>,
        Linux FS Devel <>,
        "Joshua D. Drake" <>
Subject: Re: fsync() errors is unsafe and risks data loss


On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote:
> That's the problem.  The best that could be done (and it's not enough)
> would be to have a mode which does what the PG folks want (or what
> they *think* they want).  It seems what they want is to have an error
> result in the page being marked clean.  When they discover the outcome
> (OOM-city and the inability to unmount a file system on a failed
> drive), then they will complain to us *again*, at which point we can
> tell them that what they really want is another variation on O_PONIES,
> and welcome to the real world and real life.

I think a per-file or even per-blockdev/fs error state that'd be
returned by fsync() would be more than sufficient.  I don't see how
that would realistically trigger OOM or make a filesystem impossible
to unmount.  If the drive is entirely gone there's obviously no point
in keeping per-file information around, so per-blockdev/fs information
suffices entirely to return an error on fsync (which at least on ext4
appears to happen if the underlying blockdev is gone).
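As a rough sketch of that model (names hypothetical; this just assumes
fsync() surfaces the accumulated writeback error for the file or
filesystem, which Python reports as an OSError):

```python
import os

def checkpoint_flush(path):
    """Hypothetical helper: fsync one file and surface any accumulated
    writeback error.  Under the per-file/per-fs error-state model,
    os.fsync() raises OSError (e.g. EIO) if an earlier buffered write
    to this file failed, instead of the error being silently dropped."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # raises OSError if a writeback error is pending
    finally:
        os.close(fd)
```

A checkpointer would call something like this for every file it
touched and abort the checkpoint on any exception, rather than
assuming durability.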

Have fun making up things we want, but I'm not sure it's particularly
helpful.

> Which is why, even if they were to pay someone to implement what they
> want, I'm not sure we would want to accept it upstream --- or distro's
> might consider it a support nightmare, and refuse to allow that mode
> to be enabled on enterprise distro's.  But at least, it will have been
> some PG-based company who will have implemented it, so they're not
> wasting other people's time or other people's resources...

Well, that's why I'm discussing here, so we can figure out what's
acceptable before we consider wasting money and review cycles doing,
or paying somebody to do, some crazy useless shit.

> We could try to get something like what Google is doing upstream,
> which is to have the I/O errors sent to userspace via a netlink
> channel (without changing anything else about how buffered writeback
> is handled in the face of errors).

Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd
be upstream. And yes, that'd be perfect.

> Then userspace applications could switch to Direct I/O like all of the
> other really serious userspace storage solutions I'm aware of, and
> then someone could try to write some kind of HDD health monitoring
> system that tries to do the right thing when a disk is discovered to
> have developed some media errors or something more serious (e.g., a
> head failure).  That plus some kind of RAID solution is I think the
> only thing which is really realistic for a typical PG site.

As I said earlier, I think there's good reason to move to DIO for
postgres. But keeping that performant is going to need some serious
work.

But afaict such a solution wouldn't really depend on whether
applications use DIO. Before finishing a checkpoint (logging it
persistently and allowing older data to be thrown away), we could
check whether any errors have been reported and give up if there have
been any.  And after starting postgres on a directory restored from
backup using $tool, we can fsync the directory recursively, check for
such errors, and give up if there've been any.
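The restore-time step could look roughly like this (a sketch, not
PostgreSQL's actual code; assumes writeback errors surface as OSError
from os.fsync, so any failure propagates and the caller gives up):

```python
import os

def fsync_recursively(root):
    """Walk the tree bottom-up, fsyncing every file and then every
    directory, so directory entries are flushed after their contents.
    Any OSError (e.g. EIO) propagates to the caller, which should
    refuse to start rather than trust a restore that may have lost
    writes."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        fd = os.open(dirpath, os.O_RDONLY)  # fsync the directory itself
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```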


Andres Freund
