linux-kernel - Re: POSIX violation by writeback error


Message-ID: <09ba078797a1327713e5c2d3111641246451c06e.camel@redhat.com>
Date:   Wed, 05 Sep 2018 06:55:15 -0400
From:   Jeff Layton <jlayton@...hat.com>
To:     焦晓冬 <milestonejxd@...il.com>
Cc:     bfields@...ldses.org, R.E.Wolff@...wizard.nl,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: POSIX violation by writeback error

On Wed, 2018-09-05 at 16:24 +0800, 焦晓冬 wrote:
> On Wed, Sep 5, 2018 at 4:18 AM Jeff Layton <jlayton@...hat.com> wrote:
> > 
> > On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > > Well, I think the point was that in the above examples you'd prefer that
> > > > > the read just fail--no need to keep the data.  A bit marking the file
> > > > > (or even the entire filesystem) unreadable would satisfy posix, I guess.
> > > > > Whether that's practical, I don't know.
> > > > 
> > > > When you would do it like that (mark the whole filesystem as "in
> > > > error") things go from bad to worse even faster. The Linux kernel
> > > > tries to keep the system up even in the face of errors.
> > > > 
> > > > With that suggestion, having one application run into a writeback
> > > > error would effectively crash the whole system because the filesystem
> > > > may be the root filesystem and stuff like "sshd" that you need to
> > > > diagnose the problem needs to be read from the disk....
> > > 
> > > Well, the absolutist position on posix compliance here would be that a
> > > crash is still preferable to returning the wrong data.  And for the
> > > cases 焦晓冬 gives, that sounds right?  Maybe it's the wrong balance in
> > > general, I don't know.  And we do already have filesystems with
> > > > panic-on-error options, so if they aren't used then maybe users
> > > have already voted against that level of strictness.
> > > 
> > 
> > Yeah, idk. The problem here is that this is squarely in the domain of
> > implementation defined behavior. I do think that the current "policy"
> > (if you call it that) of what to do after a wb error is weird and wrong.
> > What we probably ought to do is start considering how we'd like it to
> > behave.
> > 
> > How about something like this?
> > 
> > Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> > reads from the cached data until someone calls fsync, at which point
> > we'd return the error and invalidate the uncleanable pages.
> 
> Totally agree with you.
> 
> > 
> > If no one calls fsync and scrapes the error, we'll hold on to it for as
> > long as we can (or up to some predefined limit) and then after that
> > we'll invalidate the uncleanable pages and start returning errors on
> > reads. If someone eventually calls fsync afterward, we can return to
> > normal operation.
> 
> Agree with you except that using fsync() as `clear_error_mark()` seems
> weird and counter-intuitive.
> 

That is essentially how fsync (and the errseq_t infrastructure) works. 

Once the kernel has hit a wb error, it reports that error to fsync
exactly once per fd. In practice, the errors are not "cleared", but it
appears that way to the fsync caller.
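
As a rough illustration of the "once per fd" behaviour, here is a simplified
userspace model. It is not the kernel's errseq_t code, and every name in it
(wb_state, open_file, model_open, model_record_error, model_fsync) is made up
for the sketch. It also leaves out details of the real implementation, such
as how an error that nobody has observed yet is treated for files opened
afterwards. The idea is only that the error stays recorded on the inode while
each open file keeps its own cursor:

```c
#include <errno.h>

/* Hypothetical per-inode state: the last writeback error plus a counter. */
struct wb_state {
    unsigned int seq;   /* bumped every time a writeback error is recorded */
    int err;            /* most recent error as a negative errno, 0 if none */
};

/* Hypothetical per-open-file state: which counter value was last seen. */
struct open_file {
    struct wb_state *wb;
    unsigned int since;
};

/* Opening a file samples the current counter value. */
static void model_open(struct open_file *f, struct wb_state *wb)
{
    f->wb = wb;
    f->since = wb->seq;
}

/* Writeback failed: remember the error and advance the counter. */
static void model_record_error(struct wb_state *wb, int err)
{
    wb->err = err;
    wb->seq++;
}

/*
 * Report the error only if the counter moved since this file last looked,
 * then advance this file's cursor. Nothing is cleared on the inode, so
 * other open files still get their own report.
 */
static int model_fsync(struct open_file *f)
{
    if (f->since == f->wb->seq)
        return 0;
    f->since = f->wb->seq;
    return f->wb->err;
}
```

With two open files on the same inode and one recorded -EIO, each file's
first model_fsync() returns -EIO and its second returns 0, which is why the
error looks "cleared" to each caller even though the inode still holds it.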

> > 
> > As always though...what about mmap? Would we need to SIGBUS at the point
> > where we'd start returning errors on read()?
> 
> I think SIGBUS to mmap() is the same thing as EIO to read().
> 
> > 
> > Would that approximate the current behavior enough and make sense?
> > Implementing it all sounds non-trivial though...
> 
> No.
> No problem is reported because nowadays we are relying on the
> underlying disk drives. They transparently redirect bad sectors and
> use S.M.A.R.T. to warn us long before a real EIO could be seen.
> As to network filesystems, if I'm not wrong, close() op calls fsync()
> inside the implementation. So there is also no problem.

There is no requirement for a filesystem to flush data on close(). In
fact, most local filesystems do not. NFS does, but that's because it has
to in order to provide close-to-open cache consistency semantics.
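
For applications, the practical consequence is that a successful close()
says nothing about whether the data reached storage. A minimal sketch of the
usual pattern follows, assuming a hypothetical helper named
write_file_durably() (not an existing API): call fsync() before close() and
check the return value of both.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path and make sure a writeback
 * error is noticed. Returns 0 on success, -1 with errno set on failure.
 */
static int write_file_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    /* write() may be partial or interrupted, so loop until done. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += n;
        len -= (size_t)n;
    }

    /* This is where a writeback error (e.g. EIO) gets reported. */
    if (fsync(fd) < 0)
        goto fail;

    /* close() can also fail; pass its result on to the caller. */
    return close(fd);

fail:
    saved = errno;      /* keep the original error across close() */
    close(fd);
    errno = saved;
    return -1;
}
```

A caller would treat a -1 return as "the data may not be on disk" and decide
for itself whether to retry, rewrite the file elsewhere, or give up.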
-- 
Jeff Layton <jlayton@...hat.com>