linux-kernel - Re: writing file to disk: not as easy as it looks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081203084639.GB1944@ucw.cz>
Date:	Wed, 3 Dec 2008 09:46:40 +0100
From:	Pavel Machek <pavel@...e.cz>
To:	Theodore Tso <tytso@....edu>, Chris Friesen <cfriesen@...tel.com>,
	mikulas@...ax.karlin.mff.cuni.cz, clock@...ey.karlin.mff.cuni.cz,
	kernel list <linux-kernel@...r.kernel.org>, aviro@...hat.com
Subject: Re: writing file to disk: not as easy as it looks

On Wed 2008-12-03 00:07:09, Theodore Tso wrote:
> On Tue, Dec 02, 2008 at 11:44:03PM +0100, Pavel Machek wrote:
> > > >
> > > > Yikes.  I was under the impression that once the journal hit the platter  
> > > > then the data were safe (barring media corruption).
> > > 
> > > Well, this is a case of media corruption (or a cosmic ray hitting
> > > hitting a ribbon cable in the disk controller sending the write to the
> > > wrong location on disk, or someone bumping the server causing the disk
> > > head to lift up a little higher than normal while it was writing the
> > > disk sector, etc.).  But it is a case of the hard drive misbehaving. 
> > 
> > I could not parse this. Negation seems to be missing somewhere.
> 
> I was agreeing with your original statement.  Once the journal hits
> the platter, the data is safe, barring hard drive malfunctions (not
> just media corruption).  I was just listing the many other types of
> hard drive failures that could cause data loss.

Aha, ok, sorry for confusion.

> > Ok, "memory failed before disk" is ... bad hardware.
> 
> It's PC class hardware.  Live with it.  Back when SGI made their own
> hardware, they noticed this problem, and so they wired up their SGI
> machines with powerfail interrupts, and extra big capacitors in their
> power supplies, and when Irix got a powerfail interrupt, it would
> frantically run around aborting DMA transfers to avoid this particular
> problem.  At least, that's what an old-timer SGI engineer (who is
> unfortunately no longer at SGI) told me.
> 
> PC class hardware don't have power fail interrupts.  Hence, my advice
> to you is that if you use a filesystem that does logical journalling
> --- better have a UPS.

Hmm, 'just avoid logical journalling' seems like a better solution
:-).

> > ...but... you seem to be saying that modern filesystems can damage
> > data even on "sane" hardware.
> 
> The example I gave was one where a disk failure could cause a file
> that had previously been sucessfully written to disk and fsync()'ed to
> be damaged by another filesystem operation ***in the face of hard
> drive failure***.  Surely that is obvious.  The most obvious case of

Ok.

> The example I gave, where a b-tree is doing split, and there is a
> failure writing to the b-tree causing ancillary damage files
> referenced in the b-tree node getting split, can happen with **any**
> filesystem.  The only thing that will save you here would be a
> copy-on-write type filesystem, such as WAFL or btrfs.

ext3-like physical journaling could be extended to handle write
failures (at speed penalty), no?

Write 'I will rewrite block A containing B with C' into journal... ok,
I guess I should wait for btrfs.

> > You seem to be saying that ext2/ext3 only work if these are met:
> > 
> > 1) power may fail any time.
> 
> Well, ext2/ext3 will work fine if the power is always reliable, too.  :-)

:-) ok.

> > 2) writes are always successful.
> 
> To the extent that write failures while writing filesystem metdata
> can, if you are unluky be catastrophic, yeah.  Fortunally normally
> such write failures are fairly rare, but if you worry about such
> things, RAID is the answer.  As I said, I believe this is going to be
> true for pretty much any update-in-place filesystem.  It's always
> possible to construct failure scenarios if the hardware is unreliable.

Ok.

> > 3) connection to the disk always works.
> > 
> > AFAICT it is unsafe to run ext2/ext3 on any media that can be removed
> > without unmounting (missing fsync error propagation), and it is unsafe
> > to run ext2/ext3 on any flash-based storage with block interface (SD
> > cards, flash sticks).
> 
> The data on the disk before the connection is yanked should be safe
> (although as we mentioned in another thread, the flash drive itself
> may not be happy if you are writing to the Flash Translation Layer at
> the time when power is cut; if that causes a previously written sector
> to disappear, that's an example of a hardware failure that **any**
> filesystem won't necessarily be able to recover from).
> 
> Your definition of "safe" seems to include worrying about making sure
> that all processes that may have previously touched a file or a
> directory gets an error when they try to do an fsync() on that file or
> directory, and that given that fsync clears the error condition after
> it returns it,it is therefore "unsafe".  

Yes. fsync() seeems surprisingly high on Rusty's list of broken
interfaces classification ('impossible to use correctly').

I wonder if some reasonable solution exists? Mark filesystem as failed
on first  write error is one of those (and default for ext2/3?). Did
SGI/big unixen solve this somehow?

> The reality is that most applications don't proper error checking, and
> even fewer actually call fsync(), so if you are putting your root
> filesytem on a 32G flash card, and it pops out easily due to hardware
> design issues, the question of whether fsync() gets properly progated
> to all potentially interested applications is the ***least*** of your
> worries.

Yes, most applications are bad. Yes, I should just glue the card into
the slot. No, fsync interface does not look properly designed. No, it
is not causing me immediate problems (mount -o dirsync mostly works
around that). I wonder if good, long-term solution exists...


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/