Message-Id: <200901041950.12331.rob@landley.net>
Date: Sun, 4 Jan 2009 19:50:11 -0600
From: Rob Landley <rob@...dley.net>
To: Pavel Machek <pavel@...e.cz>
Cc: kernel list <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...l.org>, tytso@....edu,
mtk.manpages@...il.com, rdunlap@...otime.net,
linux-doc@...r.kernel.org
Subject: Re: document ext3 requirements
On Sunday 04 January 2009 16:55:45 Pavel Machek wrote:
> On Sun 2009-01-04 13:49:49, Rob Landley wrote:
> > On Saturday 03 January 2009 06:38:15 Pavel Machek wrote:
> > > +Ext3 expects the disk/storage subsystem to behave sanely. On a
> > > +sanely behaving disk subsystem, data that have been successfully
> > > +synced will stay on the disk. Sane means:
> > > +
> > > +* writes to media never fail. Even if the disk returns an error
> > > +  condition during a write, ext3 can't handle that correctly,
> > > +  because success on fsync was already returned when the data hit
> > > +  the journal.
> > > +
> > > +  (Fortunately, failing writes are very uncommon on disks, as they
> > > +  have spare sectors they use when a write fails.)
> > > +
> > > +* either the whole sector is correctly written or nothing is
> > > +  written during powerfail.
> > > +
> > > +  (Unfortunately, none of the cheap USB/SD flash cards I've seen
> > > +  behave like this, and they are unsuitable for ext3.
> >
> > Want to document the granularity issues with flash, while you're at it?
> >
> > An inherent problem with using flash as a normal block device is that the
> > flash erase size is bigger than most filesystem sector sizes. So when
> > you request a write, it may erase and rewrite the next 64k, 128k, or even
> > a couple megabytes on the really _big_ ones.
> >
> > If you lose power in the middle of that, ext3 won't notice that data in
> > the "sectors" _after_ the one you were trying to write to got trashed.
> >
> > The flash filesystems take this into account as part of their wear
> > levelling stuff (they normally copy the entire chunk into a new chunk,
> > leaving the old one in place until it's no longer needed), but they need
> > to query the device to get the erase granularity in order to do that,
> > which is why they don't work on non-flash block devices.
>
> Is there a Linux filesystem that can handle that? I know jffs2, but
> that's unsuitable for stuff like USB thumb drives, right?
Any of the flash filesystems should handle that. The main problem with jffs2
is it doesn't scale well to large device sizes. UBIFS is supposed to scale
much better, but I haven't played with it yet.
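In case it helps, here's roughly what that "copy the entire chunk into a new
chunk" dance looks like, as a toy model. The erase size, the array standing
in for the chip, and the block map are all invented for illustration (and map
initialization is elided); a real flash filesystem makes the map switch
atomic with a log write on the medium, not a variable in RAM.

/* Toy model of the out-of-place erase block update flash filesystems
 * do, so a power cut can't trash neighboring "sectors".  Everything
 * here is made up for illustration; map initialization is elided. */
#include <stdint.h>
#include <string.h>

#define ERASE_SIZE  (64 * 1024)   /* one erase block */
#define SECTOR_SIZE 512
#define NUM_BLOCKS  16

static uint8_t chip[NUM_BLOCKS][ERASE_SIZE]; /* stand-in for the flash */
static int map[NUM_BLOCKS - 1];              /* logical -> physical    */
static int spare = NUM_BLOCKS - 1;           /* one block kept erased  */

void write_sector(int lblock, int sector, const uint8_t *data)
{
    int old = map[lblock];

    /* 1. Copy the whole old erase block into the spare block. */
    memcpy(chip[spare], chip[old], ERASE_SIZE);
    /* 2. Patch in the new sector contents. */
    memcpy(chip[spare] + sector * SECTOR_SIZE, data, SECTOR_SIZE);
    /* 3. Only now retarget the mapping.  Lose power any earlier and
     *    the old block is still intact and still the live copy. */
    map[lblock] = spare;
    /* 4. The old block becomes the new spare, to be erased later. */
    spare = old;
}

The point is that nothing live is ever erased in place, which is exactly what
you can't do unless you know where the erase block boundaries are.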
And the thing about USB thumb drives is that they present as a normal block
device, _not_ as flash, so you can't _query_ their erase granularity. (It's
like those hardware RAID cards that wouldn't tell you they were striping and
such, so you had to figure out a well-performing layout all by yourself.)
They do it magically behind the scenes, and if the power goes out (or you
yank the device out unexpectedly) and they haven't got a built-in capacitor
or battery with enough power to complete their pending transaction, you're
screwed.
Plus they do horrible wear levelling, the lot of 'em. Read Val Henson's
livejournal entry about it: http://valhenson.livejournal.com/25228.html
There was also a marvelous thread Linus participated in on some hardware
industry web message board, but I have no idea where it's gone...
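For what it's worth, getting the erase geometry out of a real MTD device is a
one-ioctl job; aim the same thing at a thumb drive's block device and
MEMGETINFO just fails with ENOTTY, because a plain block device has no erase
geometry to report. A minimal sketch (the device paths are examples):

/* Print the erase block size of an MTD device, e.g. /dev/mtd0.  On a
 * USB stick's /dev/sdb the ioctl fails with ENOTTY: a plain block
 * device has no erase geometry to ask about. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(int argc, char *argv[])
{
    struct mtd_info_user info;
    int fd = open(argc > 1 ? argv[1] : "/dev/mtd0", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ioctl(fd, MEMGETINFO, &info) < 0) {
        perror("MEMGETINFO");   /* ENOTTY on a non-MTD block device */
        return 1;
    }
    printf("erase size: %u bytes\n", info.erasesize);
    close(fd);
    return 0;
}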
> Does this sound like a fair summary?
See Ted's comment. The summary's fine; the question is where to put this sort
of thing...
> If you lose power in the middle of that, the filesystem
> won't notice that data in the "sectors" _after_ the
> one you were trying to write to got trashed.
Well, the journal won't notice. An e2fsck will notice huge swaths of missing
metadata, but won't be able to do anything about it. (And if what got zapped
was file _contents_ rather than metadata, you're on your own finding it. Fun,
isn't it?)
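To make the patch's fsync point concrete: once the fsync below succeeds, the
application has been told the data is safe (per the patch text, it hit the
journal), and a write error during the later copy to its final on-disk
location has no syscall left to be reported through. The filename and data
here are made up.

/* After fsync() returns 0 the application is out of the picture; a
 * later media error while the data moves to its final location is
 * silent as far as this program is concerned. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/important", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0 || write(fd, "precious\n", 9) != 9 || fsync(fd) != 0) {
        perror("write/fsync");
        return 1;
    }
    /* "Success" here means journaled, not necessarily on the platter. */
    close(fd);
    return 0;
}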
> Pavel
Rob