linux-kernel - Re: Linux 2.6.29

Message-ID: <20090327055750.GA18065@srcf.ucam.org>
Date:	Fri, 27 Mar 2009 05:57:50 +0000
From:	Matthew Garrett <mjg59@...f.ucam.org>
To:	Theodore Tso <tytso@....edu>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	David Rees <drees76@...il.com>, Jesper Krogh <jesper@...gh.cc>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Linux 2.6.29

On Fri, Mar 27, 2009 at 01:13:39AM -0400, Theodore Tso wrote:

> There were plenty of applications that were written for Unix *and*
> Linux systems before ext3 existed, and they worked just fine.  Back
> then, people were drilled into the fact that they needed to use
> fsync(), and fsync() wasn't expensive, so there wasn't a big deal in
> terms of usability.  The fact that fsync() was expensive was precisely
> because of ext3's data=ordered problem.  Writing files safely meant
> that you had to check error returns from fsync() *and* close().  

And now life is better. UNIX's error handling has always meant that it's 
effectively impossible to ensure that data hits disk if you wander into 
a variety of error conditions, and by and large it's simply not worth 
worrying about them. You're generally more likely to hit a kernel bug or 
suffer hardware failure than find an error condition that can actually 
be handled in a sensible way, and the probability/effectiveness ratio is 
sufficiently low that there are better ways to spend your time unless 
you're writing absolutely mission critical code. So let's not focus on 
the risk of data loss from failing to check certain error conditions. 
It's a tiny risk compared to power loss.

> I can tell you quite authoritatively that we didn't implement
> data=ordered to make life easier for application writers, and
> application writers didn't come to ext3 developers asking for this
> convenience.  It may have **accidentally** given them convenience that
> they wanted, but it also made fsync() slow.  

It not only gave them that convenience, it *guaranteed* that 
convenience. And with ext3 being the standard filesystem in the Linux 
world, and every other POSIX system being by and large irrelevant[1], 
the real world effect of that was that Linux gave you that guarantee. 

> > I'm utterly and screamingly bored of this "Blame userspace" attitude. 
> 
> I'm not blaming userspace.  I'm blaming ourselves, for implementing an
> attractive nuisance, and not realizing that we had implemented an
> attractive nuisance; which years later, is also responsible for these
> latency problems, both with and without fsync() ---- *and* which have
> also trained people into believing that fsync() is always expensive,
> and must be avoided at all costs --- which had not previously been
> true!

But you're still arguing that applications should start using fsync(). 
I'm arguing that not only is this pointless (most of this code will 
never be "fixed") but it's also regressive. In most cases applications 
don't want the guarantees that fsync() makes, and given that we're going 
to have people running on ext3 for years to come they also don't want 
the performance hit that fsync() brings. Filesystems should just do the 
right thing, rather than losing people's data and then claiming that 
it's fine because POSIX said they could.
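
For concreteness, here is a minimal sketch of the kind of code in
question, assuming the usual "write a temporary file, then rename it
over the original" idiom (the rename step is an assumption; the message
above does not spell the pattern out). The replace_file name and the
durable flag are hypothetical. With durable=False the caller only asks
that a reader sees either the old or the new contents, and leaves
ordering across a crash to the filesystem. With durable=True it also
pays for fsync().

```python
import os
import tempfile


def replace_file(path, data, durable=False):
    """Replace `path` with `data` using a temporary file and a rename.

    Hypothetical helper for illustration only. With durable=False no
    fsync() is issued, so what survives a crash depends entirely on the
    filesystem's ordering behaviour.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            if durable:
                # Flush Python's buffer, then ask the kernel to push the
                # file's data to stable storage before the rename.
                tmp.flush()
                os.fsync(tmp.fileno())
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Most callers of something like this want the atomic swap, not the
durability, which is why the flag defaults to off in this sketch.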

> If I had to do it all over again, I would have argued with Stephen
> about making data=writeback the default, which would have provided
> behaviour on crash just like ext2, except that we wouldn't have to
> fsck the partition afterwards.  Back then, people lived with the
> potential security exposure on a crash, and they lived with the fact
> that you had to use fsync(), or manually type "sync", if you wanted to
> guarantee that data would be safely written to disk.  And you know
> what?  Things had been this way with Unix systems for 31 years before
> ext3 came on the scene, and things worked pretty well during those
> three decades.

Well, no. fsync() didn't appear in early Unix, so what people were 
actually willing to live with was restoring from backups if the system 
crashed. I'd argue that things are somewhat better these days, 
especially now that we're used to filesystems that don't require us to 
fsync(), close(), fsync the directory and possibly jump through even 
more hoops if faced with a pathological interpretation of POSIX. 
Progress is a good thing. The initial behaviour of ext4 in this respect 
wasn't progress.
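
As a rough illustration of those hoops, the sequence might look like
the sketch below, assuming a POSIX system where a directory can be
opened read-only and passed to fsync(). The write_durably name is
hypothetical. Python reports a failed fsync() or close() by raising
OSError, so the error checks are implicit here, and a caller that
actually cares has to catch the exception and decide what to do with
it.

```python
import os


def write_durably(path, data):
    """Write `data` to `path`, then fsync the file and its directory.

    Hypothetical helper for illustration only. Any failing step raises
    OSError, including fsync() and close().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop
        # until everything has been handed to the kernel.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Ask for the file's data to be pushed to stable storage.
        os.fsync(fd)
    finally:
        # close() can report an error too; it propagates as OSError.
        os.close(fd)

    # fsync the containing directory so the directory entry itself is
    # also on stable storage.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```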

And, really, I'm kind of amused at someone arguing for a given behaviour 
on the basis of POSIX while also suggesting that sync() is in any way 
helpful for guaranteeing that data is on disk.

> So again, let me make it clear, I'm not "blaming userspace".  I'm
> blaming ext3 data=ordered mode.  But it's trained application writers
> to program systems a certain way, and it's trained them to assume that
> fsync() is always evil, and they outnumber us kernel programmers, and
> so we are where we are.  And data=ordered mode is also responsible for
> these write latency problems which seem to make Ingo so cranky ---
> and rightly so.  It all comes from the same source.

No. People continue to use fsync() where fsync() should be used - for 
guaranteeing that given information has hit disk. The problem is that 
you're arguing that applications should use fsync() even when they don't 
want or need that guarantee. If anything, ext3 has been helpful in 
encouraging people to only use fsync() when they really need to - and 
that's a win for everyone.

[1] MacOS has users, but it's not a significant market for pure POSIX 
applications so isn't really an interesting counterexample
-- 
Matthew Garrett | mjg59@...f.ucam.org