linux-kernel - Re: Linux 2.6.29

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20090325194341.GB27476@kernel.dk>
Date:	Wed, 25 Mar 2009 20:43:41 +0100
From:	Jens Axboe <jens.axboe@...cle.com>
To:	Jeff Garzik <jeff@...zik.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
	Alan Cox <alan@...rguk.ukuu.org.uk>,
	Arjan van de Ven <arjan@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
	Jesper Krogh <jesper@...gh.cc>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Linux 2.6.29

On Wed, Mar 25 2009, Jeff Garzik wrote:
> Jens Axboe wrote:
>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>> Linus Torvalds wrote:
>>>> But I really don't understand filesystem people who think that 
>>>> "fsck" is the important part, regardless of whether the data is 
>>>> valid or not. That's just stupid and _obviously_ bogus.
>>> I think I can understand that point of view, at least:
>>>
>>> More customers complain about hours-long fsck times than they do 
>>> about  silent data corruption of non-fsync'd files.
>>>
>>>
>>>> The point is, if you write your metadata earlier (say, every 5 sec) 
>>>> and the real data later (say, every 30 sec), you're actually MORE 
>>>> LIKELY to see corrupt files than if you try to write them together.
>>>>
>>>> And if you write your data _first_, you're never going to see  
>>>> corruption at all.
>>> Amen.
>>>
>>> And, personal filesystem pet peeve:  please encourage proper FLUSH 
>>> CACHE  use to give users the data guarantees they deserve.  Linux's 
>>> sync(2) and  fsync(2) (and fdatasync, etc.) should poke the block 
>>> layer to guarantee  a media write.
>>
>> fsync already does that, at least if you have barriers enabled on your
>> drive.
>
> Erm, no, you don't enable barriers on your drive, they are not a  
> hardware feature.  You enable barriers via your filesystem.

Thanks for the lesson Jeff, I'm obviously not aware how that stuff
works...

> Stating "fsync already does that" borders on false, because that assumes
> (a) the user has a fs that supports barriers
> (b) the user is actually aware of a 'barriers' mount option and what it  
> means
> (c) the user has turned on an option normally defaulted to off.
>
> Or in other words, it pretty much never happens.

That is true, except if you use xfs/ext4. And this discussion is fine,
as was the one a few months back that got ext4 to enable barriers by
default. If I had submitted patches to do that back in 2001/2 when the
barrier stuff was written, I would have been shot for introducing such a
slow down. After people found out that it just wasn't something silly,
then you have a way to enable it.

I'd still wager that most people would rather have a 'good enough
fsync' on their desktops than incur the penalty of barriers or write
through caching. I know I do.

> Furthermore, a blatantly obvious place to flush data to media --  
> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block  
> layer to issue a FLUSH CACHE for __any__ filesystem.  But that doesn't  
> happen either.
>
> So, no, for 95% of Linux users, fsync does _not_ already do that.  If  
> you are lucky enough to use XFS or ext4, you're covered.  That's it.

The point is that you need to expose this choice somewhere, and that
'somewhere' isn't manually editing fstab and enabling barriers or
fsync-for-real. And it should be easier.

Another problem is that FLUSH_CACHE sucks. Really. And not just on
ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
wit for the world to finish. Pretty hard to teach people to use a nicer
fdatasync(), when the majority of the cost now becomes flushing the
cache of that 1TB drive you happen to have 8 partitions on. Good luck
with that.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/