linux-kernel - Re: Linux 2.6.29

Message-ID: <20090325195747.GC27476@kernel.dk>
Date:	Wed, 25 Mar 2009 20:57:47 +0100
From:	Jens Axboe <jens.axboe@...cle.com>
To:	Ric Wheeler <rwheeler@...hat.com>
Cc:	Jeff Garzik <jeff@...zik.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
	Alan Cox <alan@...rguk.ukuu.org.uk>,
	Arjan van de Ven <arjan@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
	Jesper Krogh <jesper@...gh.cc>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Linux 2.6.29

On Wed, Mar 25 2009, Ric Wheeler wrote:
> Jens Axboe wrote:
>> On Wed, Mar 25 2009, Jeff Garzik wrote:
>>   
>>> Jens Axboe wrote:
>>>     
>>>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>>>       
>>>>> Linus Torvalds wrote:
>>>>>         
>>>>>> But I really don't understand filesystem people who think that  
>>>>>> "fsck" is the important part, regardless of whether the data is 
>>>>>> valid or not. That's just stupid and _obviously_ bogus.
>>>>>>           
>>>>> I think I can understand that point of view, at least:
>>>>>
>>>>> More customers complain about hours-long fsck times than they do
>>>>> about silent data corruption of non-fsync'd files.
>>>>>
>>>>>
>>>>>         
>>>>>> The point is, if you write your metadata earlier (say, every 5 
>>>>>> sec) and the real data later (say, every 30 sec), you're 
>>>>>> actually MORE LIKELY to see corrupt files than if you try to 
>>>>>> write them together.
>>>>>>
>>>>>> And if you write your data _first_, you're never going to see   
>>>>>> corruption at all.
>>>>>>           
>>>>> Amen.
>>>>>
>>>>> And, personal filesystem pet peeve:  please encourage proper
>>>>> FLUSH CACHE use to give users the data guarantees they deserve.
>>>>> Linux's sync(2) and fsync(2) (and fdatasync, etc.) should poke
>>>>> the block layer to guarantee a media write.
>>>>>         
>>>> fsync already does that, at least if you have barriers enabled on your
>>>> drive.
>>>>       
>>> Erm, no, you don't enable barriers on your drive, they are not a   
>>> hardware feature.  You enable barriers via your filesystem.
>>>     
>>
>> Thanks for the lesson Jeff, I'm obviously not aware how that stuff
>> works...
>>
>>   
>>> Stating "fsync already does that" borders on false, because that assumes
>>> (a) the user has a fs that supports barriers
>>> (b) the user is actually aware of a 'barriers' mount option and what
>>> it means
>>> (c) the user has turned on an option normally defaulted to off.
>>>
>>> Or in other words, it pretty much never happens.
>>>     
>>
>> That is true, except if you use xfs/ext4. And this discussion is fine,
>> as was the one a few months back that got ext4 to enable barriers by
>> default. If I had submitted patches to do that back in 2001/2 when the
>> barrier stuff was written, I would have been shot for introducing such a
>> slow down. After people found out that it just wasn't something silly,
>> then you have a way to enable it.
>>
>> I'd still wager that most people would rather have a 'good enough
>> fsync' on their desktops than incur the penalty of barriers or write
>> through caching. I know I do.
>>
>>   
>>> Furthermore, a blatantly obvious place to flush data to media --
>>> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the
>>> block layer to issue a FLUSH CACHE for __any__ filesystem.  But that
>>> doesn't happen either.
>>>
>>> So, no, for 95% of Linux users, fsync does _not_ already do that.  If
>>> you are lucky enough to use XFS or ext4, you're covered.  That's it.
>>>     
>>
>> The point is that you need to expose this choice somewhere, and that
>> 'somewhere' isn't manually editing fstab and enabling barriers or
>> fsync-for-real. And it should be easier.
>>
>> Another problem is that FLUSH_CACHE sucks. Really. And not just on
>> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
>> wait for the world to finish. Pretty hard to teach people to use a nicer
>> fdatasync(), when the majority of the cost now becomes flushing the
>> cache of that 1TB drive you happen to have 8 partitions on. Good luck
>> with that.
>>
>>   
> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE  
> is per device (not file system).
>
> When you issue an fsync() on a disk with multiple partitions, you will  
> flush the data for all of its partitions from the write cache....

Exactly, that's what my (vague) 8 partition reference was for :-)
A range flush would be so much more palatable.
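
To put a rough number on the small-file case above (write 50 bytes, fsync,
wait), here is a minimal Python sketch. It is an illustration only, not code
from this thread: the helper name timed_small_write, the temp-file location
and the 50-byte payload are all assumptions. os.fsync() and os.fdatasync()
are thin wrappers over fsync(2) and fdatasync(2). Whether either call
actually reaches the media depends on the filesystem, the barrier setting
and the drive, as discussed above, so the timing only shows what the call
costs on a given setup.

```python
import os
import tempfile
import time


def timed_small_write(directory, payload=b"x" * 50, data_only=False):
    """Write a small file, sync it, and return (path, seconds spent syncing).

    Hypothetical helper for illustration. data_only=True requests
    fdatasync(2); otherwise fsync(2) is used.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Loop because os.write() may write fewer bytes than requested.
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)
            view = view[written:]

        # Fall back to fsync on platforms where Python lacks os.fdatasync.
        sync = getattr(os, "fdatasync", os.fsync) if data_only else os.fsync
        start = time.monotonic()
        sync(fd)  # blocks until the kernel reports the sync as complete
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return path, elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for data_only in (False, True):
            path, elapsed = timed_small_write(tmp, data_only=data_only)
            name = "fdatasync" if data_only else "fsync"
            size = os.path.getsize(path)
            print(f"{name}: {size} bytes, {elapsed * 1e3:.3f} ms")
```

Running this on the same disk with barriers on and then off would be one way
to see the penalty being argued about. No expected timings are given here
because they depend entirely on the hardware and mount options.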

-- 
Jens Axboe
