Message-ID: <20090325195747.GC27476@kernel.dk>
Date: Wed, 25 Mar 2009 20:57:47 +0100
From: Jens Axboe <jens.axboe@...cle.com>
To: Ric Wheeler <rwheeler@...hat.com>
Cc: Jeff Garzik <jeff@...zik.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Arjan van de Ven <arjan@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
Jesper Krogh <jesper@...gh.cc>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Linux 2.6.29
On Wed, Mar 25 2009, Ric Wheeler wrote:
> Jens Axboe wrote:
>> On Wed, Mar 25 2009, Jeff Garzik wrote:
>>
>>> Jens Axboe wrote:
>>>
>>>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>>>
>>>>> Linus Torvalds wrote:
>>>>>
>>>>>> But I really don't understand filesystem people who think that
>>>>>> "fsck" is the important part, regardless of whether the data is
>>>>>> valid or not. That's just stupid and _obviously_ bogus.
>>>>>>
>>>>> I think I can understand that point of view, at least:
>>>>>
>>>>> More customers complain about hours-long fsck times than they do
>>>>> about silent data corruption of non-fsync'd files.
>>>>>
>>>>>
>>>>>
>>>>>> The point is, if you write your metadata earlier (say, every 5
>>>>>> sec) and the real data later (say, every 30 sec), you're
>>>>>> actually MORE LIKELY to see corrupt files than if you try to
>>>>>> write them together.
>>>>>>
>>>>>> And if you write your data _first_, you're never going to see
>>>>>> corruption at all.
>>>>>>
>>>>> Amen.
>>>>>
>>>>> And, personal filesystem pet peeve: please encourage proper
>>>>> FLUSH CACHE use to give users the data guarantees they deserve.
>>>>> Linux's sync(2) and fsync(2) (and fdatasync, etc.) should poke
>>>>> the block layer to guarantee a media write.
>>>>>
>>>> fsync already does that, at least if you have barriers enabled on your
>>>> drive.
>>>>
>>> Erm, no, you don't enable barriers on your drive, they are not a
>>> hardware feature. You enable barriers via your filesystem.
>>>
>>
>> Thanks for the lesson Jeff, I'm obviously not aware how that stuff
>> works...
>>
>>
>>> Stating "fsync already does that" borders on false, because that assumes
>>> (a) the user has a fs that supports barriers
>>> (b) the user is actually aware of a 'barriers' mount option and what
>>> it means
>>> (c) the user has turned on an option normally defaulted to off.
>>>
>>> Or in other words, it pretty much never happens.
>>>
>>
>> That is true, except if you use xfs/ext4. And this discussion is fine,
>> as was the one a few months back that got ext4 to enable barriers by
>> default. If I had submitted patches to do that back in 2001/2 when the
>> barrier stuff was written, I would have been shot for introducing such a
>> slow down. After people found out that it just wasn't something silly,
>> then you have a way to enable it.
>>
>> I'd still wager that most people would rather have a 'good enough
>> fsync' on their desktops than incur the penalty of barriers or write
>> through caching. I know I do.
>>
>>
>>> Furthermore, a blatantly obvious place to flush data to media --
>>> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the
>>> block layer to issue a FLUSH CACHE for __any__ filesystem. But that
>>> doesn't happen either.
>>>
>>> So, no, for 95% of Linux users, fsync does _not_ already do that. If
>>> you are lucky enough to use XFS or ext4, you're covered. That's it.
>>>
>>
>> The point is that you need to expose this choice somewhere, and that
>> 'somewhere' isn't manually editing fstab and enabling barriers or
>> fsync-for-real. And it should be easier.
>>
>> Another problem is that FLUSH_CACHE sucks. Really. And not just on
>> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
>> wait for the world to finish. Pretty hard to teach people to use a nicer
>> fdatasync(), when the majority of the cost now becomes flushing the
>> cache of that 1TB drive you happen to have 8 partitions on. Good luck
>> with that.
>>
>>
> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
> is per device (not file system).
>
> When you issue an fsync() on a disk with multiple partitions, you will
> flush the data for all of its partitions from the write cache....
Exactly, that's what my (vague) 8 partition reference was for :-)
A range flush would be so much more palatable.
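
For anyone who wants to see the cost being discussed, here is a rough sketch of the "write a 50 byte file, then sync" case. It is not from the thread: the helper name timed_small_write is made up for illustration, and Python is used only for brevity. Whether the sync call actually reaches the media depends on the filesystem and on whether barriers are enabled, as argued above, so the numbers say nothing definitive about a given setup.

```python
import os
import tempfile
import time


def timed_small_write(sync_fn, payload=b"x" * 50):
    """Write a small payload to a fresh temp file, call sync_fn(fd),
    and return (seconds spent in sync_fn, resulting file size).

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        sync_fn(fd)  # e.g. os.fsync or os.fdatasync
        elapsed = time.perf_counter() - start
        size = os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed, size


if __name__ == "__main__":
    # os.fdatasync is not available on every platform; fall back to fsync.
    fdatasync = getattr(os, "fdatasync", os.fsync)
    for name, fn in (("fsync", os.fsync), ("fdatasync", fdatasync)):
        elapsed, size = timed_small_write(fn)
        print(f"{name}: {size} bytes, {elapsed * 1000:.3f} ms")
```

Running it on a filesystem mounted with and without barriers should show how much of the time goes to the device-wide cache flush rather than to the 50 bytes themselves.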
--
Jens Axboe