Message-ID: <49CA8ADA.3040709@redhat.com>
Date: Wed, 25 Mar 2009 15:49:46 -0400
From: Ric Wheeler <rwheeler@...hat.com>
To: Jens Axboe <jens.axboe@...cle.com>
CC: Jeff Garzik <jeff@...zik.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Arjan van de Ven <arjan@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
Jesper Krogh <jesper@...gh.cc>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Linux 2.6.29
Jens Axboe wrote:
> On Wed, Mar 25 2009, Jeff Garzik wrote:
>
>> Jens Axboe wrote:
>>
>>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>>
>>>> Linus Torvalds wrote:
>>>>
>>>>> But I really don't understand filesystem people who think that
>>>>> "fsck" is the important part, regardless of whether the data is
>>>>> valid or not. That's just stupid and _obviously_ bogus.
>>>>>
>>>> I think I can understand that point of view, at least:
>>>>
>>>> More customers complain about hours-long fsck times than they do
>>>> about silent data corruption of non-fsync'd files.
>>>>
>>>>
>>>>
>>>>> The point is, if you write your metadata earlier (say, every 5 sec)
>>>>> and the real data later (say, every 30 sec), you're actually MORE
>>>>> LIKELY to see corrupt files than if you try to write them together.
>>>>>
>>>>> And if you write your data _first_, you're never going to see
>>>>> corruption at all.
>>>>>
>>>> Amen.
>>>>
>>>> And, personal filesystem pet peeve: please encourage proper FLUSH
>>>> CACHE use to give users the data guarantees they deserve. Linux's
>>>> sync(2) and fsync(2) (and fdatasync, etc.) should poke the block
>>>> layer to guarantee a media write.
>>>>
>>> fsync already does that, at least if you have barriers enabled on your
>>> drive.
>>>
>> Erm, no, you don't enable barriers on your drive, they are not a
>> hardware feature. You enable barriers via your filesystem.
>>
>
> Thanks for the lesson Jeff, I'm obviously not aware how that stuff
> works...
>
>
>> Stating "fsync already does that" borders on false, because that assumes
>> (a) the user has a fs that supports barriers
>> (b) the user is actually aware of a 'barriers' mount option and what it
>> means
>> (c) the user has turned on an option normally defaulted to off.
>>
>> Or in other words, it pretty much never happens.
>>
>
> That is true, except if you use xfs/ext4. And this discussion is fine,
> as was the one a few months back that got ext4 to enable barriers by
> default. If I had submitted patches to do that back in 2001/2 when the
> barrier stuff was written, I would have been shot for introducing such a
> slow down. After people found out that it just wasn't something silly,
> then you have a way to enable it.
>
> I'd still wager that most people would rather have a 'good enough
> fsync' on their desktops than incur the penalty of barriers or write
> through caching. I know I do.
>
>
>> Furthermore, a blatantly obvious place to flush data to media --
>> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the block
>> layer to issue a FLUSH CACHE for __any__ filesystem. But that doesn't
>> happen either.
>>
>> So, no, for 95% of Linux users, fsync does _not_ already do that. If
>> you are lucky enough to use XFS or ext4, you're covered. That's it.
>>
>
> The point is that you need to expose this choice somewhere, and that
> 'somewhere' isn't manually editing fstab and enabling barriers or
> fsync-for-real. And it should be easier.
>
> Another problem is that FLUSH_CACHE sucks. Really. And not just on
> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
> wait for the world to finish. Pretty hard to teach people to use a nicer
> fdatasync(), when the majority of the cost now becomes flushing the
> cache of that 1TB drive you happen to have 8 partitions on. Good luck
> with that.
>
>
And, as I am sure you know, to add insult to injury, FLUSH_CACHE is per
device (not per file system).
When you issue an fsync() on a disk with multiple partitions, you will
flush the cached writes for all of its partitions, not just the one
holding your file....
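
To make the "write a 50 byte file, fsync" case above concrete, here is a
minimal sketch in Python that writes a tiny file and times the sync call.
The helper name write_and_sync and the file name tiny.dat are made up for
illustration. Note that the timing only covers what the kernel does for
this call; whether a FLUSH CACHE actually reaches the drive depends on the
filesystem and barrier settings discussed in this thread, and the script
cannot tell you that.

```python
import os
import tempfile
import time


def write_and_sync(path, payload, data_only=False):
    """Write payload to path, sync it, and return the seconds spent syncing.

    Hypothetical helper for illustration only. With data_only=True it uses
    fdatasync() where the platform provides it, otherwise it uses fsync().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop.
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)
            view = view[written:]

        use_fdatasync = data_only and hasattr(os, "fdatasync")
        sync = os.fdatasync if use_fdatasync else os.fsync

        start = time.monotonic()
        sync(fd)  # blocks until the kernel reports the sync as complete
        return time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "tiny.dat")
        for data_only in (False, True):
            elapsed = write_and_sync(target, b"x" * 50, data_only)
            name = "fdatasync" if data_only else "fsync"
            print(f"{name}: {elapsed * 1000:.3f} ms")
```

Running this on a busy disk with the write cache enabled and barriers on
should show the sync cost dwarfing the 50-byte write, but the numbers will
vary a lot between setups, so treat them as a rough probe and not as a
benchmark.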
ric