Message-ID: <49D239A0.5080405@redhat.com>
Date: Tue, 31 Mar 2009 11:41:20 -0400
From: Ric Wheeler <rwheeler@...hat.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
CC: Jens Axboe <jens.axboe@...cle.com>,
    Fernando Luis Vázquez Cao <fernando@....ntt.co.jp>,
    Jeff Garzik <jeff@...zik.org>,
    Christoph Hellwig <hch@...radead.org>,
    Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
    Alan Cox <alan@...rguk.ukuu.org.uk>,
    Arjan van de Ven <arjan@...radead.org>,
    Andrew Morton <akpm@...ux-foundation.org>,
    Peter Zijlstra <a.p.zijlstra@...llo.nl>,
    Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
    Jesper Krogh <jesper@...gh.cc>,
    Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
    chris.mason@...cle.com, david@...morbit.com, tj@...nel.org
Subject: Re: [PATCH 1/7] block: Add block_flush_device()

Linus Torvalds wrote:
>
> On Tue, 31 Mar 2009, Ric Wheeler wrote:
>> Now you are just being silly. The drive and the write cache - without barriers
>> or similar tagged operations - will almost certainly reorder all of the IO's
>> internally.
>
> You do realize that the "drive" may not be a drive at all?
>
> But apparently you don't. You really seem to see just your own case, and
> have blinders on for everything else.
>
> That "drive" may be some virtualized device. It may be some super-fancy
> memory mapped and largely undocumented random flash thing. It might be a
> network block device, it may be somebody's IO trace dummy layer, it may be
> anything at all.

Of course I realize that.

Most SSD devices, including ones that don't speak normal S-ATA/SCSI/etc.,
have a write cache and will combine and reorder IOs.

Some of them have non-volatile write caches, and those don't need barriers
(flush, FUA, whatever) because of batteries, capacitors or other magic that
hardware people came up with.

For the ones that do have a volatile write cache and can reorder IOs,
transactions will still need the ordering primitives to survive a power failure
reliably.
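
A minimal sketch, not from the original message, of why ordering matters with a volatile write cache. It is a toy model under stated assumptions: the cache may persist any subset of its pending writes before power is lost, and a flush forces everything pending onto stable media. The names `crash_states` and `torn_commit_possible`, and the block labels "journal" and "commit", are hypothetical.

```python
from itertools import combinations


def crash_states(ops):
    """Return every set of blocks that could be on stable media after a
    power failure at any point while running ops.

    ops is a list of ("write", name) or ("flush",) tuples. Assumption of
    this toy model: the volatile cache may persist ANY subset of its
    pending writes, in any order, before power is lost.
    """
    durable = set()   # blocks known to be on stable media
    cached = []       # blocks still sitting in the volatile write cache
    states = set()

    def record():
        # Power can fail now: any subset of the cached writes may have landed.
        for r in range(len(cached) + 1):
            for subset in combinations(cached, r):
                states.add(frozenset(durable | set(subset)))

    record()
    for op in ops:
        if op[0] == "write":
            cached.append(op[1])
        elif op[0] == "flush":
            durable.update(cached)
            cached.clear()
        record()
    return states


def torn_commit_possible(ops):
    """True if some crash state has the commit record but not the journal data."""
    return any("commit" in s and "journal" not in s for s in crash_states(ops))


unordered = [("write", "journal"), ("write", "commit")]
ordered = [("write", "journal"), ("flush",), ("write", "commit"), ("flush",)]

print(torn_commit_possible(unordered))  # True
print(torn_commit_possible(ordered))    # False
```

In the unordered case the commit record can reach the media ahead of the journal block it describes; the flush between the two writes removes that state from the set of possible outcomes.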

If you don't need or want to pay the price of ordering, you can easily disable
it today by mounting without barriers.

As Mark pointed out, most S-ATA/SAS drives will flush the write cache when they
see a bus reset, so even without barriers the contents of the cache will be
preserved (or flushed) across a reboot or panic. Power outages are the problem
that barriers/flushes are meant to help with.
>
> Your filesystem doesn't know. It damn well should not even _try_ to know,
> because it isn't the low-level driver.
>
> The low-level driver - which you don't have a friggin clue about - may say
> that it doesn't support barrier IO for any random reason that has
> absolutely _nothing_ to do with any write caches or anything else. Maybe
> the device has the same ordering semantics as an Intel CPU has: writes are
> always seen in order on the disk, and reads are always speculated but will
> snoop in write buffers, and there is no way to not do that.
>
> See? EOPNOTSUPP means just that - it means that the driver doesn't support
> the notion of ordered IO. But that does not necessarily mean that the
> writes aren't always in order. It may well just mean that the drive is a
> thin shimmy layer over something else (for example, just a user level
> pipe), and the driver has NO IDEA what the end result is, and the protocol
> is simplistic and is just 'read' and 'write' and absolutely nothing else.
>
> But you seem to NOT UNDERSTAND THIS.
>
> I'm not interested in your inane drivel. Let's just say that your lack of
> understanding just means that your input is irrelevant, and leave it at
> that. Ok? Until you can see the bigger picture, just don't bother.
>
> Linus

If the low-level device returns EOPNOTSUPP on a barrier op, that is fine.
Running a transactional file system on that storage might or might not be a good
idea, but at least we can log that and move on.
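
A minimal sketch, not from the original message, of the "log that and move on" behaviour: try a barrier write, and if the device answers EOPNOTSUPP, log once and fall back to plain writes. `NoBarrierDevice` and `ToyJournal` are hypothetical stand-ins, not kernel interfaces; only `errno.EOPNOTSUPP` and the `logging` module are real.

```python
import errno
import logging

log = logging.getLogger("toyfs")


class NoBarrierDevice:
    """Hypothetical device whose driver rejects barrier requests."""

    def __init__(self):
        self.blocks = []

    def write(self, block, barrier=False):
        if barrier:
            raise OSError(errno.EOPNOTSUPP, "barrier not supported")
        self.blocks.append(block)


class ToyJournal:
    """Hypothetical journal that prefers barrier writes but tolerates their absence."""

    def __init__(self, dev):
        self.dev = dev
        self.use_barriers = True

    def commit(self, block):
        if self.use_barriers:
            try:
                self.dev.write(block, barrier=True)
                return
            except OSError as e:
                if e.errno != errno.EOPNOTSUPP:
                    raise  # a real I/O error is not something to paper over
                log.warning("barrier write not supported, disabling barriers")
                self.use_barriers = False
        # Fallback: plain write, with no ordering guarantee from the device.
        self.dev.write(block)
```

The fallback makes no claim about whether the device orders writes on its own; it only records that the driver could not express the request.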

I agree with Chris that what happens when the device does not support the
primitives is not the core issue.

The question is really what we do when you have a storage device in your box
with a volatile write cache that does support flush or FUA or similar. Using
barriers & ordered transactions for these types of devices will give you a more
reliable file system: less fsck time needed and better data integrity support
for the (few?) applications that use fsync properly.

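
A minimal sketch, not from the original message, of one common reading of "using fsync properly" from an application: write a temporary file, fsync it, rename it over the target, then fsync the containing directory. It assumes a POSIX system where a directory can be opened and fsynced; the helper name `durable_replace` and the `.tmp` suffix are hypothetical. Whether the data actually survives a power failure still depends on the file system and device honouring the flush, which is the subject of this thread.

```python
import os


def durable_replace(path, data):
    """Atomically replace path with data (bytes) and push it to stable storage."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            n = os.write(fd, view)   # os.write may write fewer bytes than asked
            view = view[n:]
        os.fsync(fd)                 # file contents reach the device first
    finally:
        os.close(fd)

    os.replace(tmp, path)            # atomic rename over the old file

    # fsync the directory so the rename itself is persisted (POSIX assumption).
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Skipping either fsync leaves a window in which a crash can expose an empty or missing file even though the write call returned successfully.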
Ric