Message-ID: <49D239A0.5080405@redhat.com>
Date:	Tue, 31 Mar 2009 11:41:20 -0400
From:	Ric Wheeler <rwheeler@...hat.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
CC:	Jens Axboe <jens.axboe@...cle.com>,
	Fernando Luis Vázquez Cao 
	<fernando@....ntt.co.jp>, Jeff Garzik <jeff@...zik.org>,
	Christoph Hellwig <hch@...radead.org>,
	Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
	Alan Cox <alan@...rguk.ukuu.org.uk>,
	Arjan van de Ven <arjan@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
	Jesper Krogh <jesper@...gh.cc>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	chris.mason@...cle.com, david@...morbit.com, tj@...nel.org
Subject: Re: [PATCH 1/7] block: Add block_flush_device()

Linus Torvalds wrote:
> 
> On Tue, 31 Mar 2009, Ric Wheeler wrote:
>> Now you are just being silly. The drive and the write cache - without barriers
>> or similar tagged operations - will almost certainly reorder all of the IO's
>> internally.
> 
> You do realize that the "drive" may not be a drive at all?
> 
> But apparently you don't. You really seem to see just your own case, and 
> have blinders on for everything else.
> 
> That "drive" may be some virtualized device. It may be some super-fancy 
> memory mapped and largely undocumented random flash thing. It might be a 
> network block device, it may be somebody's IO trace dummy layer, it may be 
> anything at all.

Of course I realize that.

Most SSD devices, including ones that don't speak normal S-ATA/SCSI/etc, 
have a write cache and will combine and re-order IOs.

Some of them have non-volatile write caches and those don't need barriers 
(flush, FUA, whatever) because of batteries, capacitors or other magic hardware 
people came up with.

For the ones that do have a volatile write cache and can reorder IO's, 
transactions will still need the ordering primitives to survive a power failure 
reliably.
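
To make the ordering requirement concrete, here is a rough userspace analogy, not kernel code. The names `write_all` and `journal_commit` and the "COMMIT" record format are made up for illustration; `fdatasync()` stands in for whatever flush/barrier primitive the storage stack provides. The point is only that the commit record must not reach stable storage before the blocks it covers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all bytes are accepted. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical journal commit: the transaction blocks go first, then a
 * flush, and only then the commit record, followed by a second flush.
 * Without the first flush, a reordering write cache could persist the
 * commit record ahead of the blocks it claims are valid.
 */
static int journal_commit(int fd, const void *blocks, size_t len)
{
    static const char commit[] = "COMMIT\n";

    assert(blocks != NULL);

    if (write_all(fd, blocks, len) != 0)
        return -1;
    if (fdatasync(fd) != 0)     /* ordering point 1: blocks are stable */
        return -1;
    if (write_all(fd, commit, sizeof(commit) - 1) != 0)
        return -1;
    return fdatasync(fd);       /* ordering point 2: commit is stable */
}
```

Whether the two `fdatasync()` calls really reach the media depends on the file system and device honoring the flush, which is exactly the case under discussion.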

If you don't need or want to pay the price of ordering, you can today easily 
disable this by mounting without barriers.

As Mark pointed out, most S-ATA/SAS drives will flush the write cache when they 
see a bus reset, so even without barriers, the cached data will be preserved (or 
flushed) after a reboot or panic.  Power outages are the problem 
barriers/flushes are meant to help with.

> 
> Your filesystem doesn't know. It damn well not even _try_ to know, because 
> it isn't the low-level driver.
> 
> The low-level driver - which you don't have a friggin clue about - may say 
> that it doesn't support barrier IO for any random reason that has 
> absolutely _nothing_ to do with any write caches or anything else. Maybe 
> the device has the same ordering semantics as an Intel CPU has: writes are 
> always seen in order on the disk, and reads are always speculated but will 
> snoop in write buffers, and there is no way to not do that.
> 
> See? EOPNOTSUPP means just that - it means that the driver doesn't support 
> the notion of ordered IO. But that does not necessarily mean that the 
> writes aren't always in order. It may well just mean that the drive is a 
> thin shimmy layer over something else (for example, just a user level 
> pipe), and the driver has NO IDEA what the end result is, and the protocol 
> is simplistic and is just 'read' and 'write' and absolutely nothing else.
> 
> But you seem to NOT UNDERSTAND THIS.
> 
> I'm not interested in your inane drivel. Let's just say that your lack of 
> understanding just means that your input is irrelevant, and leave it at 
> that. Ok? Until you can see the bigger picture, just don't bother.
> 
> 			Linus


If the low level device returns EOPNOTSUPP on a barrier op, that is fine. 
Running a transactional file system on that storage might or might not be a good 
idea, but at least we can log that and move on.
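
As a sketch of the "log that and move on" behavior, assuming a made-up `struct fs_state` and a made-up `barrier_fn` device hook (neither is a real kernel interface), the handling could look roughly like this:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-filesystem state; not a real kernel structure. */
struct fs_state {
    bool barriers_enabled;  /* do we still issue barrier requests? */
    int  warnings_logged;   /* how many times we warned the admin */
};

/* Hypothetical device hook: returns 0 or a negative errno. */
typedef int (*barrier_fn)(void);

/*
 * Issue a barrier for a transaction commit. If the device reports that
 * it does not support barriers, log that once, stop asking, and carry
 * on. Any other error is passed back to the caller unchanged.
 */
static int commit_barrier(struct fs_state *fs, barrier_fn issue_barrier)
{
    assert(fs != NULL && issue_barrier != NULL);

    if (!fs->barriers_enabled)
        return 0;           /* already running without barriers */

    int err = issue_barrier();
    if (err == -EOPNOTSUPP) {
        fprintf(stderr, "device does not support barriers, disabling\n");
        fs->barriers_enabled = false;
        fs->warnings_logged++;
        return 0;           /* not fatal: log it and move on */
    }
    return err;
}

/* Two stand-in devices for demonstration only. */
static int dev_without_barriers(void) { return -EOPNOTSUPP; }
static int dev_with_barriers(void)    { return 0; }
```

Calling `commit_barrier()` with `dev_without_barriers` succeeds, clears `barriers_enabled`, and logs a single warning; later calls return immediately without touching the device. Note that the code draws no conclusion about whether writes are ordered on such a device, only that the request is unsupported.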

I agree with Chris that what happens when the device does not support the 
primitives is not the core issue.

The question is really what we do when you have a storage device in your box 
with a volatile write cache that does support flush or fua or similar. Using 
barriers & ordered transactions for these types of devices will give you a more 
reliable file system - less fsck time needed and better data integrity support 
for the (few?) applications that use fsync properly.
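
For reference, one common reading of "using fsync properly" is the write, fsync, rename, fsync-the-directory pattern. The sketch below uses only standard POSIX calls; the function name `durable_replace` and its parameters are made up for illustration, and the temporary file is assumed to live on the same file system as the target.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace `path` with `data` so that after a crash
 * the file holds either the old or the new contents, never a mix.
 * `tmp_path` must be in `dir_path`, the directory that contains `path`.
 */
static int durable_replace(const char *dir_path, const char *tmp_path,
                           const char *path, const char *data)
{
    assert(dir_path && tmp_path && path && data);

    size_t len = strlen(data);
    size_t off = 0;
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short; loop until the kernel has all the data. */
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }

    /* Push the file data to stable storage before it becomes visible. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Atomically swap the new file into place. */
    if (rename(tmp_path, path) != 0)
        return -1;

    /* Make the rename itself durable by syncing the directory. */
    int dfd = open(dir_path, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

On a device with a volatile write cache, the `fsync()` calls give the intended guarantee only if the file system actually issues a flush or barrier underneath them.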


Ric
