Message-ID: <20140217191901.GC16660@thunk.org>
Date:	Mon, 17 Feb 2014 14:19:01 -0500
From:	Theodore Ts'o <tytso@....edu>
To:	"Martin K. Petersen" <martin.petersen@...cle.com>
Cc:	linux-fsdevel@...r.kernel.org, Jens Axboe <axboe@...nel.dk>,
	linux-ext4@...r.kernel.org
Subject: Re: [PATCH RFC] block: use discard if possible in
 blkdev_issue_discard()

On Mon, Feb 17, 2014 at 11:44:27AM -0500, Martin K. Petersen wrote:
> Ted> Basically, who was practicing engineering malpractice?  The SSD
> Ted> vendors, or the T10/T13 spec authors?
> 
> I think it's important to emphasize that T10/T13 specs are mainly
> written by device vendors. And they have a very strong objection to
> complicating the device firmware, keeping internal state, etc. So the
> outcome is very rarely in the operating system's favor. I completely
> agree that these flags are broken by definition.

Sigh...

One of the reasons this came up is cloud hosting services, where the
disk is emulated.  Since you are trying to do something
cheap-cheap-cheap (for example, OpenShift from Red Hat has a very
generous free guests policy), it's likely that you're using something
like qcow2, or thinp, or something similar to emulate disks and drive
storage costs down.  So anything we can do to eliminate I/O work at
the Host OS layer is going to be really visible, and this includes
replacing zero-block writes with the equivalent of punch or TRIM w/
RZAT.

> The only discard approach that provides a guaranteed result is WRITE
> SAME with the UNMAP bit set (i.e. SCSI only).

So currently blkdev_issue_zeroout() will do the WRITE SAME, but it
doesn't set the UNMAP bit, correct?  I understand there will be
environments where performance is more important than cost, where it
may not be a good idea to set the UNMAP bit.  So it sounds like what
we should do is add a flag which controls whether or not to use TRIM
w/ RZAT or WRITE SAME with the UNMAP bit set.
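
To make the flag idea concrete, here is a rough userspace sketch of the
selection logic.  Everything named here (ZEROOUT_F_ALLOW_UNMAP, struct
dev_caps, pick_zero_method) is hypothetical and is not an existing
kernel interface; it only illustrates that unmapping is opt-in and that
we fall back to a plain WRITE SAME or to writing zeroes otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical caller flag: "I prefer freed space over preallocation". */
#define ZEROOUT_F_ALLOW_UNMAP 0x1u

enum zero_method {
    ZERO_WRITE_ZEROES,      /* plain writes of zero-filled buffers */
    ZERO_WRITE_SAME,        /* WRITE SAME, blocks stay allocated */
    ZERO_WRITE_SAME_UNMAP,  /* WRITE SAME with the UNMAP bit set */
    ZERO_TRIM_RZAT          /* TRIM on a device trusted to return zeroes */
};

/* Hypothetical summary of what the device claims to support. */
struct dev_caps {
    bool write_same;
    bool write_same_unmap;
    bool trim_rzat_trusted;
};

enum zero_method pick_zero_method(const struct dev_caps *caps,
                                  unsigned int flags)
{
    assert(caps != NULL);

    /* Only deallocate when the caller explicitly allowed it. */
    if (flags & ZEROOUT_F_ALLOW_UNMAP) {
        if (caps->write_same_unmap)
            return ZERO_WRITE_SAME_UNMAP;
        if (caps->trim_rzat_trusted)
            return ZERO_TRIM_RZAT;
    }
    if (caps->write_same)
        return ZERO_WRITE_SAME;
    return ZERO_WRITE_ZEROES;
}
```

A performance-sensitive caller would pass 0 and keep its blocks
allocated; a cost-sensitive one would pass ZEROOUT_F_ALLOW_UNMAP.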

We'll then also need to work with the KVM folks to make sure that
WRITE SAME w/ UNMAP gets plumbed through to the KVM userspace, which
can then do something like FALLOC_FL_PUNCH_HOLE if it is using a raw
sparse image, or the equivalent in qcow2, etc.
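
For the raw sparse image case, the hole punch on Linux is fallocate(2)
with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE.  A minimal sketch of
what the userspace side might do on receiving such a request follows;
the helper name punch_range is made up for illustration, and this is
not taken from any actual KVM/QEMU code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Deallocate [offset, offset + len) in the file backing a raw image.
 * The range reads back as zeroes afterwards and the file size is
 * unchanged.  Returns 0 on success, -1 with errno set on failure
 * (for example EOPNOTSUPP if the filesystem cannot punch holes).
 */
int punch_range(int fd, off_t offset, off_t len)
{
    assert(offset >= 0 && len >= 0);

    if (len == 0)
        return 0;   /* nothing to do; fallocate() rejects len == 0 */

    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

If the call fails with EOPNOTSUPP, the caller would have to fall back to
writing zeroes explicitly.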

(If the KVM folks want to be even more aggressive, and they know they
are using an underlying storage system where keeping the allocated
blocks isn't really going to help performance, they might want to
unmap the block(s) anyway whenever the data block is all zeros, even
if the UNMAP bit isn't set.  Or we could leave this up to the Guest OS
userspace, and plumb a hint from the Host to the Guest that it should
really use WRITE SAME w/ UNMAP.  But I'm not convinced it's worth it.)
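
The "more aggressive" variant boils down to a zero check plus a policy
decision.  A sketch, with hypothetical names (block_is_all_zero,
should_unmap, backing_is_thin) that do not correspond to any existing
interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if every byte of the buffer is zero. */
bool block_is_all_zero(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return false;
    return true;
}

/*
 * Hypothetical host-side policy: unmap a zero block if the guest asked
 * for it (UNMAP bit set), or if the host knows that keeping blocks
 * allocated on its backing store buys nothing.
 */
bool should_unmap(const unsigned char *buf, size_t len,
                  bool unmap_requested, bool backing_is_thin)
{
    if (!block_is_all_zero(buf, len))
        return false;
    return unmap_requested || backing_is_thin;
}
```

A real implementation would scan a word at a time rather than a byte at
a time; the byte loop is only here to keep the sketch obvious.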

Does this sound like a reasonable way to go?

> The good news is that most devices that report DRAT/RZAT are doing the
> right thing due to server/RAID vendor pressure.   But SSD vendors are
> generally not willing to give such guarantees in the datasheets.

I imagine the reason why they aren't willing to give such guarantees
is that it would cost more to do the testing to assure this, and while
they know that a certain firmware version shipped to $BIG_HDD_CUSTOMER
does the right thing, it might regress without their knowing about it
in some future firmware version.

On the other hand, if there was a white list kept somewhere, either in
the kernel, or in some more dynamically updated list (a la what
smartctl does to get the latest vendor-specific attributes), being on
the white list might be enough of a commercial advantage that drive
vendors would be incentivized to provide such a guarantee.  Especially
if, say, a major SSD vendor such as Intel could be induced to make
such a public guarantee and we publicized this fact.
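
A white list of this kind could be as simple as a table keyed on model
and firmware revision, so that a firmware regression only invalidates
one entry.  The sketch below is purely illustrative: the model and
firmware strings are made-up placeholders, not real drives, and
rzat_is_trusted is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct rzat_entry {
    const char *model;      /* exact model string */
    const char *firmware;   /* exact firmware revision, NULL = any */
};

/* Placeholder entries only; these are not real devices. */
static const struct rzat_entry rzat_whitelist[] = {
    { "EXAMPLE-SSD-A", "1.02" },  /* guaranteed for one revision only */
    { "EXAMPLE-SSD-B", NULL },    /* vendor guarantees all revisions */
};

/* True if this model/firmware pair is listed as returning zeroes after TRIM. */
bool rzat_is_trusted(const char *model, const char *firmware)
{
    assert(model != NULL && firmware != NULL);

    for (size_t i = 0;
         i < sizeof rzat_whitelist / sizeof rzat_whitelist[0]; i++) {
        const struct rzat_entry *e = &rzat_whitelist[i];

        if (strcmp(e->model, model) != 0)
            continue;
        if (e->firmware == NULL || strcmp(e->firmware, firmware) == 0)
            return true;
    }
    return false;
}
```

Anything not on the list is treated as untrusted, which matches the
default of not relying on RZAT at all.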

						- Ted