Date:	Tue, 11 Dec 2012 12:16:03 +0000
From:	Steven Whitehouse <swhiteho@...hat.com>
To:	"Theodore Ts'o" <tytso@....edu>
Cc:	Dave Chinner <david@...morbit.com>,
	Chris Mason <chris.mason@...ionio.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ric Wheeler <rwheeler@...hat.com>,
	Ingo Molnar <mingo@...nel.org>,
	Christoph Hellwig <hch@...radead.org>,
	Martin Steigerwald <Martin@...htvoll.de>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to
 fallocate UAPI

Hi,

On Mon, 2012-12-10 at 13:20 -0500, Theodore Ts'o wrote:
> A sentence or two got chopped out during an editing pass.  Let me try
> that again so it's a bit clearer what I was trying to say....
> 
> Sure, but if the block device supports WRITE_SAME or persistent
> discard, then presumably fallocate() should do this automatically all
> the time, and not require a flag to request this behavior.  The only
> reason why you might not is if the WRITE_SAME is more costly; that is,
> when a seek plus writing 1MB consumes a larger fraction of the disk's
> time than a seek plus writing 4k or 32k would.
> 
Well, there are two cases here, I think....

One is the GFS2-type case, where the metadata doesn't support "these
blocks are allocated but zero", so for all fallocate requests we must
zero out the blocks at fallocate time to avoid exposing stale data to
userspace.

The advantage over dd from userspace in this case is, firstly, that
avoiding the copy from userspace should make it faster. Also, the use of
sb_issue_zeroout means that block devices which don't need an explicit
block of zeros written to them should be able to do this faster; however,
that is implemented at the block layer, and the fs shouldn't need to care
how it is implemented. In the case of GFS2, we implemented fallocate
because it was useful to be able to allocate beyond the end of file
without changing the file size. This helped us fix a bug in our fs grow
code, so performance was a secondary (but welcome!) consideration.
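
As a rough sketch of what this means in practice (illustrative only, not
the actual GFS2 code; the helper name is made up for the example), the
zeroing done at fallocate time comes down to handing the newly allocated
range to the block layer:

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/gfp.h>

/*
 * Zero a freshly allocated range before it becomes visible, so stale
 * device contents are never exposed to userspace.  sb_issue_zeroout()
 * converts the filesystem-block range to sectors and lets the block
 * layer use whatever the device offers (e.g. WRITE_SAME) rather than
 * writing explicit buffers of zeros.
 */
static int example_zero_new_blocks(struct super_block *sb,
                                   sector_t fs_block, sector_t nr_blocks)
{
        return sb_issue_zeroout(sb, fs_block, nr_blocks, GFP_NOFS);
}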

The other case is the ext4/XFS-type case, where the metadata does
support "these blocks are allocated but zero", which means that the
metadata needs to be changed twice: once to "these blocks are allocated
but zero" at fallocate time, and again to "these blocks have valid
content" at write time. As I understand it, this second metadata change
is what is causing the performance issue.
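
To make the two-step change concrete, here is a purely conceptual sketch
(these are not ext4 or XFS data structures, just named states for the
purpose of illustration):

/*
 * Conceptual only: the same extent is updated on disk once per
 * transition below; the second transition happens in the write path,
 * which is where the extra cost shows up.
 */
enum example_extent_state {
        EXTENT_HOLE,       /* nothing allocated yet                    */
        EXTENT_UNWRITTEN,  /* allocated but zero -- set at fallocate() */
        EXTENT_WRITTEN,    /* valid contents -- set at write time      */
};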

> Ext4 currently uses a threshold of 32k for this break point (below
> that, we will use sb_issue_zeroout; above that, we will break apart an
> uninitialized extent when writing into a preallocated region).  It may
> be that 32k is too low, especially for certain types of devices (i.e.,
> SSDs versus RAID 5, where it should be aligned on a RAID stripe,
> etc.).  More of an issue might be that there will be some disagreement
> about whether people want the system to automatically tune for
> average throughput vs 99.9 percentile latency.
> 
> Regardless, this is actually something which I think the file system
> should try to do automatically if at all possible, via some kind of
> auto-tuning heuristic, instead of using an explicit fallocate(2) flag.
> (See, I don't propose using a new fallocate flag for everything.  :-)
> 
>       	      	      	      - Ted
> 

It sounds like it might well be worth experimenting with the thresholds
as you suggest; 32k is really pretty small. I guess that the real
question here is the cost of the metadata change (to record what is
written and what remains unwritten) vs. simply zeroing out the unwritten
blocks in the extent when the write occurs.

There are likely to be a number of factors affecting that, and the
answer doesn't appear straightforward.
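
For what it's worth, that sort of experiment could start from a small
userspace test along these lines (a sketch only: the file name, the
single 32k size and the single timing are arbitrary choices, and one
would want to repeat it across devices, I/O sizes and filesystems). It
preallocates a range with fallocate(2) while keeping the file size, then
times a write plus fsync into it, which can then be compared against
writing into a region that was zeroed up front:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const off_t len = 32 * 1024;  /* the 32k threshold under discussion */
        char *buf = calloc(1, len);
        struct timespec t0, t1;
        int fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);

        if (fd < 0 || !buf)
                return 1;

        /* Allocate the range without changing the file size. */
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) != 0)
                perror("fallocate");

        /* Time the write (and flush) into the preallocated range. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pwrite(fd, buf, len, 0) != len)
                perror("pwrite");
        fsync(fd);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("write+fsync into preallocated range: %ld us\n",
               (t1.tv_sec - t0.tv_sec) * 1000000L +
               (t1.tv_nsec - t0.tv_nsec) / 1000L);

        free(buf);
        close(fd);
        return 0;
}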

Steve.

