linux-kernel - Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

Message-ID: <20121208040435.GB15784@dastard>
Date:	Sat, 8 Dec 2012 15:04:35 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Ric Wheeler <rwheeler@...hat.com>, Theodore Ts'o <tytso@....edu>,
	Chris Mason <chris.mason@...ionio.com>,
	Chris Mason <clmason@...ionio.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Ingo Molnar <mingo@...nel.org>,
	Christoph Hellwig <hch@...radead.org>,
	Martin Steigerwald <Martin@...htvoll.de>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	linux-fsdevel <linux-fsdevel@...r.kernel.org>
Subject: Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate
 UAPI

On Fri, Dec 07, 2012 at 06:52:51PM -0800, Joel Becker wrote:
> On Sat, Dec 08, 2012 at 11:39:36AM +1100, Dave Chinner wrote:
> > On Fri, Dec 07, 2012 at 05:02:32PM -0500, Ric Wheeler wrote:
> > > On 12/07/2012 04:57 PM, Theodore Ts'o wrote:
> > > >On Fri, Dec 07, 2012 at 04:42:06PM -0500, Ric Wheeler wrote:
> > > >>The other things that I think we should try would be to convert over
> > > >>larger chunks as we discussed on the list back in the summer (just
> > > >>because the user writes 4KB does not mean that we cannot flip over
> > > >>1MB and zero that).
> > > >Writing a megabyte is not free.  If you assume that your HDD has a
> > > >sustained write throughput of 100-125 MB/s, writing a megabyte will
> > > >take 8-10ms.  It might be a win if you amortize it over a large number
> > > >of writes, but it doesn't help your 99.9 percentile latency numbers.
> > > >(99.9 percentile latency numbers matters because eventually you'll
> > > >have a user request which hits multiple serial long latency
> > > >operations, and then the delay looks **really** user visible.)
> > > >
> > > >	    	     	       	     		- Ted
> > > 
> > > Writing 4KB at a time to a disk cost XX units of time.
> > > 
> > > Writing to the same sector (especially for a HDD), cost XX units + a small amount.
> > > 
> > > I suggest that we try it out.
> > > 
> > > For SSD's, much better to use specific HW offload commands if
> > > possible like WRITE_SAME (zeroed) or UNMAP/TRIM to get that
> > > performance boost since no actual data is moved...
> > 
> > Yup, that could be done quite trivially in XFS. Just mark the
> > preallocated extents as "busy" rather than unwritten, mark the
> > transaction as synchronous and the transaction commit will issue a
> > discard on the preallocated ranges before returning to userspace.
> > The extra overhead to the preallocation command is unlikely to be
> > noticed, and unwritten extent conversion overhead just goes away...
> > 
> > No fallocate() API changes necessary, though I think it would be
> > better if the user application gave a hint that it preferred "writing
> > zeros" (i.e. FALLOC_FL_WRITE_ZEROS) to allocating unwritten extents
> > as there are workloads where one will always be clearly better than
> > the other...
> 
> 	Wait, I missed something.  We're letting fallocate be dumb?
> Let's not do that, then.

No, not at all. Read again. There are workloads where explicitly
using unwritten extents is the best thing to do. For others,
zeroing rather than using unwritten extents may be better. All I
suggested was an additional flag that allows applications to tell
the filesystem to preallocate zeroed space, but to do it by writing
zeros rather than by allocating unwritten extents.
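
To make the two behaviours concrete from the application side, here is
a minimal userspace sketch. It assumes Linux with glibc. The enum and
the prealloc() helper are hypothetical names used only for
illustration; the zeroing path emulates the proposed
FALLOC_FL_WRITE_ZEROS behaviour with plain pwrite() calls and does not
pass any such flag to the kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical preference an application might express. */
enum prealloc_pref {
    PREALLOC_UNWRITTEN, /* cheap now, conversion cost paid at write time */
    PREALLOC_ZEROED     /* zeroing cost paid up front */
};

/* Returns 0 on success or a negative errno value on failure. */
static int prealloc(int fd, off_t off, off_t len, enum prealloc_pref pref)
{
    static const char zeros[64 * 1024]; /* zero-initialised buffer */

    if (pref == PREALLOC_UNWRITTEN) {
        /* Mode 0: allocate space; filesystems typically use
         * unwritten extents for this. */
        return fallocate(fd, 0, off, len) == 0 ? 0 : -errno;
    }

    /* Emulate "preallocate by writing zeros" in userspace. */
    while (len > 0) {
        size_t n = len < (off_t)sizeof(zeros) ? (size_t)len : sizeof(zeros);
        ssize_t w = pwrite(fd, zeros, n, off);

        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        off += w;
        len -= w;
    }

    /* Push the zeros to stable storage so the cost really is paid now. */
    return fdatasync(fd) == 0 ? 0 : -errno;
}
```

With a flag like the one suggested above, the second branch would
collapse into a single fallocate() call and the filesystem would be
free to pick the cheapest way of producing zeroed space.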

> 	Over in ocfs2-land, we CoW in 1MB hunks.  That's the entire
> extent if it is 1MB or less, or some MB multiple if it is large enough
> to slice it.  This is for very similar reasons to unwritten clearing,
> with the added benefit of less fragmentation from CoW.
> 	On spinning media, any read/write of up to 1MB is roughly about
> the same penalty as reading/writing a sector.  You're already paying the
> seek.

Sure, but that's filesystem implementation details, and something
that I don't care about. Every filesystem implements preallocation
via fallocate in a similar manner (i.e. all have copied XFS's
unwritten extents technique) but that doesn't mean it's optimal for
every workload that needs preallocation.

The deficiencies of ext4's unwritten extent implementation are what
Ted has been trying to address by exposing stale data, rather than
looking at the problem as "is there a better way to preallocate for
this workload?"

That's where:

> On SSD, WRITE_SAME is *way* better than leaking data.

TRIM/WRITE_SAME can be used. It's way faster than actually writing
zeros, but from the user's perspective the result looks exactly as if
zeros had been written. IOWs, a filesystem can implement the
FALLOC_FL_WRITE_ZEROS method using this hardware offload, can just be
dumb and write zeros through the page cache, or can ignore the flag
altogether and just use unwritten extents anyway...

> 	At the end of the day, you have to pay for zeroing.  You can do
> it up front, or you can do it at write time.

And the application should be able to tell us which it prefers....

....

> 	We should not be leaking data so that we can be lazy.

You're in violent agreement with me about that.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com