Date:	Thu, 31 Mar 2016 10:17:00 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Kent Overstreet <kent.overstreet@...il.com>
Cc:	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	Al Viro <viro@...iv.linux.org.uk>,
	Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Theodore Ts'o <tytso@....edu>
Subject: Re: fallocate INSERT_RANGE/COLLAPSE_RANGE is completely broken
 [PATCH]

On Tue, Mar 29, 2016 at 03:46:45PM -0800, Kent Overstreet wrote:
> On Wed, Mar 30, 2016 at 09:37:43AM +1100, Dave Chinner wrote:
> > There are two locks the XFS_IOLOCK for read/write/splice IO path vs
> > truncate/fallocate exclusion, and XFS_MMAPLOCK for page fault vs
> > truncate/fallocate exclusion.
> 
> Ok, that makes sense then.
> 
> > > ext4 uses the generic code in all the places you're hooking into though -
> > > .fault, .read_iter, etc.
> > > 
> > > The scheme I've got in this patch should perform quite a bit better than what
> > > you're doing - only locking in the slow cache miss path, vs. every time you
> > > touch the page cache.
> > 
> > I'm not sure I follow - how does this work when a fallocate
> > operation uses the page cache for, say, zeroing data blocks rather
> > than invalidating them (e.g. FALLOC_FL_ZERO_RANGE can validly zero
> > blocks through the page cache, and so can hole punching)?  Won't the
> > buffered read then return a mix of real and zeroed data, depending
> > on who wins the race to each underlying page lock?
> 
> Yes, but does that matter?

IMO, yes. Do we really want a read returning partial data that might
be zeros or even data from different offsets in the file as a shift
is in progress?

> I don't know of any requirement that the entire
> fallocate be atomic, and writes > PAGE_SIZE aren't generally atomic wrt reads
> either (but looking at the code it looks like XFS is special there too...)

POSIX says writes should be atomic w.r.t. other data accesses,
i.e. a concurrent overlapping read should not see data that has
been partially written. And, yes, XFS is the only Linux filesystem
that actually complies with that requirement - page-based locking
does not provide this.
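
To make that concrete, here's a minimal userspace probe (the file
name, sizes and thread setup are arbitrary, just for illustration):
one thread keeps overwriting a multi-page region with a single byte
value, the other reads the same region back and checks whether it
ever sees a mix of values. With page-based locking the check can
trip; with whole-file write vs read exclusion it should not.

/* torn-read probe - purely illustrative, not from any real
 * filesystem test suite */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LEN (4 * 4096)          /* spans several page locks */
static int fd;

static void *writer(void *arg)
{
        unsigned char buf[LEN];

        for (unsigned char v = 0; ; v++) {
                memset(buf, v, LEN);
                if (pwrite(fd, buf, LEN, 0) != LEN)
                        perror("pwrite");
        }
        return NULL;
}

int main(void)
{
        unsigned char buf[LEN];
        pthread_t t;

        fd = open("probe.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(buf, 0, LEN);
        pwrite(fd, buf, LEN, 0);                /* initial contents */
        pthread_create(&t, NULL, writer, NULL);

        for (;;) {
                if (pread(fd, buf, LEN, 0) != LEN) {
                        perror("pread");
                        return 1;
                }
                /* a torn read contains more than one distinct value */
                for (int i = 1; i < LEN; i++) {
                        if (buf[i] != buf[0]) {
                                printf("torn read: %d vs %d at offset %d\n",
                                       buf[0], buf[i], i);
                                return 0;
                        }
                }
        }
}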

Even though there is no standard defining fallocate's behaviour, it
should behave the same way with respect to user data accesses. Reads
that return partial data from an fallocate operation are just as bad
as reads that return partial data from a write. Imagine if we
treated truncate like that...

And therein lies the issue - much of what fallocate does is
equivalent to truncate: it can extend files, it can remove data
from files, and it can move data around in files. These operations
need to be atomic w.r.t. other user data accesses precisely because
of the way they manipulate user data in the file.
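
For reference, all of those operations go through the same
fallocate(2) call - something like this (offsets and lengths are
arbitrary block-size multiples, error handling omitted):

/* the fallocate(2) modes under discussion; offsets/lengths are
 * arbitrary examples, return values ignored for brevity */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

void fallocate_examples(int fd)
{
        /* extend: allocate blocks and grow the file out to 1MB */
        fallocate(fd, 0, 0, 1 << 20);

        /* remove data: punch a hole, or zero a range in place */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 65536);
        fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 65536);

        /* move data around: collapse a range out of the file, or
         * insert a hole that shifts existing data upwards */
        fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 65536);
        fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, 65536);
}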

And there are other, implementation-based reasons, too. For example:
we cannot modify the underlying storage mappings of data while there
is IO in progress over that range, because the moment we free a
block it could be reallocated to something else. Hence if there is
IO in flight while we are freeing blocks, we have a potential
use-after-free race condition. Truncate has this exact same issue,
and it really has nothing to do with page cache level interactions.

IOWs, when we directly manipulate the block mapping of an inode
outside of the IO path, we have to a) stop data IO submission and
page faults, b) wait for existing IO to drain, c) invalidate cached
data over the range to be modified, d) modify the block mapping
of the inode, e) commit the change, and finally f) allow IO to
restart.
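
Schematically (every function below is a stand-in stub to show the
ordering, not a real kernel interface):

/* ordering sketch only - all helpers are hypothetical stubs */
#include <stdio.h>

static void block_io_and_page_faults(void) { puts("a) block new IO and page faults"); }
static void wait_for_io_to_drain(void)     { puts("b) wait for in-flight IO, incl. direct IO"); }
static void invalidate_cached_range(void)  { puts("c) invalidate cached data over the range"); }
static void modify_block_mapping(void)     { puts("d) modify the inode's block mapping"); }
static void commit_change(void)            { puts("e) commit the change"); }
static void allow_io_to_restart(void)      { puts("f) unblock IO and page faults"); }

int main(void)
{
        block_io_and_page_faults();     /* a) */
        wait_for_io_to_drain();         /* b) skipping this is the
                                           use-after-free: freed blocks can
                                           be reallocated while old IO still
                                           targets them */
        invalidate_cached_range();      /* c) */
        modify_block_mapping();         /* d) */
        commit_change();                /* e) */
        allow_io_to_restart();          /* f) */
        return 0;
}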

AFAICT, your patches do not completely address a) because they only
block page cache insertion (e.g. they won't stop direct IO
submission), and they don't do b) at all. Hence they don't create an
IO barrier, and so I don't see how they can replace the existing IO
vs block map synchronisation mechanisms we already have.

Yes, it will allow us to ensure mmap/direct IO coherency because we
can block page faults while DIO is in progress, but AFAICT it
doesn't prevent us from submitting more DIO on the inode at the same
time...

> > i.e. if the locking only occurs in the page insert slow path, then
> > it doesn't provide sufficient exclusion for extent manipulation
> > operations that use the page cache during their normal operation.
> > IOWs, other, higher level synchronisation methods for fallocate
> > are still necessary....
> 
> Well, I would argue it _is_ enough for internal consistency; if you want to
> provide stronger atomicity guarantees to userspace (not that there's anything
> wrong with that), I think I'd argue that's really a different issue.
> 
> Also the situation seems to be that XFS provides stronger guarantees than any
> other Linux filesystem, which is sucky from an application point of view
> because applications don't want to have to depend on running on XFS for
> correctness.

Sure, but that's the reality that people have lived with for a long
time, especially if they have an app that needs direct IO
concurrency (e.g. databases) or large amounts of data to be
stored...

> It seems
> like it'd actually be pretty easy to lift your locking scheme into the generic
> VFS code, and then possibly even have it controlled by a per-inode flag so that
> applications that don't need the locking aren't paying for it.

Slowly, slowly starting to happen:

commit 5955102c9984fa081b2d570cfac75c97eecf8f3b
Author: Al Viro <viro@...iv.linux.org.uk>
Date:   Fri Jan 22 15:40:57 2016 -0500

    wrappers for ->i_mutex access
    
    parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).
    
    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.
    
    Signed-off-by: Al Viro <viro@...iv.linux.org.uk>

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
