linux-kernel - fallocate INSERT_RANGE/COLLAPSE_RANGE is completely broken [PATCH]

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160329042546.GA10581@kmo-pixel>
Date:	Mon, 28 Mar 2016 20:25:46 -0800
From:	Kent Overstreet <kent.overstreet@...il.com>
To:	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Cc:	Al Viro <viro@...iv.linux.org.uk>, Rik van Riel <riel@...hat.com>,
	Dave Chinner <david@...morbit.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Theodore Ts'o <tytso@....edu>
Subject: fallocate INSERT_RANGE/COLLAPSE_RANGE is completely broken [PATCH]

Bit of previous discussion:
http://thread.gmane.org/gmane.linux.file-systems/101201/

The underlying issue is that we have no mechanism for invalidating a range of
the pagecache and then _keeping it invalidated_ while we Do Stuff. 

The fallocate INSERT_RANGE/COLLAPSE_RANGE situation seems likely to be worse
than I initially thought. I've been digging into this in the course of bcachefs
testing - I was hitting assertions that meant state hanging off the page cache
(in this case, allocation information, i.e. whether we needed to reserve space
on write) was inconsistent with the btree in writepages().

Well, bcachefs isn't the only filesystem that hangs additional state off the
pagecache, and the situation today is that an unpriviliged user can cause
inconsistencies there by just doing buffered reads concurrently with
INSERT_RANGE/COLLAPSE_RANGE. I highly highly doubt this is an issue of just
"oops, you corrupted your file because you were doing stupid stuff" - who knows
what internal invariants are getting broken here, and I don't particularly care
to find out.

I'm pretty certain that ext4 and xfs are both broken here - I haven't dug into
the btrfs code. The impression I get from reading the code is that whoever wrote
it was cripping off of truncate - except that truncate relies on the i_size
change to work; without that there's no way this code can possibly work.

That's the bad news. Good news is, I think it's fixable.

My solution is just to add a lock (per address_space) around adding pages to the
pagecache - it's got some funny semantics so as to be usable for dio writes, but
it's just a lock. (It's got two states like an rwlock, except that both states
are shared and only exclusive with the other states; dio writes don't block
other dio writes, buffered IO adding pages to the pagecache does't block other
pages being added).

The only tricky bit is the recursion Dave noted in the dio write path -
get_user_pages() -> fault(). My solution is to just keep track of what address
space we have locked (blocking page adds) in task_struct, then A) don't deadlock
in __add_to_page_cache_locked(), and B) don't do readahead in filemap_fault()
(which means we can delete another hack from the dio write code).

I don't particularly care for where the add side of the lock is taken - it
works, but it means we can't distinguish accidental recursion from intentional.
I'm open to ideas there. Also, we made better use of the add side of the lock we
could probably delete a lot of fragile code for synchronization with truncate.

Patch below is reliably passing xfstests for me (where previously I was hitting
the aforementioned pagecache inconsistency assertions). I've only hooked up
bcachefs so far, other filesystems should be pretty trivial (but it does need
individual filesystems to be converted).

One issue I haven't dealt with is that dio writes can starve buffered IO/mmapped
IO with this lock - not sure if that's worth dealing with.

Patch below includes bcachefs changes, but is otherwise on 4.5. Available in my
usual git repo:

https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-dev-pagecache-lock

-- >8 --
Subject: [PATCH] mm: pagecache add lock

Add a per address space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.

Signed-off-by: Kent Overstreet <kent.overstreet@...il.com>
---
 drivers/md/bcache/fs-io.c | 197 +++++++++++++++++++++++++---------------------
 fs/inode.c                |   1 +
 include/linux/fs.h        |  23 ++++++
 include/linux/init_task.h |   1 +
 include/linux/sched.h     |   4 +
 mm/filemap.c              |  85 ++++++++++++++++++--
 6 files changed, 217 insertions(+), 94 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 69b8b526c1..617c61e070 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -345,6 +345,7 @@ void address_space_init_once(struct address_space *mapping)
 {
 	memset(mapping, 0, sizeof(*mapping));
 	INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
+	pagecache_lock_init(&mapping->add_lock);
 	spin_lock_init(&mapping->tree_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100210..a806c09c21 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -424,9 +424,32 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+	atomic_long_t		v;
+	wait_queue_head_t	wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+	atomic_long_set(&lock->v, 0);
+	init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
+	struct pagecache_lock	add_lock;	/* protects adding new pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	atomic_t		i_mmap_writable;/* count VM_SHARED mappings */
 	struct rb_root		i_mmap;		/* tree of private and shared mappings */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4551..1bded65b89 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -235,6 +235,7 @@ extern struct task_group root_task_group;
 		.signal = {{0}}},					\
 	.blocked	= {{0}},					\
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(tsk.alloc_lock),		\
+	.pagecache_lock = NULL,						\
 	.journal_info	= NULL,						\
 	.cpu_timers	= INIT_CPU_TIMERS(tsk.cpu_timers),		\
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(tsk.pi_lock),	\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ea5e7c5ce6..48762b6b75 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -134,6 +134,7 @@ struct perf_event_context;
 struct blk_plug;
 struct filename;
 struct nameidata;
+struct pagecache_lock;
 
 #define VMACACHE_BITS 2
 #define VMACACHE_SIZE (1U << VMACACHE_BITS)
@@ -1649,6 +1650,9 @@ struct task_struct {
 	unsigned int in_ubsan;
 #endif
 
+	/* currently held lock, for avoiding recursing in fault path: */
+	struct pagecache_lock *pagecache_lock;
+
 /* journalling filesystem info */
 	void *journal_info;
 
diff --git a/mm/filemap.c b/mm/filemap.c
index eae0d353d1..c9c24ceed9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,6 +110,67 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+	BUG_ON(atomic_long_read(&lock->v) == 0);
+
+	if (atomic_long_sub_return_release(i, &lock->v) == 0)
+		wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+	long v = atomic_long_read(&lock->v), old;
+
+	do {
+		old = v;
+
+		if (i > 0 ? v < 0 : v > 0)
+			return false;
+	} while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+					old, old + i)) != old);
+	return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+	wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, 1);
+}
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, 1);
+}
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, -1);
+}
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+}
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+	BUG_ON(current->pagecache_lock != lock);
+	current->pagecache_lock = NULL;
+	__pagecache_lock_put(lock, -1);
+}
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+	BUG_ON(current->pagecache_lock);
+	current->pagecache_lock = lock;
+}
+
 static void page_cache_tree_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
@@ -631,18 +692,21 @@ static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
+
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
 		if (error)
-			return error;
+			goto err;
 	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
 			mem_cgroup_cancel_charge(page, memcg, false);
-		return error;
+		goto err;
 	}
 
 	page_cache_get(page);
@@ -662,7 +726,11 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
+err:
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
+
+	return error;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
@@ -670,7 +738,7 @@ err_insert:
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
-	return error;
+	goto err;
 }
 
 /**
@@ -1771,7 +1839,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * Do we have something in the page cache already?
 	 */
 	page = find_get_page(mapping, offset);
-	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+	if (unlikely(current->pagecache_lock == &mapping->add_lock)) {
+		/*
+		 * fault from e.g. dio -> get_user_pages() - _don't_ want to do
+		 * readahead, only read in page we need:
+		 */
+		if (!page)
+			goto no_cached_page;
+	} else if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
diff --git a/drivers/md/bcache/fs-io.c b/drivers/md/bcache/fs-io.c
index 24cd1297ef..96642f2ef0 100644
--- a/drivers/md/bcache/fs-io.c
+++ b/drivers/md/bcache/fs-io.c
@@ -25,6 +25,36 @@ struct bio_set *bch_writepage_bioset;
 struct bio_set *bch_dio_read_bioset;
 struct bio_set *bch_dio_write_bioset;
 
+/* pagecache_block must be held */
+static int write_invalidate_inode_pages_range(struct address_space *mapping,
+					      loff_t start, loff_t end)
+{
+	int ret;
+
+	/*
+	 * XXX: the way this is currently implemented, we can spin if a process
+	 * is continually redirtying a specific page
+	 */
+	do {
+		if (!mapping->nrpages &&
+		    !mapping->nrexceptional)
+			return 0;
+
+		ret = filemap_write_and_wait_range(mapping, start, end);
+		if (ret)
+			break;
+
+		if (!mapping->nrpages)
+			return 0;
+
+		ret = invalidate_inode_pages2_range(mapping,
+				start >> PAGE_CACHE_SHIFT,
+				end >> PAGE_CACHE_SHIFT);
+	} while (ret == -EBUSY);
+
+	return ret;
+}
+
 /* i_size updates: */
 
 /*
@@ -1085,13 +1115,16 @@ int bch_write_begin(struct file *file, struct address_space *mapping,
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
 	struct page *page;
-	int ret = 0;
+	int ret = -ENOMEM;
 
 	BUG_ON(inode_unhashed(mapping->host));
 
+	/* Not strictly necessary - same reason as mkwrite(): */
+	pagecache_add_get(&mapping->add_lock);
+
 	page = grab_cache_page_write_begin(mapping, index, flags);
 	if (!page)
-		return -ENOMEM;
+		goto err_unlock;
 
 	if (PageUptodate(page))
 		goto out;
@@ -1132,11 +1165,13 @@ out:
 	}
 
 	*pagep = page;
-	return ret;
+	return 0;
 err:
 	unlock_page(page);
 	page_cache_release(page);
 	*pagep = NULL;
+err_unlock:
+	pagecache_add_put(&mapping->add_lock);
 	return ret;
 }
 
@@ -1211,6 +1246,7 @@ int bch_write_end(struct file *filp, struct address_space *mapping,
 out:
 	unlock_page(page);
 	page_cache_release(page);
+	pagecache_add_put(&mapping->add_lock);
 
 	return copied;
 }
@@ -1331,7 +1367,9 @@ out:
 
 static long __bch_dio_write_complete(struct dio_write *dio)
 {
-	struct inode *inode = dio->req->ki_filp->f_inode;
+	struct file *file = dio->req->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = file->f_inode;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	long ret = dio->error ?: dio->written;
@@ -1340,7 +1378,8 @@ static long __bch_dio_write_complete(struct dio_write *dio)
 
 	i_sectors_dirty_put(ei, &dio->i_sectors_hook);
 
-	inode_dio_end(dio->req->ki_filp->f_inode);
+	__pagecache_block_put(&mapping->add_lock);
+	inode_dio_end(inode);
 
 	if (dio->iovec && dio->iovec != dio->inline_vecs)
 		kfree(dio->iovec);
@@ -1425,12 +1464,17 @@ static void bch_dio_write_loop_async(struct closure *cl)
 {
 	struct dio_write *dio =
 		container_of(cl, struct dio_write, cl);
+	struct address_space *mapping = dio->req->ki_filp->f_mapping;
 
 	bch_dio_write_done(dio);
 
 	if (dio->iter.count && !dio->error) {
 		use_mm(dio->mm);
+		pagecache_block_get(&mapping->add_lock);
+
 		bch_do_direct_IO_write(dio, false);
+
+		pagecache_block_put(&mapping->add_lock);
 		unuse_mm(dio->mm);
 
 		continue_at(&dio->cl,
@@ -1450,6 +1494,7 @@ static int bch_direct_IO_write(struct cache_set *c, struct kiocb *req,
 			       struct file *file, struct inode *inode,
 			       struct iov_iter *iter, loff_t offset)
 {
+	struct address_space *mapping = file->f_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct dio_write *dio;
 	struct bio *bio;
@@ -1504,6 +1549,7 @@ static int bch_direct_IO_write(struct cache_set *c, struct kiocb *req,
 	closure_init(&dio->cl, NULL);
 
 	inode_dio_begin(inode);
+	__pagecache_block_get(&mapping->add_lock);
 
 	/*
 	 * appends are sync in order to do the i_size update under
@@ -1595,67 +1641,30 @@ ssize_t bch_direct_IO(struct kiocb *req, struct iov_iter *iter, loff_t offset)
 static ssize_t
 bch_direct_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos)
 {
-	struct file	*file = iocb->ki_filp;
+	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	ssize_t		written;
-	size_t		write_len;
-	pgoff_t		end;
+	ssize_t	ret;
 
-	write_len = iov_iter_count(from);
-	end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT;
+	pagecache_block_get(&mapping->add_lock);
 
-	written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1);
-	if (written)
-		goto out;
-
-	/*
-	 * After a write we want buffered reads to be sure to go to disk to get
-	 * the new data.  We invalidate clean cached page from the region we're
-	 * about to write.  We do this *before* the write so that we can return
-	 * without clobbering -EIOCBQUEUED from ->direct_IO().
-	 */
-	if (mapping->nrpages) {
-		written = invalidate_inode_pages2_range(mapping,
-					pos >> PAGE_CACHE_SHIFT, end);
-		/*
-		 * If a page can not be invalidated, return 0 to fall back
-		 * to buffered write.
-		 */
-		if (written) {
-			if (written == -EBUSY)
-				return 0;
-			goto out;
-		}
-	}
+	/* Write and invalidate pagecache range that we're writing to: */
+	ret = write_invalidate_inode_pages_range(file->f_mapping, pos,
+					pos + iov_iter_count(from) - 1);
+	if (ret)
+		goto err;
 
-	written = mapping->a_ops->direct_IO(iocb, from, pos);
+	ret = bch_direct_IO(iocb, from, pos);
+err:
+	pagecache_block_put(&mapping->add_lock);
 
-	/*
-	 * Finally, try again to invalidate clean pages which might have been
-	 * cached by non-direct readahead, or faulted in by get_user_pages()
-	 * if the source of the write was an mmap'ed region of the file
-	 * we're writing.  Either one is a pretty crazy thing to do,
-	 * so we don't support it 100%.  If this invalidation
-	 * fails, tough, the write still worked...
-	 *
-	 * Augh: this makes no sense for async writes - the second invalidate
-	 * has to come after the new data is visible. But, we can't just move it
-	 * to the end of the dio write path - for async writes we don't have
-	 * i_mutex held anymore, 
-	 */
-	if (mapping->nrpages) {
-		invalidate_inode_pages2_range(mapping,
-					      pos >> PAGE_CACHE_SHIFT, end);
-	}
-out:
-	return written;
+	return ret;
 }
 
 static ssize_t __bch_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
-	struct address_space * mapping = file->f_mapping;
-	struct inode 	*inode = mapping->host;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	ssize_t	ret;
 
 	/* We can write back this queue in page reclaim */
@@ -1713,10 +1722,13 @@ int bch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	file_update_time(vma->vm_file);
 
 	/*
-	 * i_mutex is required for synchronizing with fcollapse(), O_DIRECT
-	 * writes
+	 * Not strictly necessary, but helps avoid dio writes livelocking in
+	 * write_invalidate_inode_pages_range() - can drop this if/when we get
+	 * a write_invalidate_inode_pages_range() that works without dropping
+	 * page lock before invalidating page
 	 */
-	mutex_lock(&inode->i_mutex);
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
 
 	lock_page(page);
 	if (page->mapping != mapping ||
@@ -1736,7 +1748,8 @@ int bch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		set_page_dirty(page);
 	wait_for_stable_page(page);
 out:
-	mutex_unlock(&inode->i_mutex);
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
@@ -1744,8 +1757,8 @@ out:
 void bch_invalidatepage(struct page *page, unsigned int offset,
 			unsigned int length)
 {
-	BUG_ON(!PageLocked(page));
-	BUG_ON(PageWriteback(page));
+	EBUG_ON(!PageLocked(page));
+	EBUG_ON(PageWriteback(page));
 
 	if (offset || length < PAGE_CACHE_SIZE)
 		return;
@@ -1755,12 +1768,13 @@ void bch_invalidatepage(struct page *page, unsigned int offset,
 
 int bch_releasepage(struct page *page, gfp_t gfp_mask)
 {
-	BUG_ON(!PageLocked(page));
-	BUG_ON(PageWriteback(page));
-	BUG_ON(PageDirty(page));
+	EBUG_ON(!PageLocked(page));
+	EBUG_ON(PageWriteback(page));
 
-	bch_clear_page_bits(page);
+	if (PageDirty(page))
+		return 0;
 
+	bch_clear_page_bits(page);
 	return 1;
 }
 
@@ -1935,6 +1949,7 @@ static int bch_truncate_page(struct address_space *mapping, loff_t from)
 
 int bch_truncate(struct inode *inode, struct iattr *iattr)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	struct i_size_update *u;
@@ -1943,6 +1958,7 @@ int bch_truncate(struct inode *inode, struct iattr *iattr)
 	int ret = 0;
 
 	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
 
 	mutex_lock(&ei->update_lock);
 
@@ -1986,7 +2002,7 @@ int bch_truncate(struct inode *inode, struct iattr *iattr)
 	 * just put it, because it actually is still dirty
 	 */
 	if (unlikely(ret))
-		return ret;
+		goto err;
 
 	/*
 	 * truncate_setsize() does the i_size_write(), can't use
@@ -2007,12 +2023,12 @@ int bch_truncate(struct inode *inode, struct iattr *iattr)
 
 		ret = i_sectors_dirty_get(ei, &i_sectors_hook);
 		if (unlikely(ret))
-			return ret;
+			goto err;
 
 		ret = bch_truncate_page(inode->i_mapping, iattr->ia_size);
 		if (unlikely(ret)) {
 			i_sectors_dirty_put(ei, &i_sectors_hook);
-			return ret;
+			goto err;
 		}
 
 		ret = bch_inode_truncate(c, inode->i_ino,
@@ -2023,18 +2039,24 @@ int bch_truncate(struct inode *inode, struct iattr *iattr)
 		i_sectors_dirty_put(ei, &i_sectors_hook);
 
 		if (unlikely(ret))
-			return ret;
+			goto err;
 	}
 
 	setattr_copy(inode, iattr);
 
+	pagecache_block_put(&mapping->add_lock);
+
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	i_size_update_put(c, ei, idx, 1);
 	return 0;
+err:
+	pagecache_block_put(&mapping->add_lock);
+	return ret;
 }
 
 static long bch_fpunch(struct inode *inode, loff_t offset, loff_t len)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	u64 ino = inode->i_ino;
@@ -2043,6 +2065,9 @@ static long bch_fpunch(struct inode *inode, loff_t offset, loff_t len)
 	int ret = 0;
 
 	mutex_lock(&inode->i_mutex);
+	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
+
 	ret = __bch_truncate_page(inode->i_mapping,
 				  offset >> PAGE_CACHE_SHIFT,
 				  offset, offset + len);
@@ -2078,6 +2103,7 @@ static long bch_fpunch(struct inode *inode, loff_t offset, loff_t len)
 		i_sectors_dirty_put(ei, &i_sectors_hook);
 	}
 out:
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 
 	return ret;
@@ -2085,6 +2111,7 @@ out:
 
 static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	struct btree_iter src;
@@ -2111,11 +2138,10 @@ static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 	 * btree, and the btree consistent with i_size - we don't need outside
 	 * locking for the extents btree itself, because we're using linked
 	 * iterators
-	 *
-	 * XXX: hmm, need to prevent reads adding things to the pagecache until
-	 * we're done?
 	 */
 	mutex_lock(&inode->i_mutex);
+	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
 
 	ret = -EINVAL;
 	if (offset + len >= inode->i_size)
@@ -2126,19 +2152,8 @@ static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 
 	new_size = inode->i_size - len;
 
-	inode_dio_wait(inode);
-
-	do {
-		ret = filemap_write_and_wait_range(inode->i_mapping,
-						   offset, LLONG_MAX);
-		if (ret)
-			goto err;
-
-		ret = invalidate_inode_pages2_range(inode->i_mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					ULONG_MAX);
-	} while (ret == -EBUSY);
-
+	ret = write_invalidate_inode_pages_range(inode->i_mapping,
+						 offset, LLONG_MAX);
 	if (ret)
 		goto err;
 
@@ -2227,6 +2242,7 @@ static long bch_fcollapse(struct inode *inode, loff_t offset, loff_t len)
 
 	mutex_unlock(&ei->update_lock);
 
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 
 	return ret;
@@ -2239,6 +2255,7 @@ err_unwind:
 err:
 	bch_btree_iter_unlock(&src);
 	bch_btree_iter_unlock(&dst);
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 	return ret;
 }
@@ -2246,6 +2263,7 @@ err:
 static long bch_fallocate(struct inode *inode, int mode,
 				    loff_t offset, loff_t len)
 {
+	struct address_space *mapping = inode->i_mapping;
 	struct bch_inode_info *ei = to_bch_ei(inode);
 	struct cache_set *c = inode->i_sb->s_fs_info;
 	struct i_sectors_hook i_sectors_hook;
@@ -2261,6 +2279,8 @@ static long bch_fallocate(struct inode *inode, int mode,
 	bch_btree_iter_init_intent(&iter, c, BTREE_ID_EXTENTS, POS_MIN);
 
 	mutex_lock(&inode->i_mutex);
+	inode_dio_wait(inode);
+	pagecache_block_get(&mapping->add_lock);
 
 	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
 	    new_size > inode->i_size) {
@@ -2270,9 +2290,6 @@ static long bch_fallocate(struct inode *inode, int mode,
 	}
 
 	if (mode & FALLOC_FL_ZERO_RANGE) {
-		/* just for __bch_truncate_page(): */
-		inode_dio_wait(inode);
-
 		ret = __bch_truncate_page(inode->i_mapping,
 					  offset >> PAGE_CACHE_SHIFT,
 					  offset, offset + len);
@@ -2374,6 +2391,7 @@ static long bch_fallocate(struct inode *inode, int mode,
 		i_size_update_put(c, ei, idx, 1);
 	}
 
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 
 	return 0;
@@ -2381,6 +2399,7 @@ err_put_sectors_dirty:
 	i_sectors_dirty_put(ei, &i_sectors_hook);
 err:
 	bch_btree_iter_unlock(&iter);
+	pagecache_block_put(&mapping->add_lock);
 	mutex_unlock(&inode->i_mutex);
 	return ret;
 }
-- 
2.8.0.rc3