Message-ID: <20170425111043.GH2793@quack2.suse.cz>
Date: Tue, 25 Apr 2017 13:10:43 +0200
From: Jan Kara <jack@...e.cz>
To: Ross Zwisler <ross.zwisler@...ux.intel.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org,
Alexander Viro <viro@...iv.linux.org.uk>,
Alexey Kuznetsov <kuznet@...tuozzo.com>,
Andrey Ryabinin <aryabinin@...tuozzo.com>,
Anna Schumaker <anna.schumaker@...app.com>,
Christoph Hellwig <hch@....de>,
Dan Williams <dan.j.williams@...el.com>,
"Darrick J. Wong" <darrick.wong@...cle.com>,
Eric Van Hensbergen <ericvh@...il.com>,
Jan Kara <jack@...e.cz>, Jens Axboe <axboe@...nel.dk>,
Johannes Weiner <hannes@...xchg.org>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
Latchesar Ionkov <lucho@...kov.net>,
linux-cifs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-mm@...ck.org, linux-nfs@...r.kernel.org,
linux-nvdimm@...ts.01.org, Matthew Wilcox <mawilcox@...rosoft.com>,
Ron Minnich <rminnich@...dia.gov>,
samba-technical@...ts.samba.org, Steve French <sfrench@...ba.org>,
Trond Myklebust <trond.myklebust@...marydata.com>,
v9fs-developer@...ts.sourceforge.net
Subject: Re: [PATCH 2/2] dax: fix data corruption due to stale mmap reads

On Thu 20-04-17 21:44:37, Ross Zwisler wrote:
> Users of DAX can suffer data corruption from stale mmap reads via the
> following sequence:
>
> - open an mmap over a 2MiB hole
>
> - read from a 2MiB hole, faulting in a 2MiB zero page
>
> - write to the hole with write(3p). The write succeeds but we incorrectly
> leave the 2MiB zero page mapping intact.
>
> - via the mmap, read the data that was just written. Since the zero page
> mapping is still intact we read back zeroes instead of the new data.
>
> We fix this by unconditionally calling invalidate_inode_pages2_range() in
> dax_iomap_actor() for new block allocations, and by enhancing
> __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
> being removed from the radix tree.
>
> This is based on an initial patch from Jan Kara.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <jack@...e.cz>
> Cc: <stable@...r.kernel.org> [4.10+]
> ---
> fs/dax.c | 26 +++++++++++++++++++-------
> 1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 166504c..3f445d5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
> pgoff_t index, bool trunc)
> {
> int ret = 0;
> - void *entry;
> + void *entry, **slot;
> struct radix_tree_root *page_tree = &mapping->page_tree;
>
> spin_lock_irq(&mapping->tree_lock);
> - entry = get_unlocked_mapping_entry(mapping, index, NULL);
> + entry = get_unlocked_mapping_entry(mapping, index, &slot);
> if (!entry || !radix_tree_exceptional_entry(entry))
> goto out;
> if (!trunc &&
> (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
> radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
> goto out;
> +
> + /*
> + * Make sure 'entry' remains valid while we drop mapping->tree_lock to
> + * do the unmap_mapping_range() call.
> + */
> + entry = lock_slot(mapping, slot);

This also stops page faults from mapping the entry again. Maybe worth
mentioning here as well.

> + spin_unlock_irq(&mapping->tree_lock);
> +
> + unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
> + (loff_t)PAGE_SIZE << dax_radix_order(entry), 0);

Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping
large ranges - each unmap means an rmap walk... Since this is a data
corruption class of bug, let's fix it this way for now but I think we'll
need to improve this later.

E.g. what if we called unmap_mapping_range() for the whole invalidated
range after removing the radix tree entries?
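
To make the cost argument concrete, here is a rough user-space sketch, not
kernel code. It assumes every index in the invalidated range holds a DAX entry
and counts each call to a stand-in unmap function as one rmap walk. The names
model_unmap_range(), unmap_per_entry() and unmap_batched() are hypothetical and
exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical cost model: one call to the stand-in unmap == one rmap walk. */
struct unmap_stats {
    unsigned long rmap_walks;
};

/* Stand-in for unmap_mapping_range(); it only counts how often it is called. */
static void model_unmap_range(struct unmap_stats *st,
                              unsigned long start, unsigned long nr)
{
    (void)start;
    (void)nr;
    st->rmap_walks++;
}

/*
 * Per-entry unmapping, as the patch does it: one unmap call for each entry
 * in [start, end]. Assumes every index in the range has an entry.
 */
unsigned long unmap_per_entry(unsigned long start, unsigned long end)
{
    struct unmap_stats st = { 0 };

    for (unsigned long i = start; i <= end; i++)
        model_unmap_range(&st, i, 1);
    return st.rmap_walks;
}

/*
 * Batched alternative: remove all radix tree entries first (not modelled
 * here), then issue a single unmap call covering the whole range.
 */
unsigned long unmap_batched(unsigned long start, unsigned long end)
{
    struct unmap_stats st = { 0 };

    model_unmap_range(&st, start, end - start + 1);
    return st.rmap_walks;
}
```

In this model the per-entry variant scales linearly with the number of entries
while the batched variant stays at one walk. It says nothing about the locking
needed to make the batched variant safe against concurrent faults.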

Hum, but now thinking more about it, I have a hard time figuring out why write
vs fault cannot actually still race:

CPU1 - write(2)                         CPU2 - read fault

                                        dax_iomap_pte_fault()
                                          ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
                                          grab_mapping_entry()
                                          - we add zero page in the radix
                                            tree & map it to page tables

Similarly, a read fault vs a write fault may end up racing the wrong way and
try to replace an already existing exceptional entry with a hole page?
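
The write vs read fault interleaving above can be replayed deterministically in
a small user-space model. This is only a sketch under simplifying assumptions:
struct file_model, the enum and all function names are made up for the
illustration, a single flag stands in for block allocation, and the read fault
is split into the two steps shown in the diagram (the hole check, then the
insertion of the entry).

```c
#include <stdbool.h>

/* Hypothetical model of what the radix tree slot can hold. */
enum entry { ENTRY_NONE, ENTRY_ZERO_PAGE, ENTRY_BLOCK };

struct file_model {
    bool blocks_allocated;   /* has the filesystem allocated blocks? */
    enum entry radix_entry;  /* current radix tree entry for the range */
};

/* Read fault, step 1: models ->iomap_begin(); true means "sees hole". */
static bool fault_iomap_begin(const struct file_model *f)
{
    return !f->blocks_allocated;
}

/* write(2): allocate blocks, then invalidate a zero page entry if present. */
static void write_path(struct file_model *f)
{
    f->blocks_allocated = true;
    if (f->radix_entry == ENTRY_ZERO_PAGE)
        f->radix_entry = ENTRY_NONE;  /* otherwise nothing to invalidate */
}

/* Read fault, step 2: insert an entry based on the earlier hole check. */
static void fault_install(struct file_model *f, bool saw_hole)
{
    f->radix_entry = saw_hole ? ENTRY_ZERO_PAGE : ENTRY_BLOCK;
}

/* Replays the racy order: fault step 1, whole write, fault step 2. */
bool racy_interleaving_leaves_stale_zero_page(void)
{
    struct file_model f = { false, ENTRY_NONE };
    bool saw_hole = fault_iomap_begin(&f);  /* CPU2: sees hole */

    write_path(&f);                         /* CPU1: nothing to invalidate */
    fault_install(&f, saw_hole);            /* CPU2: maps zero page */
    return f.blocks_allocated && f.radix_entry == ENTRY_ZERO_PAGE;
}

/* Same steps with the write fully ordered before the fault. */
bool serialized_order_is_consistent(void)
{
    struct file_model f = { false, ENTRY_NONE };

    write_path(&f);
    fault_install(&f, fault_iomap_begin(&f));
    return f.blocks_allocated && f.radix_entry == ENTRY_BLOCK;
}
```

In the racy replay the model ends with blocks allocated but a zero page entry
still in place, which is the stale-read state from the patch description. The
model ignores entry locking entirely, so it shows the window, not a fix.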

                                                                Honza

> +
> + spin_lock_irq(&mapping->tree_lock);
> radix_tree_delete(page_tree, index);
> mapping->nrexceptional--;
> ret = 1;
> out:
> - put_unlocked_mapping_entry(mapping, index, entry);
> spin_unlock_irq(&mapping->tree_lock);
> + dax_wake_mapping_entry_waiter(mapping, index, entry, true);
> return ret;
> }
> /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> return -EIO;
>
> /*
> - * Write can allocate block for an area which has a hole page mapped
> - * into page tables. We have to tear down these mappings so that data
> - * written by write(2) is visible in mmap.
> + * Write can allocate block for an area which has a hole page or zero
> + * PMD entry in the radix tree. We have to tear down these mappings so
> + * that data written by write(2) is visible in mmap.
> */
> - if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> + if (iomap->flags & IOMAP_F_NEW) {
> invalidate_inode_pages2_range(inode->i_mapping,
> pos >> PAGE_SHIFT,
> (end - 1) >> PAGE_SHIFT);
> --
> 2.9.3
>
--
Jan Kara <jack@...e.com>
SUSE Labs, CR