linux-ext4 - Re: Subtle races between DAX mmap fault and write path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160727221949.GU16044@dastard>
Date:	Thu, 28 Jul 2016 08:19:49 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Ross Zwisler <ross.zwisler@...ux.intel.com>,
	Jan Kara <jack@...e.cz>, linux-fsdevel@...r.kernel.org,
	linux-nvdimm@...ts.01.org, xfs@....sgi.com,
	linux-ext4@...r.kernel.org, Dan Williams <dan.j.williams@...el.com>
Subject: Re: Subtle races between DAX mmap fault and write path

On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > Hi,
> > 
> > when testing my latest changes to DXA fault handling code I have hit the
> > following interesting race between the fault and write path (I'll show
> > function names for ext4 but xfs has the same issue AFAICT).

The XFS update I just pushed to Linus contains a rework of the XFS
DAX IO path. It no longer shares the XFS direct IO path, so it
doesn't contain any of the direct IO warts anymore.

> > We have a file 'f' which has a hole at offset 0.
> > 
> > Process 0				Process 1
> > 
> > data = mmap('f');
> > read data[0]
> >   -> fault, we map a hole page
> > 
> > 					pwrite('f', buf, len, 0)
> > 					  -> ext4_file_write_iter
> > 					    inode_lock(inode);
> > 					    __generic_file_write_iter()
> > 					      generic_file_direct_write()
> > 						invalidate_inode_pages2_range()
> > 						  - drops hole page from
> > 						    the radix tree
> > 						ext4_direct_IO()
> > 						  dax_do_io()
> > 						    - allocates block for
> > 						      offset 0
> > data[0] = 1
> >   -> page_mkwrite fault
> >     -> ext4_dax_fault()
> >       down_read(&EXT4_I(inode)->i_mmap_sem);
> >       __dax_fault()
> > 	grab_mapping_entry()
> > 	  - creates locked radix tree entry
> > 	- maps block into PTE
> > 	put_locked_mapping_entry()
> > 
> > 						invalidate_inode_pages2_range()
> > 						  - removes dax entry from
> > 						    the radix tree

In XFS, we don't call __generic_file_write_iter or
generic_file_direct_write(), and xfs_file_dax_write() does not have
this trailing call to invalidate_inode_pages2_range() anymore. It's
DAX - there's nothing to invalidate, right?

So I think Christoph just (accidentally) removed this race condition
from XFS....

> > So we have just lost information that block 0 is mapped and needs flushing
> > caches.
> > 
> > Also the fact that the consistency of data as viewed by mmap and
> > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > unexpected to me and we should document it somewhere.

I don't think it does - what, in DAX, is incoherent? If anything,
it's the data in the direct IO buffer, not the view the mmap will
see. i.e. the post-write invalidate is to ensure that applications
that have mmapped the file see the data written by direct IO. That's
not necessary with DAX.

> > The question is how to best fix this. I see three options:
> > 
> > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > harsh but should work - we call filemap_write_and_wait() in
> > generic_file_direct_write() so we flush out all caches for the relevant
> > area before dropping radix tree entries.

We don't call filemap_write_and_wait() in xfs_file_dax_write()
anymore, either. It's DAX - we don't need to flush anything to read
the correct data, and there's nothing cached that becomes stale when
we overwrite the destination memory.

[snip]

> > Any opinions on this?
> 
> Can we just skip the two calls to invalidate_inode_pages2_range() in
> generic_file_direct_write() for DAX I/O?

The first invalidate is still there in XFS. The comment above it
in the new XFS code says:

	/*
	 * Yes, even DAX files can have page cache attached to them:  A zeroed
	 * page is inserted into the pagecache when we have to serve a write
	 * fault on a hole.  It should never be dirtied and can simply be
	 * dropped from the pagecache once we get real data for the page.
	 */
	if (mapping->nrpages) {
		ret = invalidate_inode_pages2(mapping);
		WARN_ON_ONCE(ret);
	}


> Similarly, for DAX I don't think we actually need to do the
> filemap_write_and_wait_range() call in generic_file_direct_write() either.
> It's a similar scenario - for direct I/O we are trying to make sure that any
> dirty data in the page cache is written out to media before the ->direct_IO()
> call happens.  For DAX I don't think we care.  If a user does an mmap() write
> which creates a dirty radix tree entry, then does a write(), we should be able
> to happily overwrite the old data with the new without flushing, and just
> leave the dirty radix tree entry in place.

Yup, that's pretty much how XFS treats DAX now - it's not direct IO,
but it's not buffered IO, either...

I don't think there's a problem that needs to be fixed in the DAX
code - the issue is all about the surrounding IO context. i.e
whether do_dax_io() is automatically coherent with mmap or not
because mmap directly exposes the CPU cache coherent memory
do_dax_io() modifies.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html