linux-kernel - Re: [PATCH v8 6/9] dax: add support for fsync/msync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160113073019.GB30496@linux.intel.com>
Date:	Wed, 13 Jan 2016 00:30:19 -0700
From:	Ross Zwisler <ross.zwisler@...ux.intel.com>
To:	Jan Kara <jack@...e.cz>
Cc:	Ross Zwisler <ross.zwisler@...ux.intel.com>,
	linux-kernel@...r.kernel.org, "H. Peter Anvin" <hpa@...or.com>,
	"J. Bruce Fields" <bfields@...ldses.org>,
	Theodore Ts'o <tytso@....edu>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Andreas Dilger <adilger.kernel@...ger.ca>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Dan Williams <dan.j.williams@...el.com>,
	Dave Chinner <david@...morbit.com>,
	Dave Hansen <dave.hansen@...ux.intel.com>,
	Ingo Molnar <mingo@...hat.com>, Jan Kara <jack@...e.com>,
	Jeff Layton <jlayton@...chiereds.net>,
	Matthew Wilcox <matthew.r.wilcox@...el.com>,
	Matthew Wilcox <willy@...ux.intel.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-mm@...ck.org, linux-nvdimm@...ts.01.org, x86@...nel.org,
	xfs@....sgi.com
Subject: Re: [PATCH v8 6/9] dax: add support for fsync/msync

On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
> 
> Some comments below.
> 
> > ---
> >  fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  include/linux/dax.h |   2 +
> >  mm/filemap.c        |   6 ++
> >  3 files changed, 196 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5b84a46..0db21ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> >  #include <linux/pmem.h>
> >  #include <linux/sched.h>
> >  #include <linux/uio.h>
> > @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> >  	return 0;
> >  }
> >  
> > +#define NO_SECTOR -1
> > +
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> > +		sector_t sector, bool pmd_entry, bool dirty)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type, error = 0;
> > +	void *entry;
> > +
> > +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = radix_tree_lookup(page_tree, index);
> > +
> > +	if (entry) {
> > +		type = RADIX_DAX_TYPE(entry);
> > +		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > +					type != RADIX_DAX_PMD)) {
> > +			error = -EIO;
> > +			goto unlock;
> > +		}
> > +
> > +		if (!pmd_entry || type == RADIX_DAX_PMD)
> > +			goto dirty;
> > +		radix_tree_delete(&mapping->page_tree, index);
> > +		mapping->nrexceptional--;
> 
> In theory, you can delete here DIRTY / TOWRITE PTE entry and insert a clean
> PMD entry instead of it. That will cause fsync() to miss some flushes. So
> you should make sure you transfer all the tags to the new entry.

Ah, great catch, I'll address it in v9 which I'll send out tomorrow.

> > +static int dax_writeback_one(struct block_device *bdev,
> > +		struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	struct radix_tree_node *node;
> > +	struct blk_dax_ctl dax;
> > +	void **slot;
> > +	int ret = 0;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked.  These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> > +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > +		goto unlock;
> > +	if (*slot != entry)
> > +		goto unlock;
> > +
> > +	/* another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto unlock;
> > +
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > +		ret = -EIO;
> > +		goto unlock;
> > +	}
> > +
> > +	dax.sector = RADIX_DAX_SECTOR(entry);
> > +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> 
> This seems to be somewhat racy as well - if there are two fsyncs running
> against the same inode, one wins the race and clears TOWRITE tag, the
> second then bails out and may finish before the skipped page gets flushed.
> 
> So we should clear the TOWRITE tag only after the range is flushed.  This
> can result in some amount of duplicit flushing but I don't think the race
> will happen that frequently in practice to be performance relevant.

Yep, this make sense.  I'll also fix that in v9.

> And secondly: You must write-protect all mappings of the flushed range so
> that you get fault when the sector gets written-to again. We spoke about
> this in the past already but somehow it got lost and I forgot about it as
> well. You need something like rmap_walk_file()...

The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:

https://lkml.org/lkml/2015/11/13/759

I removed all the code that cleaned PTE entries and radix tree entries for v3.
The reason behind this was that there was a race that I couldn't figure out
how to solve between the cleaning of the PTEs and the cleaning of the radix
tree entries.

The race goes like this:

Thread 1 (write)			Thread 2 (fsync)
================			================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
					dax_writeback_mapping_range()
					dax_writeback_one()
					radix_tag_clear(DIRTY)
					pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()

After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry.  This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.

The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.

I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock.  I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.

The current code that leaves the radix tree entry will give us correct
behavior - it'll just be less efficient because we will have an ever-growing
dirty set to flush.

> > +	/*
> > +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> > +	 * eventually calls cond_resched().
> > +	 */
> > +	ret = dax_map_atomic(bdev, &dax);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	if (WARN_ON_ONCE(ret < dax.size)) {
> > +		ret = -EIO;
> > +		goto unmap;
> > +	}
> > +
> > +	wb_cache_pmem(dax.addr, dax.size);
> > + unmap:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return ret;
> > +
> > + unlock:
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	return ret;
> > +}
> 
> ...
> 
> > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   * dax_pfn_mkwrite - handle first write to DAX page
> >   * @vma: The virtual memory area where the fault occurred
> >   * @vmf: The description of the fault
> > - *
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct file *file = vma->vm_file;
> >  
> > -	sb_start_pagefault(sb);
> > -	file_update_time(vma->vm_file);
> > -	sb_end_pagefault(sb);
> > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> 
> Why is NO_SECTOR argument correct here?

Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty".  This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write.  If the fault was for a write, the radix tree entry will also
be made dirty.

For reads the radix tree entry will be inserted but left clean.  When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
tree entry & set the dirty tag.

This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().