Message-ID: <20160511192632.GA8841@linux.intel.com>
Date: Wed, 11 May 2016 13:26:32 -0600
From: Ross Zwisler <ross.zwisler@...ux.intel.com>
To: Jan Kara <jack@...e.cz>
Cc: Ross Zwisler <ross.zwisler@...ux.intel.com>,
linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
linux-mm@...ck.org, Dan Williams <dan.j.williams@...el.com>,
linux-nvdimm@...ts.01.org, Matthew Wilcox <willy@...ux.intel.com>
Subject: Re: [PATCH 16/18] dax: New fault locking
On Tue, May 10, 2016 at 02:27:15PM +0200, Jan Kara wrote:
> On Thu 05-05-16 22:13:50, Ross Zwisler wrote:
> > On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote:
> > > +/*
> > > + * Find radix tree entry at given index. If it points to a page, return with
> > > + * the page locked. If it points to the exceptional entry, return with the
> > > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > > + * create empty exceptional entry for the index and return with it locked.
> > > + *
> > > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > > + * persistent memory the benefit is doubtful. We can add that later if we can
> > > + * show it helps.
> > > + */
> > > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > +{
> > > + void *ret, **slot;
> > > +
> > > +restart:
> > > + spin_lock_irq(&mapping->tree_lock);
> > > + ret = get_unlocked_mapping_entry(mapping, index, &slot);
> > > + /* No entry for given index? Make sure radix tree is big enough. */
> > > + if (!ret) {
> > > + int err;
> > > +
> > > + spin_unlock_irq(&mapping->tree_lock);
> > > + err = radix_tree_preload(
> > > + mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >
> > In the conversation about v2 of this series you said:
> >
> > > Note that we take the hit for dropping the lock only if we really need to
> > > allocate new radix tree node so about once per 64 new entries. So it is not
> > > too bad.
> >
> > I think this is incorrect. We get here whenever we get a NULL return from
> > __radix_tree_lookup(). I believe that this happens if we don't have a node,
> > in which case we need an allocation, but I think it also happens in the case
> > where we do have a node and we just have a NULL slot in that node.
> >
> > For the behavior you're looking for (only preload if you need to do an
> > allocation), you probably need to check the 'slot' we get back from
> > get_unlocked_mapping_entry(), yea?
>
> You are correct. However, __radix_tree_lookup() currently doesn't return a
> slot pointer when the entry was not found, so this is not easy to fix. I'd
> leave the code as is for now and we can later optimize the case where we
> don't need to grow the radix tree...
Ah, you're right. Sure, that plan sounds good.
> > > +/*
> > > + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
> > > + * entry to get unlocked before deleting it.
> > > + */
> > > +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > +{
> > > + void *entry;
> > > +
> > > + spin_lock_irq(&mapping->tree_lock);
> > > + entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > > + /*
> > > + * Caller should make sure radix tree modifications don't race and
> > > + * we have seen exceptional entry here before.
> > > + */
> > > + if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
> >
> > dax_delete_mapping_entry() is only called from clear_exceptional_entry().
> > With this new code we've changed the behavior of that call path a little.
> >
> > In the various places where clear_exceptional_entry() is called, the code
> > batches up a bunch of entries in a pvec via pagevec_lookup_entries(). We
> > don't hold the mapping->tree_lock between the time this lookup happens and the
> > time that the entry is passed to clear_exceptional_entry(). This is why the
> > old code did a verification that the entry passed in matched what was still
> > currently present in the radix tree. This was done in the DAX case via
> > radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
> > for the page cache case. In both cases if the entry didn't match what was
> > currently in the tree, we bailed without doing anything.
> >
> > This new code doesn't verify against the 'entry' passed to
> > clear_exceptional_entry(), but instead makes sure it is an exceptional entry
> > before removing, and if not it does a WARN_ON_ONCE().
> >
> > This changes things because:
> >
> > a) If the exceptional entry changed, say from a plain lock entry to an actual
> > DAX entry, we wouldn't notice, and we would just clear the latter out. My
> > guess is that this is fine, I just wanted to call it out.
> >
> > b) If we have a non-exceptional entry here now, say because our lock entry has
> > been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
> > removal. I think we may want to silence the WARN_ON_ONCE(), as I believe this
> > could happen during normal operation and we don't want to scare anyone. :)
>
> So your concerns are exactly why I have added a comment to
> dax_delete_mapping_entry() that:
>
> /*
> * Caller should make sure radix tree modifications don't race and
> * we have seen exceptional entry here before.
> */
>
> The thing is, dax_delete_mapping_entry() is called only from the truncate /
> punch-hole path. Those paths should hold i_mmap_sem for writing, and thus
> there should be no modifications of the radix tree. If anything changes
> between what truncate_inode_pages() (or a similar function) finds and what
> dax_delete_mapping_entry() sees, we have a locking bug and I want to know
> about it :). Any suggestion how I should expand the comment so that this is
> clearer?
Ah, I didn't understand all that. :) Given a bit more context the comment
seems fine - if anything it could be a bit more specific, and include the
text: "dax_delete_mapping_entry() is called only from truncate / punch hole
path. Those should hold i_mmap_sem for writing and thus there should be no
modifications of the radix tree." Either way - thanks for explaining.
> > > +/*
> > > * The user has performed a load from a hole in the file. Allocating
> > > * a new page in the file would cause excessive storage usage for
> > > * workloads with sparse files. We allocate a page cache page instead.
> > > @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
> > > * otherwise it will simply fall out of the page cache under memory
> > > * pressure without ever having been dirtied.
> > > */
> > > -static int dax_load_hole(struct address_space *mapping, struct page *page,
> > > - struct vm_fault *vmf)
> > > +static int dax_load_hole(struct address_space *mapping, void *entry,
> > > + struct vm_fault *vmf)
> > > {
> > > - if (!page)
> > > - page = find_or_create_page(mapping, vmf->pgoff,
> > > - GFP_KERNEL | __GFP_ZERO);
> > > - if (!page)
> > > - return VM_FAULT_OOM;
> > > + struct page *page;
> > > +
> > > + /* Hole page already exists? Return it... */
> > > + if (!radix_tree_exceptional_entry(entry)) {
> > > + vmf->page = entry;
> > > + return VM_FAULT_LOCKED;
> > > + }
> > >
> > > + /* This will replace locked radix tree entry with a hole page */
> > > + page = find_or_create_page(mapping, vmf->pgoff,
> > > + vmf->gfp_mask | __GFP_ZERO);
> >
> > This replacement happens via page_cache_tree_insert(), correct? In this case,
> > who wakes up anyone waiting on the old lock entry that we just killed? In the
> > non-hole case we would traverse through put_locked_mapping_entry(), but I
> > don't see that in the hole case.
>
> Ha, good catch. We miss the wakeup. Fixed.
>
> Attached is the diff resulting from your review of this patch. I still have
> to hunt down that strange interaction with workingset code you've reported...
At the end of this mail I've attached one small fixup for the incremental diff
you sent. Aside from that, I think that you've addressed all my review
feedback, thanks!
Reviewed-by: Ross Zwisler <ross.zwisler@...ux.intel.com>
I'm going to try and get more info on the working set test failure.
---
diff --git a/fs/dax.c b/fs/dax.c
index f496854..c4cb69b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -406,7 +406,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
init_wait(&ewait.wait);
ewait.wait.func = wake_exceptional_entry_func;
- ewait.key.root = &mapping->page_tree;
+ ewait.key.mapping = mapping;
ewait.key.index = index;
for (;;) {
--