Message-ID: <1271097638.4807.129.camel@twins>
Date: Mon, 12 Apr 2010 20:40:38 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Rik van Riel <riel@...hat.com>, Borislav Petkov <bp@...en8.de>,
Johannes Weiner <hannes@...xchg.org>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Minchan Kim <minchan.kim@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Nick Piggin <npiggin@...e.de>,
Andrea Arcangeli <aarcange@...hat.com>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
sgunderson@...foot.com
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the
anon_vmas of a mergeable VMA
On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Apr 2010, Rik van Riel wrote:
>
> > On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> >
> > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > >  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> > >  		mem_cgroup_update_file_mapped(page, -1);
> > >  	}
> > > -	/*
> > > -	 * It would be tidy to reset the PageAnon mapping here,
> > > -	 * but that might overwrite a racing page_add_anon_rmap
> > > -	 * which increments mapcount after us but sets mapping
> > > -	 * before us: so leave the reset to free_hot_cold_page,
> > > -	 * and remember that it's only reliable while mapped.
> > > -	 * Leaving it set also helps swapoff to reinstate ptes
> > > -	 * faster for those pages still in swapcache.
> > > -	 */
> > > +
> > > +	page->mapping = NULL;
> > >  }
> >
> > That would be a bug for file pages :)
> >
> > I could see how it could work for anonymous memory, though.
>
> I think it's scary for anonymous pages too. The _common_ case of
> page_remove_rmap() is from unmap/exit, which holds no locks on the page
> what-so-ever. So assuming the page could be reachable some other way (swap
> cache etc), I think the above is pretty scary.
Fully agreed.
> Also do note that the bug we've been chasing has _always_ had that test
> for "page_mapped(page)". See my other email about why the unmapped case
> isn't even interesting, because it's so easy to see how page->mapping can
> be stale for unmapped pages.
>
> It's the _mapped_ case that is interesting, not the unmapped one. So
> setting page->mapping to NULL when unmapping is perhaps a nice consistency
> issue ("never have stale pointers"), but it's missing the fact that it's
> not really the case we care about.
Yes, I don't think this is the problem that has been plaguing us for
over a week now.
But while staring at that code it did get me worried that the current
code (page_lock_anon_vma):

 - is missing the smp_read_barrier_depends() after the ACCESS_ONCE

 - isn't properly ordered wrt page->mapping and page->_mapcount

 - doesn't appear to guarantee much at all when returning an anon_vma,
   since it locks only after checking page->_mapcount, so:

    * it can return !NULL for an unmapped page (your patch cures that)

    * it can return !NULL but for a different anon_vma
      (my earlier patch checking page_rmapping() after the spin_lock
      cures that, but doesn't cure the above):
  [ highly unlikely but not impossible race ]

  page_referenced(page_A)   try_to_unmap(page_A)   unrelated fault     fault page_A

  CPU0                      CPU1                   CPU2                CPU3

  rcu_read_lock()
  anon_vma = page->mapping;
  if (!anon_vma & ANON_BIT)
          goto out
  if (!page_mapped(page))
          goto out
                            page_remove_rmap()
                            ...
                            anon_vma_free()--------\
                                                    v
                                                   anon_vma_alloc()
                                                                      anon_vma_alloc()
                                                                      page_add_anon_rmap()
                                                    ^
  spin_lock(anon_vma->lock)------------------------/
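The CPU0 column above is just the current page_lock_anon_vma(); I'm
quoting it from memory, so treat this as a sketch rather than a verbatim
copy of mm/rmap.c:

struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        /*
         * Nothing re-validates page->mapping or page->_mapcount once
         * the lock is taken, which is what the race above exploits.
         */
        anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
        spin_lock(&anon_vma->lock);
        return anon_vma;
out:
        rcu_read_unlock();
        return NULL;
}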
Now I don't think the above can happen due to how our slab allocators
work; they won't share a slab page between CPUs like that. But once we
make the whole thing preemptible, this race becomes a lot more likely.
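(The reason the slab page stays valid at all is that anon_vma's come
from a SLAB_DESTROY_BY_RCU cache; from memory, anon_vma_init() in
mm/rmap.c does something like the below, so under rcu_read_lock() the
memory always remains _an_ anon_vma, just not necessarily _the_
anon_vma.)

void __init anon_vma_init(void)
{
        /*
         * SLAB_DESTROY_BY_RCU only delays freeing of the slab *page*
         * until a grace period has elapsed; individual objects can be
         * recycled into new anon_vmas right away, as in the race above.
         */
        anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
                        0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
}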
So a page_lock_anon_vma() that looks a little like the below should
(I think) cure all our problems with it.
struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
again:
        anon_mapping = (unsigned long)rcu_dereference(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;

        anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

        /*
         * The RCU read lock ensures we can safely dereference anon_vma,
         * since it ensures the backing slab won't go away. It does not,
         * however, guarantee it's the right object.
         *
         * First take the anon_vma->lock; per anon_vma_unlink() this
         * prevents the anon_vma from being freed if it is a valid object.
         */
        spin_lock(&anon_vma->lock);

        /*
         * Secondly, we have to re-read page->mapping and ensure it
         * has not changed; rely on spin_lock() being at least a
         * compiler barrier to force the re-read.
         */
        if (unlikely(page_rmapping(page) != anon_vma)) {
                spin_unlock(&anon_vma->lock);
                goto again;
        }

        /*
         * Ensure we read page->mapping before page->_mapcount; this
         * orders against atomic_add_negative() in page_remove_rmap().
         */
        smp_rmb();

        /*
         * Finally check that the page is still mapped; if not, this
         * can't possibly be the right anon_vma.
         */
        if (!page_mapped(page))
                goto unlock;

        return anon_vma;

unlock:
        spin_unlock(&anon_vma->lock);
out:
        rcu_read_unlock();
        return NULL;
}
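Callers would keep pairing it with page_unlock_anon_vma(), which with
the above would stay something like (sketch):

void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
        spin_unlock(&anon_vma->lock);
        rcu_read_unlock();
}

so e.g. page_referenced_anon() keeps its current shape:

        anon_vma = page_lock_anon_vma(page);
        if (!anon_vma)
                return referenced;
        /* ... walk the vma list under anon_vma->lock ... */
        page_unlock_anon_vma(anon_vma);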
With this, I think we can actually drop the RCU read lock when
returning: if this is indeed a valid anon_vma for this page, then the
page is still mapped, hence the anon_vma was not deleted, and a
possible future delete will be held back by us holding the
anon_vma->lock.
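That is, the success path could become (sketch; page_unlock_anon_vma()
would then do only the spin_unlock()):

        if (!page_mapped(page))
                goto unlock;

        /*
         * A mapped page pins its anon_vma, and anon_vma->lock holds
         * off anon_vma_unlink(), so we don't need RCU past this point.
         */
        rcu_read_unlock();
        return anon_vma;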
Now I could be totally wrong and have confused myself thoroughly, but
how does this look?