Message-ID: <1271097638.4807.129.camel@twins>
Date: Mon, 12 Apr 2010 20:40:38 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Rik van Riel <riel@...hat.com>, Borislav Petkov <bp@...en8.de>,
Johannes Weiner <hannes@...xchg.org>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Minchan Kim <minchan.kim@...il.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Nick Piggin <npiggin@...e.de>,
Andrea Arcangeli <aarcange@...hat.com>,
Hugh Dickins <hugh.dickins@...cali.co.uk>,
sgunderson@...foot.com
Subject: Re: [PATCH -v2] rmap: make anon_vma_prepare link in all the
anon_vmas of a mergeable VMA
On Mon, 2010-04-12 at 09:46 -0700, Linus Torvalds wrote:
>
> On Mon, 12 Apr 2010, Rik van Riel wrote:
>
> > On 04/12/2010 12:01 PM, Peter Zijlstra wrote:
> >
> > > @@ -864,15 +889,8 @@ void page_remove_rmap(struct page *page)
> > >  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> > >  		mem_cgroup_update_file_mapped(page, -1);
> > >  	}
> > > -	/*
> > > -	 * It would be tidy to reset the PageAnon mapping here,
> > > -	 * but that might overwrite a racing page_add_anon_rmap
> > > -	 * which increments mapcount after us but sets mapping
> > > -	 * before us: so leave the reset to free_hot_cold_page,
> > > -	 * and remember that it's only reliable while mapped.
> > > -	 * Leaving it set also helps swapoff to reinstate ptes
> > > -	 * faster for those pages still in swapcache.
> > > -	 */
> > > +
> > > +	page->mapping = NULL;
> > >  }
> >
> > That would be a bug for file pages :)
> >
> > I could see how it could work for anonymous memory, though.
>
> I think it's scary for anonymous pages too. The _common_ case of
> page_remove_rmap() is from unmap/exit, which holds no locks on the page
> what-so-ever. So assuming the page could be reachable some other way (swap
> cache etc), I think the above is pretty scary.
Fully agreed.
> Also do note that the bug we've been chasing has _always_ had that test
> for "page_mapped(page)". See my other email about why the unmapped case
> isn't even interesting, because it's so easy to see how page->mapping can
> be stale for unmapped pages.
>
> It's the _mapped_ case that is interesting, not the unmapped one. So
> setting page->mapping to NULL when unmapping is perhaps a nice consistency
> issue ("never have stale pointers"), but it's missing the fact that it's
> not really the case we care about.
Yes, I don't think this is the problem that has been plaguing us for
over a week now.
But while staring at that code it did get me worried that the current
code (page_lock_anon_vma):

 - is missing the smp_read_barrier_depends() after the ACCESS_ONCE

 - isn't properly ordered wrt page->mapping and page->_mapcount

 - doesn't appear to guarantee much at all when returning an anon_vma,
   since it locks only after checking page->_mapcount, so:

    * it can return !NULL for an unmapped page (your patch cures that)

    * it can return !NULL but for a different anon_vma
      (my earlier patch checking page_rmapping() after the spin_lock
      cures that, but doesn't cure the above):
  [ highly unlikely but not impossible race ]

  page_referenced(page_A)   try_to_unmap(page_A)   unrelated fault     fault page_A

  CPU0                      CPU1                   CPU2                CPU3

  rcu_read_lock()
  anon_vma = page->mapping;
  if (!anon_vma & ANON_BIT)
          goto out
  if (!page_mapped(page))
          goto out
                            page_remove_rmap()
                            ...
                            anon_vma_free()--------\
                                                    v
                                                   anon_vma_alloc()
                                                                      anon_vma_alloc()
                                                                      page_add_anon_rmap()
                                                    ^
  spin_lock(anon_vma->lock)------------------------/
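The CPU0 column above is just the current page_lock_anon_vma(); I'm
quoting it from memory, so treat this as a sketch rather than a verbatim
copy of mm/rmap.c:

struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
        anon_mapping = (unsigned long)ACCESS_ONCE(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;
        if (!page_mapped(page))
                goto out;

        /*
         * Nothing re-validates page->mapping or page->_mapcount once
         * the lock is taken, which is what the race above exploits.
         */
        anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
        spin_lock(&anon_vma->lock);
        return anon_vma;
out:
        rcu_read_unlock();
        return NULL;
}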
Now I don't think the above can happen due to how our slab allocators
work; they won't share a slab page between CPUs like that. But once we
make the whole thing preemptible, this race becomes a lot more likely.
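(The reason the slab page stays valid at all is that anon_vma's come
from a SLAB_DESTROY_BY_RCU cache; from memory, anon_vma_init() in
mm/rmap.c does something like the below, so under rcu_read_lock() the
memory always remains _an_ anon_vma, just not necessarily _the_
anon_vma.)

void __init anon_vma_init(void)
{
        /*
         * SLAB_DESTROY_BY_RCU only delays freeing of the slab *page*
         * until a grace period has elapsed; individual objects can be
         * recycled into new anon_vmas right away, as in the race above.
         */
        anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
                        0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
}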
So a page_lock_anon_vma() that looks a little like the below should
(I think) cure all our problems with it.
struct anon_vma *page_lock_anon_vma(struct page *page)
{
        struct anon_vma *anon_vma;
        unsigned long anon_mapping;

        rcu_read_lock();
again:
        anon_mapping = (unsigned long)rcu_dereference(page->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;

        anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);

        /*
         * The RCU read lock ensures we can safely dereference anon_vma,
         * since it ensures the backing slab won't go away. It does not,
         * however, guarantee it's the right object.
         *
         * First take the anon_vma->lock; per anon_vma_unlink() this
         * prevents the anon_vma from being freed if it is a valid object.
         */
        spin_lock(&anon_vma->lock);

        /*
         * Secondly, we have to re-read page->mapping and ensure it
         * has not changed; rely on spin_lock() being at least a
         * compiler barrier to force the re-read.
         */
        if (unlikely(page_rmapping(page) != anon_vma)) {
                spin_unlock(&anon_vma->lock);
                goto again;
        }

        /*
         * Ensure we read page->mapping before page->_mapcount; this
         * orders against atomic_add_negative() in page_remove_rmap().
         */
        smp_rmb();

        /*
         * Finally check that the page is still mapped; if not, this
         * can't possibly be the right anon_vma.
         */
        if (!page_mapped(page))
                goto unlock;

        return anon_vma;

unlock:
        spin_unlock(&anon_vma->lock);
out:
        rcu_read_unlock();
        return NULL;
}
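Callers would keep pairing it with page_unlock_anon_vma(), which with
the above would stay something like (sketch):

void page_unlock_anon_vma(struct anon_vma *anon_vma)
{
        spin_unlock(&anon_vma->lock);
        rcu_read_unlock();
}

so e.g. page_referenced_anon() keeps its current shape:

        anon_vma = page_lock_anon_vma(page);
        if (!anon_vma)
                return referenced;
        /* ... walk the vma list under anon_vma->lock ... */
        page_unlock_anon_vma(anon_vma);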
With this, I think we can actually drop the RCU read lock when
returning: if this is indeed a valid anon_vma for this page, then the
page is still mapped, hence the anon_vma was not deleted, and a
possible future delete will be held back by us holding the
anon_vma->lock.
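That is, the success path could become (sketch; page_unlock_anon_vma()
would then do only the spin_unlock()):

        if (!page_mapped(page))
                goto unlock;

        /*
         * A mapped page pins its anon_vma, and anon_vma->lock holds
         * off anon_vma_unlink(), so we don't need RCU past this point.
         */
        rcu_read_unlock();
        return anon_vma;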
Now I could be totally wrong and have confused myself thoroughly, but
how does this look?