Message-ID: <20100428162305.GX510@random.random>
Date: Wed, 28 Apr 2010 18:23:05 +0200
From: Andrea Arcangeli <aarcange@...hat.com>
To: Mel Gorman <mel@....ul.ie>
Cc: Linux-MM <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
Minchan Kim <minchan.kim@...il.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Christoph Lameter <cl@...ux.com>,
Rik van Riel <riel@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing
the wrong VMA information
On Wed, Apr 28, 2010 at 04:55:58PM +0100, Mel Gorman wrote:
> Frankly, I don't understand why it was safe to drop the lock either.
> Maybe it was a mistake but I still haven't convinced myself I fully
> understand the subtleties of the anon_vma changes.
I understand the design but I'm also unsure about the details. It's
just that this lock gets split: when you update vm_start/vm_pgoff
while holding only the vma->anon_vma->lock, the vma may be queued in
multiple other anon_vmas, so you're only serializing (and only safe)
for the pages with page_mapcount 1 that point to the local anon_vma
== vma->anon_vma, not for any other shared page.
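For reference, this is roughly the structure the new anon_vma code
uses to queue one vma into multiple anon_vmas (quoted from memory
from include/linux/rmap.h, so take it as a sketch):

	/*
	 * Each link between a vma and an anon_vma is an anon_vma_chain,
	 * so the same vma is reachable both from its own "local"
	 * anon_vma and from the anon_vmas inherited from its parents.
	 * Taking only vma->anon_vma->lock therefore serializes against
	 * rmap walks through the local anon_vma only.
	 */
	struct anon_vma_chain {
		struct vm_area_struct *vma;
		struct anon_vma *anon_vma;
		struct list_head same_vma;	/* locked by mmap_sem & page_table_lock */
		struct list_head same_anon_vma;	/* locked by anon_vma->lock */
	};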
The removal of the vma->anon_vma->lock from vma_adjust also seems an
unrelated mistake to me, but I don't know for sure why yet.
Basically vma_adjust needs the anon_vma lock like expand_downwards
has. And after you fix vma_adjust to be as safe as expand_downwards,
you still have to take care of the rmap_walk that may run on a
page->mapping = anon_vma that isn't the vma->anon_vma: you're not
taking that anon_vma->lock of the shared page when you change the
vma's vm_pgoff/vm_start. If rmap_walk fails to find a pte because of
that, migrate will crash.
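To make the failure mode concrete, this is roughly what vma_address()
in mm/rmap.c computes when the rmap walk goes looking for the
migration pte (quoted from memory, so take it as a sketch):

	static inline unsigned long
	vma_address(struct page *page, struct vm_area_struct *vma)
	{
		pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
		unsigned long address;

		/*
		 * If vm_start/vm_pgoff are read while vma_adjust or
		 * expand_downwards is updating them without the right
		 * anon_vma->lock held, this computes the wrong address
		 * (or fails the range check below) and the migration
		 * pte in that mapping is never found and removed.
		 */
		address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (unlikely(address < vma->vm_start || address >= vma->vm_end))
			return -EFAULT;
		return address;
	}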
> Temporarily at least until I figured out if execve was the only problem. The
For aa.git it's surely enough. And as long as you only see the
migrate crash in execve, it's also surely enough.
> locking in vma_adjust didn't look like the prime reason for the crash
> but the lack of locking there is still very odd.
And I think it needs fixing to be safe.
>
> > I agree dropping patch 1 but
> > to me having to take all the anon_vma locks for every
> > vma->anon_vma->lock that we walk seems a must, otherwise
> > expand_downwards and vma_adjust won't be ok, plus we need to re-add
> > the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> > and vm_start outside of the anon_vma->lock. Or I am mistaken?
> >
>
> No, you're not. If nothing else, vma_address can return the wrong value because
> the VMA's vm_start and vm_pgoff were in the process of being updated but not
> fully updated. It's hard to see how vma_address would return the wrong value
> and miss a migration PTE as a result but it's a possibility. It's probably
> a lot more important for transparent hugepage support.
For the rmap_walk itself, migrate and split_huge_page are identical;
the main problem for transparent hugepage support is that I use the
anon_vma->lock in a wider way, taking it well before the rmap_walk,
so I'm screwed in a worse way than migrate is.
So I may have to change from the anon_vma->lock to the compound_lock
in wait_split_huge_page(). But I'll still need the restarting loop
over anon_vma locks inside the two rmap_walks run by split_huge_page.
The problem is, I would have preferred to do this locking change
later as a pure optimization rather than as a requirement for merging
and running stable, as it'll make things slightly more complex.
BTW, if we were to share the lock across all anon_vmas as I mentioned
in the previous email, and just reduce the chain length, it'd
completely solve all the issues for rmap_walk in migrate and for THP
too.
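A minimal sketch of that shared-lock idea, with hypothetical field
and helper names (nothing below exists in the current code, it only
illustrates the shape): every anon_vma reachable from a vma would
point at one shared lock, so rmap_walk and vma_adjust/expand_downwards
always contend on the same spinlock no matter which anon_vma
page->mapping points to.

	/* Hypothetical: "root" is the shared ancestor of the whole chain. */
	struct anon_vma {
		struct anon_vma *root;	/* only root->lock is ever taken */
		spinlock_t lock;
		struct list_head head;	/* chain of anon_vma_chains */
	};

	static inline void anon_vma_lock_shared(struct anon_vma *anon_vma)
	{
		spin_lock(&anon_vma->root->lock);
	}

	static inline void anon_vma_unlock_shared(struct anon_vma *anon_vma)
	{
		spin_unlock(&anon_vma->root->lock);
	}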
> > Patch 2 wouldn't help the swapops crash we reproduced because at that
> > point the anon_vma of the stack is the local one, it's just after
> > execve.
> >
> > vma_adjust and expand_downwards would alter vm_pgoff and vm_start while
> > taking only the vma->anon_vma->lock where the vma->anon_vma is the
> > _local_ one of the vma.
>
> True, although in the case of expand_downwards, it's highly unlikely that
> there is also a migration PTE to be cleaned up. It's hard to see how a
> migration PTE would be left behind in that case but it still looks wrong to
> be updating the VMA fields without locking.
Every time we fail to find the pte, it can also mean try_to_unmap
just failed to instantiate the migration pte, leading to random
memory corruption in migrate. If a task forks and some of the stack
pages at the bottom of the stack are shared, but the top of the stack
isn't shared (so the vma->anon_vma->lock only protects the top and
not the bottom), migrate should be able to silently corrupt random
memory right now because of this.
> > But a vma in mainline can be indexed in
> > infinite anon_vmas, so to prevent breaking migration
> > vma_adjust/expand_downwards, the rmap_walk would need to take _all_
> > anon_vma->locks for every anon_vma that the vma is indexed into.
>
> I felt this would be too heavy in the common case which is why I made
> rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
> in far less critical paths.
If we could take all the locks, it'd make life easier as it'd already
implement the "shared lock" without actually sharing it. It won't
provide much runtime benefit though (rmap_walk will be even slower
than with a real shared lock, and vma_adjust/expand_downwards will be
slightly faster).
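A rough sketch of what taking all the locks would look like around
the vm_start/vm_pgoff update in vma_adjust/expand_downwards
(illustrative only; new_start/new_pgoff are placeholders and the lock
ordering/lockdep problem is exactly the part that still needs care):

	struct anon_vma_chain *avc;

	/*
	 * Lock every anon_vma this vma is queued into, so no rmap_walk
	 * through any of them can read vm_start/vm_pgoff while we
	 * update them.  A real patch needs a stable locking order (or
	 * a trylock-and-restart scheme) to avoid deadlocking against
	 * another task walking the same chain.
	 */
	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
		spin_lock(&avc->anon_vma->lock);

	vma->vm_start = new_start;
	vma->vm_pgoff = new_pgoff;

	list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
		spin_unlock(&avc->anon_vma->lock);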
> > Or
> > alternatively like you implemented rmap_walk would need to check if
> > the vma we found in the rmap_walk is different from the original
> > anon_vma and to take the vma->anon_vma->lock (so taking the
> > anon_vma->lock of the local anon_vma of every vma) before it can
> > actually read the vma->vm_pgoff/vm_start inside vma_address.
> >
>
> To be absolutely sure, yes this is required. I don't think we've been hitting
> this exact problem in these tests but it still is a better plan than adjusting
> VMA details without locks.
We've not been hitting it unless somebody crashed with random
corruption.
> > If the above is right it also means the new anon-vma changes also break
> > the whole locking of transparent hugepage, see wait_split_huge_page,
> > it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> > "local" anon-vma is enough, when in fact the hugepage may be shared
> > and belonging to the parent parent_vma->anon_vma and not to the local
> > one of the last child that is waiting on the wrong lock. So I may have
> > to rewrite this part of the thp locking to solve this. And for me it's
> > not enough to just taking more locks inside the rmap walks inside
> > split_huge_page as I used the anon_vma lock outside too.
> >
>
> No fun. That potentially could be a lot of locks to take to split the
> page.
compound_lock should be able to replace it in a more granular way,
but this isn't exactly the time I was looking to apply scalability
optimizations to THP, as it may introduce other issues I can't
foresee right now.
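One possible shape for that, with a hypothetical signature (the real
wait_split_huge_page() takes the anon_vma and the pmd; this only
illustrates serializing on the page via the compound_lock from the
THP patches instead of on the possibly-wrong local anon_vma->lock):

	/*
	 * Hypothetical sketch: if split_huge_page() held
	 * compound_lock(page) for the whole split, a waiter could
	 * serialize on the page itself instead of on the local
	 * vma->anon_vma->lock, which may not be the anon_vma the
	 * shared hugepage actually belongs to.
	 */
	static inline void wait_split_huge_page_on_page(struct page *page)
	{
		compound_lock(page);	/* waits for any split in progress */
		compound_unlock(page);
	}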