[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100428155558.GI15815@csn.ul.ie>
Date: Wed, 28 Apr 2010 16:55:58 +0100
From: Mel Gorman <mel@....ul.ie>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: Linux-MM <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
Minchan Kim <minchan.kim@...il.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Christoph Lameter <cl@...ux.com>,
Rik van Riel <riel@...hat.com>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/3] mm,migration: Prevent rmap_walk_[anon|ksm] seeing
the wrong VMA information
On Wed, Apr 28, 2010 at 05:35:25PM +0200, Andrea Arcangeli wrote:
> On Wed, Apr 28, 2010 at 10:15:55AM +0100, Mel Gorman wrote:
> > It became unconditional because I wasn't sure of the optimisation versus the
> > new anon_vma changes (doesn't matter, should have been safe). At the time
>
> Changeset 287d97ac032136724143cde8d5964b414d562ee3 is meant to explain
> the removal of the lock but I don't get it from the comments. Or at
> least I don't get from that comment why we can't resurrect the plain
> old deleted code that looked fine to me.
Frankly, I don't understand why it was safe to drop the lock either.
Maybe it was a mistake but I still haven't convinced myself I fully
understand the subtleties of the anon_vma changes.
> Like there is no reason to
> take the lock if start == vma->vm_start.
>
> > So, the VMA list does not appear to be messed up but there still needs
> > to be protection against modification of VMA details that are already on
> > the list. For that, the seq counter would have been enough and
> > lighter-weight than acquiring the anon_vma->lock every time in
> > vma_adjust().
> >
> > I'll drop this patch again as the execve race looks the most important.
>
> You mean you're dropping patch 2 too?
Temporarily at least until I figured out if execve was the only problem. The
locking in vma_adjust didn't look like the prime reason for the crash
but the lack of locking there is still very odd.
> I agree dropping patch 1 but
> to me the having to take all the anon_vma locks for every
> vma->anon_vma->lock that we walk seems a must, otherwise
> expand_downwards and vma_adjust won't be ok, plus we need to re-add
> the anon_vma lock to vma_adjust, it can't be safe to alter vm_pgoff
> and vm_start outside of the anon_vma->lock. Or I am mistaken?
>
No, you're not. If nothing else, vma_address can return the wrong value because
the VMAs vm_start and vm_pgoff were in the process of being updated but not
fully updated. It's hard to see how vma_address would return the wrong value
and miss a migration PTE as a result but it's a possibility. It's probably
a lot more important for transparent hugepage support.
> Patch 2 wouldn't help the swapops crash we reproduced because at that
> point the anon_vma of the stack is the local one, it's just after
> execve.
>
> vma_adjust and expand_downards would alter vm_pgoff and vm_start while
> taking only the vma->anon_vma->lock where the vma->anon_vma is the
> _local_ one of the vma.
True, although in the case of expand_downwards, it's highly unlikely that
there is also a migration PTE to be cleaned up. It's hard to see how a
migration PTE would be left behind in that case but it still looks wrong to
be updating the VMA fields without locking.
> But a vma in mainline can be indexed in
> infinite anon_vmas, so to prevent breaking migration
> vma_adjust/expand_downards the rmap_walk would need to take _all_
> anon_vma->locks for every anon_vma that the vma is indexed into.
I felt this would be too heavy in the common case which is why I made
rmap_walk() do the try-lock-or-back-off instead because rmap_walk is typically
in far less critical paths.
> Or
> alternatively like you implemented rmap_walk would need to check if
> the vma we found in the rmap_walk is different from the original
> anon_vma and to take the vma->anon_vma->lock (so taking the
> anon_vma->lock of the local anon_vma of every vma) before it can
> actually read the vma->vm_pgoff/vm_start inside vma_address.
>
To be absolutly sure, yes this is required. I don't think we've been hitting
this exact problem in these tests but it still is a better plan than adjusting
VMA details without locks.
> If the above is right it also means the new anon-vma changes also break
> the whole locking of transparent hugepage, see wait_split_huge_page,
> it does a spin_unlock_wait(&anon_vma->lock) thinking that waiting the
> "local" anon-vma is enough, when in fact the hugepage may be shared
> and belonging to the parent parent_vma->anon_vma and not to the local
> one of the last child that is waiting on the wrong lock. So I may have
> to rewrite this part of the thp locking to solve this. And for me it's
> not enough to just taking more locks inside the rmap walks inside
> split_huge_page as I used the anon_vma lock outside too.
>
No fun. That potentially could be a lot of locks to take to split the
page.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists