lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100508153922.GS5941@random.random>
Date:	Sat, 8 May 2010 17:39:22 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Mel Gorman <mel@....ul.ie>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Minchan Kim <minchan.kim@...il.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Christoph Lameter <cl@...ux.com>,
	Rik van Riel <riel@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH 1/2] mm,migration: Prevent rmap_walk_[anon|ksm] seeing
 the wrong VMA information

On Fri, May 07, 2010 at 12:20:52AM +0100, Mel Gorman wrote:
> @@ -1655,6 +1655,7 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
>  {
>  	struct stable_node *stable_node;
>  	struct hlist_node *hlist;
> +	struct anon_vma *nested_anon_vma = NULL;
>  	struct rmap_item *rmap_item;
>  	int ret = SWAP_AGAIN;
>  	int search_new_forks = 0;
> @@ -1671,9 +1672,16 @@ again:
>  		struct anon_vma_chain *vmac;
>  		struct vm_area_struct *vma;
>  
> -		spin_lock(&anon_vma->lock);
> +		anon_vma = anon_vma_lock_root(anon_vma);
> +		if (nested_anon_vma) {
> +			spin_unlock(&nested_anon_vma->lock);
> +			nested_anon_vma = NULL;
> +		}
>  		list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
>  			vma = vmac->vma;
> +			nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> +						vma->anon_vma, anon_vma);
> +
>  			if (rmap_item->address < vma->vm_start ||
>  			    rmap_item->address >= vma->vm_end)
>  				continue;
> @@ -1368,19 +1444,26 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
>  	 * are holding mmap_sem. Users without mmap_sem are required to
>  	 * take a reference count to prevent the anon_vma disappearing
>  	 */
> -	anon_vma = page_anon_vma(page);
> +	anon_vma = page_anon_vma_lock_root(page);
>  	if (!anon_vma)
>  		return ret;
> -	spin_lock(&anon_vma->lock);
>  	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>  		struct vm_area_struct *vma = avc->vma;
> -		unsigned long address = vma_address(page, vma);
> -		if (address == -EFAULT)
> -			continue;
> -		ret = rmap_one(page, vma, address, arg);
> +		unsigned long address;
> +
> +		nested_anon_vma = anon_vma_lock_nested(nested_anon_vma,
> +						vma->anon_vma, anon_vma);
> +		address = vma_address(page, vma);
> +		if (address != -EFAULT)
> +			ret = rmap_one(page, vma, address, arg);
> +
>  		if (ret != SWAP_AGAIN)
>  			break;
>  	}
> +
> +	if (nested_anon_vma)
> +		spin_unlock(&nested_anon_vma->lock);
> +
>  	spin_unlock(&anon_vma->lock);
>  	return ret;
>  }

I already told Mel by PM. This degrades the new-anon_vma code to an
even _slower_ mode than the old anon-vma code in 2.6.32 (the same in
math complexity terms but slower in practice) for migrate. Furthermore
page_referenced() may now return true even if there are young ptes
that simply get lost in the rmap walk.

The new anon-vma code is mostly relevant for migrate and memory
compaction and transparent hugepage support where it gets invoked even
if there's plenty of free memory and no I/O load at all. So whatever
you save during swap, you'll lose while transparent hugepage support
allocate the pages. So the above fix renders the whole effort
pointless as far as I'm concerned.

I think Rik's patch is the only sensible solution that won't
invalidate the whole effort for transparent hugepage.

About how to adapt split_huge_page to the root anon_vma I didn't even
think about it yet. All I can tell you right now is that
wait_split_huge_page can be changed to wait on the pmd_trans_splitting
(or alternatively the pmd_trans_huge bit) bit to go away in a
cpu_relax() barrier() loop. But the page->mapping/anon_vma->lock is
also used to serialize against parallel split_huge_page but I guess
taking the root anon_vma lock in split_huge_page() should work
fine. Just I'm not going to do that except maybe in a for-mainline
branch, but I'll keep master branch with the old-anon-vma 2.6.32 code
and the anon_vma_branch with Rik's fix that allows to take advantage
of the new anon-vma code (so it's not purely gratuitous complexity
added for nothing) also in migrate.c from memory compaction (that runs
24/7 on all my systems and it's much more frequent than the swap rmap
walks that in fact never ever happens here), and in the rmap walks in
split_huge_page too (which are not so frequent especially after
Johannes implemented native mprotect on hugepages but it's obviously
still more frequent than swap).

I'm simply not going to support the degradation to the root anon_vma
complexity in aa.git, except for strict merging requirements but I'll
keep backing it out in aa.git or I'd rather stick to old-anon-vma
code which at least is much simpler and saves memory too (as there are
many fewer anon-vma and no avc, and less useless locks).

What I instead already mentioned once was to add a _shared_ lock so
you share the spinlock across the whole forest but you keep walking
the right page->anon_vma->same_anon_vma! The moment you walk the
page->anon_vma->root_anon_vma->same_anon_vma you lost my support as it
makes the whole effort pointless compared to 2.6.32 as far as 99% of my
workloads are concerned.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ