Message-Id: <20100129151423.8b71b88e.akpm@linux-foundation.org>
Date: Fri, 29 Jan 2010 15:14:23 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Rik van Riel <riel@...hat.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
lwoodman@...hat.com, Lee Schermerhorn <Lee.Schermerhorn@...com>,
aarcange@...hat.com
Subject: Re: [PATCH -mm] change anon_vma linking to fix multi-process server
scalability issue
On Thu, 28 Jan 2010 00:20:00 -0500
Rik van Riel <riel@...hat.com> wrote:
> The old anon_vma code can lead to scalability issues with heavily
> forking workloads. Specifically, each anon_vma will be shared
> between the parent process and all its child processes.
>
> In a workload with 1000 child processes and a VMA with 1000 anonymous
> pages per process that get COWed, this leads to a system with a million
> anonymous pages in the same anon_vma, each of which is mapped in just
> one of the 1000 processes. However, the current rmap code needs to
> walk them all, leading to O(N) scanning complexity for each page.
>
> This can result in systems where one CPU is walking the page tables
> of 1000 processes in page_referenced_one, while all other CPUs are
> stuck on the anon_vma lock. This leads to catastrophic failure for
> a benchmark like AIM7, where the total number of processes can reach
> into the tens of thousands. Real workloads are still a factor of 10 less
> process-intensive than AIM7, but they are catching up.
>
> This patch changes the way anon_vmas and VMAs are linked, which
> allows us to associate multiple anon_vmas with a VMA. At fork
> time, each child process gets its own anon_vmas, in which its
> COWed pages will be instantiated. The parents' anon_vma is also
> linked to the VMA, because non-COWed pages could be present in
> any of the children.
>
> This reduces rmap scanning complexity to O(1) for the pages of
> the 1000 child processes, with O(N) complexity for at most 1/N
> pages in the system. This reduces the average scanning cost in
> heavily forking workloads from O(N) to 2.
>
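(Working through the arithmetic in that last sentence: if at most a 1/N
fraction of the pages needs an O(N) walk and the other (N-1)/N fraction
needs only O(1), the average cost per page is roughly
(1/N)*N + ((N-1)/N)*1 ~= 2.)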
> The only real complexity in this patch stems from the fact that
> linking a VMA to anon_vmas now involves memory allocations. This
> means vma_adjust can fail if it needs to attach a VMA to anon_vma
> structures. This in turn means error handling needs to be added
> to the calling functions.
>
> A second source of complexity is that, because there can be
> multiple anon_vmas, the anon_vma linking in vma_adjust can
> no longer be done under "the" anon_vma lock. To prevent the
> rmap code from walking up an incomplete VMA, this patch
> introduces the VM_LOCK_RMAP VMA flag. This bit flag uses
> the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef
> in mm.h to make sure it is impossible to compile a kernel
> that needs both symbolic values for the same bitflag.
>
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -96,7 +96,11 @@ extern unsigned int kobjsize(const void *objp);
> #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
> #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
> #define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
> +#ifdef CONFIG_MMU
> +#define VM_LOCK_RMAP 0x01000000 /* Do not follow this rmap (mmu mmap) */
> +#else
> #define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */
> +#endif
What's the locking for vm_area_struct.vm_flags? It's somewhat
unobvious what that is, and whether the code here observes it.
> #define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */
> #define VM_ALWAYSDUMP 0x04000000 /* Always include in core dumps */
>
> @@ -1108,7 +1112,7 @@ static inline void vma_nonlinear_insert(struct vm_area_struct *vma,
>
> /* mmap.c */
> extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
> -extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
> +extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
> extern struct vm_area_struct *vma_merge(struct mm_struct *,
> struct vm_area_struct *prev, unsigned long addr, unsigned long end,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 84a524a..2a06fe1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -167,7 +167,7 @@ struct vm_area_struct {
> * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
> * or brk vma (with NULL file) can only be in an anon_vma list.
> */
> - struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
> + struct list_head anon_vma_chain; /* Serialized by mmap_sem & friends */
Can we be a bit more explicit than "and friends"?
> struct anon_vma *anon_vma; /* Serialized by page_table_lock */
>
> /* Function pointers to deal with this struct. */
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b019ae6..c0a6056 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -37,7 +37,27 @@ struct anon_vma {
> * is serialized by a system wide lock only visible to
> * mm_take_all_locks() (mm_all_locks_mutex).
> */
> - struct list_head head; /* List of private "related" vmas */
> + struct list_head head; /* Chain of private "related" vmas */
> +};
> +
> +/*
> + * The copy-on-write semantics of fork mean that an anon_vma
> + * can become associated with multiple processes. Furthermore,
> + * each child process will have its own anon_vma, where new
> + * pages for that process are instantiated.
> + *
> + * This structure allows us to find the anon_vmas associated
> + * with a VMA, or the VMAs associated with an anon_vma.
> + * The "same_vma" list contains the anon_vma_chains linking
> + * all the anon_vmas associated with this VMA.
> + * The "same_anon_vma" list contains the anon_vma_chains
> + * which link all the VMAs associated with this anon_vma.
> + */
> +struct anon_vma_chain {
> + struct vm_area_struct *vma;
> + struct anon_vma *anon_vma;
> + struct list_head same_vma; /* locked by mmap_sem & friends */
"friends"?
> + struct list_head same_anon_vma; /* locked by anon_vma->lock */
> };
>
> #ifdef CONFIG_MMU
> @@ -89,12 +109,19 @@ static inline void anon_vma_unlock(struct vm_area_struct *vma)
> */
> void anon_vma_init(void); /* create anon_vma_cachep */
> int anon_vma_prepare(struct vm_area_struct *);
> -void __anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
> -void anon_vma_unlink(struct vm_area_struct *);
> -void anon_vma_link(struct vm_area_struct *);
> +void unlink_anon_vmas(struct vm_area_struct *);
> +int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
> +int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
> void __anon_vma_link(struct vm_area_struct *);
> void anon_vma_free(struct anon_vma *);
>
>
> ...
>
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -328,15 +328,17 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> if (!tmp)
> goto fail_nomem;
> *tmp = *mpnt;
> + INIT_LIST_HEAD(&tmp->anon_vma_chain);
> pol = mpol_dup(vma_policy(mpnt));
> retval = PTR_ERR(pol);
> if (IS_ERR(pol))
> goto fail_nomem_policy;
> vma_set_policy(tmp, pol);
> + if (anon_vma_fork(tmp, mpnt))
> + goto fail_nomem_anon_vma_fork;
Didn't we just leak `pol'?
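Presumably the new failure label wants to drop the policy reference before
falling through to the existing cleanup - an untested sketch:

fail_nomem_anon_vma_fork:
	mpol_put(pol);
fail_nomem_policy:
	kmem_cache_free(vm_area_cachep, tmp);
fail_nomem:
	...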
> tmp->vm_flags &= ~VM_LOCKED;
> tmp->vm_mm = mm;
> tmp->vm_next = NULL;
> - anon_vma_link(tmp);
> file = tmp->vm_file;
> if (file) {
> struct inode *inode = file->f_path.dentry->d_inode;
>
> ...
>
> @@ -542,6 +541,29 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + /*
> + * When changing only vma->vm_end, we don't really need
> + * anon_vma lock.
> + */
> + if (vma->anon_vma && (insert || importer || start != vma->vm_start))
> + anon_vma = vma->anon_vma;
> + if (anon_vma) {
> + /*
> + * Easily overlooked: when mprotect shifts the boundary,
> + * make sure the expanding vma has anon_vma set if the
> + * shrinking vma had, to cover any anon pages imported.
> + */
> + if (importer && !importer->anon_vma) {
> + /* Block reverse map lookups until things are set up. */
Here, it'd be nice to explain the _reason_ for doing this.
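Maybe something along the lines of the changelog text, e.g.:

	/*
	 * The importer's anon_vma linkage is about to be set up without
	 * holding "the" anon_vma lock (there may be several anon_vmas
	 * involved), so tell the rmap code (vma_address) to skip this
	 * vma until vma_adjust() has finished with it.
	 */

(wording is just a suggestion, lifted from the changelog above).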
> + importer->vm_flags |= VM_LOCK_RMAP;
> + if (anon_vma_clone(importer, vma)) {
> + importer->vm_flags &= ~VM_LOCK_RMAP;
> + return -ENOMEM;
> + }
> + importer->anon_vma = anon_vma;
> + }
> + }
> +
> if (file) {
> mapping = file->f_mapping;
> if (!(vma->vm_flags & VM_NONLINEAR))
> @@ -567,25 +589,6 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> - /*
> - * When changing only vma->vm_end, we don't really need
> - * anon_vma lock.
> - */
> - if (vma->anon_vma && (insert || importer || start != vma->vm_start))
> - anon_vma = vma->anon_vma;
> - if (anon_vma) {
> - spin_lock(&anon_vma->lock);
> - /*
> - * Easily overlooked: when mprotect shifts the boundary,
> - * make sure the expanding vma has anon_vma set if the
> - * shrinking vma had, to cover any anon pages imported.
> - */
> - if (importer && !importer->anon_vma) {
> - importer->anon_vma = anon_vma;
> - __anon_vma_link(importer);
> - }
> - }
> -
> if (root) {
> flush_dcache_mmap_lock(mapping);
> vma_prio_tree_remove(vma, root);
> @@ -616,8 +619,11 @@ again: remove_next = 1 + (end > next->vm_end);
> __vma_unlink(mm, next, vma);
> if (file)
> __remove_shared_vm_struct(next, file, mapping);
> - if (next->anon_vma)
> - __anon_vma_merge(vma, next);
> + /*
> + * This VMA is now dead, no need for rmap to follow it.
> + * Call anon_vma_merge below, outside of i_mmap_lock.
> + */
> + next->vm_flags |= VM_LOCK_RMAP;
> } else if (insert) {
> /*
> * split_vma has split insert from vma, and needs
> @@ -627,17 +633,25 @@ again: remove_next = 1 + (end > next->vm_end);
> __insert_vm_struct(mm, insert);
> }
>
> - if (anon_vma)
> - spin_unlock(&anon_vma->lock);
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
>
> + /*
> + * The current VMA has been set up. It is now safe for the
> + * rmap code to get from the pages to the ptes.
> + */
> + if (anon_vma && importer)
> + importer->vm_flags &= ~VM_LOCK_RMAP;
This clears a flag which this function might not have set - the
(importer->anon_vma != NULL) case.
Seems a bit messy at least. Perhaps a local i_need_to_clear_the_bit
bool would be more robust.
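Something like this (variable name invented, sketch only):

	bool rmap_locked = false;
	...
	if (importer && !importer->anon_vma) {
		importer->vm_flags |= VM_LOCK_RMAP;
		rmap_locked = true;
		if (anon_vma_clone(importer, vma)) {
			importer->vm_flags &= ~VM_LOCK_RMAP;
			return -ENOMEM;
		}
		importer->anon_vma = anon_vma;
	}
	...
	/* Only clear the bit if this call actually set it. */
	if (rmap_locked)
		importer->vm_flags &= ~VM_LOCK_RMAP;

so that the (importer->anon_vma != NULL) case never sees the flag touched.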
> if (remove_next) {
> if (file) {
> fput(file);
> if (next->vm_flags & VM_EXECUTABLE)
> removed_exe_file_vma(mm);
> }
> + /* Protected by mmap_sem and VM_LOCK_RMAP. */
> + if (next->anon_vma)
> + anon_vma_merge(vma, next);
> mm->map_count--;
> mpol_put(vma_policy(next));
> kmem_cache_free(vm_area_cachep, next);
>
> ...
>
> @@ -1868,11 +1893,14 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
>
> pol = mpol_dup(vma_policy(vma));
> if (IS_ERR(pol)) {
> - kmem_cache_free(vm_area_cachep, new);
> - return PTR_ERR(pol);
> + err = PTR_ERR(pol);
> + goto out_free_vma;
> }
> vma_set_policy(new, pol);
>
> + if (anon_vma_clone(new, vma))
> + goto out_free_mpol;
The handling of `err' in this function is a bit tricksy. It's correct,
but not obviously so, and we might break it in the future. One way to
address that would be to sprinkle `err = -ENOMEM' everywhere, but that
wouldn't be very nice. Ho hum.
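i.e. something like

	err = -ENOMEM;
	if (anon_vma_clone(new, vma))
		goto out_free_mpol;

at each failure site.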
> if (new->vm_file) {
> get_file(new->vm_file);
> if (vma->vm_flags & VM_EXECUTABLE)
> @@ -1883,12 +1911,28 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
> new->vm_ops->open(new);
>
> if (new_below)
> - vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
> + err = vma_adjust(vma, addr, vma->vm_end, vma->vm_pgoff +
> ((addr - new->vm_start) >> PAGE_SHIFT), new);
> else
> - vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
> + err = vma_adjust(vma, vma->vm_start, addr, vma->vm_pgoff, new);
>
> - return 0;
> + /* Success. */
> + if (!err)
> + return 0;
> +
> + /* Clean everything up if vma_adjust failed. */
> + new->vm_ops->close(new);
> + if (new->vm_file) {
> + if (vma->vm_flags & VM_EXECUTABLE)
> + removed_exe_file_vma(mm);
> + fput(new->vm_file);
> + }
Did the above path get tested?
> + out_free_mpol:
> + mpol_put(pol);
> + out_free_vma:
> + kmem_cache_free(vm_area_cachep, new);
> + out_err:
> + return err;
> }
>
> /*
>
> ...
>
> +int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> {
> - struct anon_vma *anon_vma = vma->anon_vma;
> + struct anon_vma_chain *avc;
> + struct anon_vma *anon_vma;
>
> - if (anon_vma) {
> - spin_lock(&anon_vma->lock);
> - list_add_tail(&vma->anon_vma_node, &anon_vma->head);
> - spin_unlock(&anon_vma->lock);
> - }
> + /* Don't bother if the parent process has no anon_vma here. */
> + if (!pvma->anon_vma)
> + return 0;
> +
> + /*
> + * First, attach the new VMA to the parent VMA's anon_vmas,
> + * so rmap can find non-COWed pages in child processes.
> + */
> + if (anon_vma_clone(vma, pvma))
> + return -ENOMEM;
> +
> + /* Then add our own anon_vma. */
> + anon_vma = anon_vma_alloc();
> + if (!anon_vma)
> + goto out_error;
> + avc = anon_vma_chain_alloc();
> + if (!avc)
> + goto out_error_free_anon_vma;
The error paths here don't undo the results of anon_vma_clone(). I
guess all those anon_vma_chains get unlinked and freed later on, at
vma free/exit() time?
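If they don't, the error path here presumably needs to undo the clone
itself, along the lines of (untested):

 out_error_free_anon_vma:
	anon_vma_free(anon_vma);
 out_error:
	unlink_anon_vmas(vma);
	return -ENOMEM;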
> + anon_vma_chain_link(vma, avc, anon_vma);
> + /* Mark this anon_vma as the one where our new (COWed) pages go. */
> + vma->anon_vma = anon_vma;
> +
> + return 0;
> +
> + out_error_free_anon_vma:
> + anon_vma_free(anon_vma);
> + out_error:
> + return -ENOMEM;
> }
>
>
> ...
>
> @@ -192,6 +280,9 @@ void __init anon_vma_init(void)
> {
> anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
> 0, SLAB_DESTROY_BY_RCU|SLAB_PANIC, anon_vma_ctor);
> + anon_vma_chain_cachep = kmem_cache_create("anon_vma_chain",
> + sizeof(struct anon_vma_chain), 0,
> + SLAB_PANIC, NULL);
Could use KMEM_CACHE() here.
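i.e.

	anon_vma_chain_cachep = KMEM_CACHE(anon_vma_chain, SLAB_PANIC);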
> }
>
> /*
>
> ...
>
Trivial touchups:
fs/exec.c | 2 +-
mm/memory-failure.c | 5 +++--
mm/mmap.c | 3 +--
mm/rmap.c | 9 +++++----
4 files changed, 10 insertions(+), 9 deletions(-)
diff -puN fs/exec.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix fs/exec.c
--- a/fs/exec.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix
+++ a/fs/exec.c
@@ -549,7 +549,7 @@ static int shift_arg_pages(struct vm_are
tlb_finish_mmu(tlb, new_end, old_end);
/*
- * shrink the vma to just the new range. always succeeds.
+ * Shrink the vma to just the new range. Always succeeds.
*/
vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
diff -puN mm/memory-failure.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix mm/memory-failure.c
--- a/mm/memory-failure.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix
+++ a/mm/memory-failure.c
@@ -375,7 +375,6 @@ static void collect_procs_anon(struct pa
struct to_kill **tkc)
{
struct vm_area_struct *vma;
- struct anon_vma_chain *vmac;
struct task_struct *tsk;
struct anon_vma *av;
@@ -384,9 +383,11 @@ static void collect_procs_anon(struct pa
if (av == NULL) /* Not actually mapped anymore */
goto out;
for_each_process (tsk) {
+ struct anon_vma_chain *vmac;
+
if (!task_early_kill(tsk))
continue;
- list_for_each_entry (vmac, &av->head, same_anon_vma) {
+ list_for_each_entry(vmac, &av->head, same_anon_vma) {
vma = vmac->vma;
if (!page_mapped_in_vma(page, vma))
continue;
diff -puN mm/mmap.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix mm/mmap.c
--- a/mm/mmap.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix
+++ a/mm/mmap.c
@@ -542,8 +542,7 @@ again: remove_next = 1 + (end > next->
}
/*
- * When changing only vma->vm_end, we don't really need
- * anon_vma lock.
+ * When changing only vma->vm_end, we don't really need anon_vma lock.
*/
if (vma->anon_vma && (insert || importer || start != vma->vm_start))
anon_vma = vma->anon_vma;
diff -puN mm/rmap.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix mm/rmap.c
--- a/mm/rmap.c~mm-change-anon_vma-linking-to-fix-multi-process-server-scalability-issue-fix
+++ a/mm/rmap.c
@@ -331,17 +331,18 @@ vma_address(struct page *page, struct vm
/* page should be within @vma mapping range */
return -EFAULT;
}
- if (unlikely(vma->vm_flags & VM_LOCK_RMAP))
+ if (unlikely(vma->vm_flags & VM_LOCK_RMAP)) {
/*
- * This VMA is being unlinked or not yet linked into the
+ * This VMA is being unlinked or is not yet linked into the
* VMA tree. Do not try to follow this rmap. This race
- * condition can result in page_referenced ignoring a
- * reference or try_to_unmap failing to unmap a page.
+ * condition can result in page_referenced() ignoring a
+ * reference or in try_to_unmap() failing to unmap a page.
* The VMA cannot be freed under us because we hold the
* anon_vma->lock, which the munmap code takes while
* unlinking the anon_vmas from the VMA.
*/
return -EFAULT;
+ }
return address;
}
_