linux-kernel - Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate

Message-ID: <1352913882.7130.131.camel@oc2024037011.ibm.com>
Date:	Wed, 14 Nov 2012 11:24:42 -0600
From:	Andrew Theurer <habanero@...ux.vnet.ibm.com>
To:	Mel Gorman <mgorman@...e.de>
Cc:	a.p.zijlstra@...llo.nl, riel@...hat.com, aarcange@...hat.com,
	lee.schermerhorn@...com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 18/31] mm: sched: numa: Implement constant, per task
 Working Set Sampling (WSS) rate


> From: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> 
> Note: The scan period is much larger than it was in the original patch.
> 	The reason is that the system CPU usage went through the roof
> 	with a sample period of 500ms but it was unsuitable to have a
> 	situation where a large process could stall for excessively long
> 	updating pte_numa. This may need to be tuned again if a placement
> 	policy converges too slowly.
> 
> Previously, to probe the working set of a task, we'd use
> a very simple and crude method: mark all of its address
> space PROT_NONE.
> 
> That method has various (obvious) disadvantages:
> 
>  - it samples the working set at dissimilar rates,
>    giving some tasks a sampling quality advantage
>    over others.
> 
>  - creates performance problems for tasks with very
>    large working sets
> 
>  - over-samples processes with large address spaces but
>    which only very rarely execute
> 
> Improve that method by keeping a rotating offset into the
> address space that marks the current position of the scan,
> and advance it at a constant rate (in a manner proportional
> to CPU cycles executed). If the offset reaches the last mapped
> address of the mm then it starts over at the first
> address.

I believe we will have problems with this. For example, running a large
KVM VM with 512GB memory, using the new defaults in this patch, and
assuming we never go longer per scan than the scan_period_min, it would
take over an hour to scan the entire VM just once.  The defaults could
be changed, but ideally there should be no knobs like this in the final
version, as it should just work well under all conditions.
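
As a rough check of that estimate, here is a minimal sketch of the
arithmetic. The helper name full_pass_ms() is hypothetical and not part of
the patch; it assumes exactly one window of scan_size MB is marked per scan
period and nothing else slows the scan down.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): lower bound on the time
 * needed to cover an address space once, assuming one window of
 * scan_size_mb is marked every scan_period_ms.
 */
static uint64_t full_pass_ms(uint64_t mem_mb, uint64_t scan_size_mb,
			     uint64_t scan_period_ms)
{
	uint64_t windows;

	assert(scan_size_mb > 0);

	/* Round up: a partial window at the end still costs one period. */
	windows = (mem_mb + scan_size_mb - 1) / scan_size_mb;
	return windows * scan_period_ms;
}
```

With 512GB (524288 MB), a 256 MB window and the 2000 ms minimum period this
gives 2048 windows, or 4096 seconds, which is roughly 68 minutes. At the
32000 ms maximum period the same pass would take about 18 hours.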

Also, if such a method is kept, would it be possible to base it on a fixed
number of pages successfully marked instead of a MB range?  The reason I
bring it up is that we often have VMs which are large in their
memory definition, but might not actually have a lot of pages faulted
in.  We could be "scanning" sections of a vma which are not even
present yet.
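
To illustrate what I mean by a page budget, here is a minimal userspace
sketch, not kernel code. The function scan_by_budget() and the present[]
array are hypothetical stand-ins for walking the page tables; the point is
only that the budget is charged per page actually marked, not per address
range covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: present[i] says whether page i is faulted in.
 * Walk forward from 'start', charging the budget only for present pages.
 * Returns the offset to resume from, or 0 if the end was reached.
 */
static size_t scan_by_budget(const bool *present, size_t npages,
			     size_t start, size_t budget, size_t *marked)
{
	size_t pos = start;

	assert(present && marked);
	*marked = 0;

	while (pos < npages && *marked < budget) {
		if (present[pos])
			(*marked)++;	/* only present pages cost budget */
		pos++;
	}

	/* Wrap around once the whole address space has been covered. */
	return pos == npages ? 0 : pos;
}
```

The trade-off in this sketch is that a very sparse address space makes a
single pass walk a long way before the budget is used up, so the walk
itself would probably need its own bound.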

> The per-task nature of the working set sampling functionality in this tree
> allows such constant rate, per task, execution-weight proportional sampling
> of the working set, with an adaptive sampling interval/frequency that
> goes from once per 2 seconds up to just once per 32 seconds.  The current
> sampling volume is 256 MB per interval.

Once a new section is marked, is the previous section automatically
reverted?  If not, I wonder if there's a risk of building up a large
backlog of potential page faults?

> As tasks mature and converge their working set, so does the
> sampling rate slow down to just a trickle, 256 MB per 32
> seconds of CPU time executed.
> 
> This, beyond being adaptive, also rate-limits rarely
> executing systems and does not over-sample on overloaded
> systems.

I am wondering if it would be better to shrink the scan period back to a
much smaller fixed value, and instead of picking 256MB ranges of memory
to mark completely, go back to using all of the address space, but mark
only every Nth page.  N would be adjusted each period to target a rolling
average of X faults per MB per execution time period.  This per-task N
would also be an interesting value for ranking memory access frequency
among tasks and helping to prioritize scheduling decisions.
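
A minimal sketch of how N could be adjusted, under assumptions of my own:
the struct, the function update_stride(), the 1/4 averaging weight, the
doubling/halving steps and the STRIDE_MAX cap are all hypothetical, and
the fault count is assumed to be already normalized per MB by the caller.

```c
#include <assert.h>

#define STRIDE_MAX 1024UL	/* hypothetical upper bound on N */

struct stride_state {
	unsigned long stride;		/* N: mark every Nth page */
	unsigned long avg_faults;	/* rolling average of faults per period */
};

/*
 * Fold the latest sample into the rolling average, then widen the stride
 * when faulting too much and narrow it when faulting well under target.
 */
static unsigned long update_stride(struct stride_state *s,
				   unsigned long faults, unsigned long target)
{
	assert(s && target > 0);

	/* Rolling average with a weight of 1/4 on the new sample. */
	s->avg_faults = (3 * s->avg_faults + faults) / 4;

	if (s->avg_faults > target && s->stride < STRIDE_MAX)
		s->stride *= 2;		/* too many faults: sample fewer pages */
	else if (s->avg_faults < target / 2 && s->stride > 1)
		s->stride /= 2;		/* too few faults: sample more pages */

	return s->stride;
}
```

The dead band between target/2 and target is only there to keep N from
oscillating every period; the actual thresholds would need tuning.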

-Andrew Theurer

> 
> [ In AutoNUMA speak, this patch deals with the effective sampling
>   rate of the 'hinting page fault'. AutoNUMA's scanning is
>   currently rate-limited, but it is also fundamentally
>   single-threaded, executing in the knuma_scand kernel thread,
>   so the limit in AutoNUMA is global and does not scale up with
>   the number of CPUs, nor does it scan tasks in an execution
>   proportional manner.
> 
>   So the idea of rate-limiting the scanning was first implemented
>   in the AutoNUMA tree via a global rate limit. This patch goes
>   beyond that by implementing an execution rate proportional
>   working set sampling rate that is not implemented via a single
>   global scanning daemon. ]
> 
> [ Dan Carpenter pointed out a possible NULL pointer dereference in the
>   first version of this patch. ]
> 
> Based-on-idea-by: Andrea Arcangeli <aarcange@...hat.com>
> Bug-Found-By: Dan Carpenter <dan.carpenter@...cle.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>
> Cc: Andrea Arcangeli <aarcange@...hat.com>
> Cc: Rik van Riel <riel@...hat.com>
> [ Wrote changelog and fixed bug. ]
> Signed-off-by: Ingo Molnar <mingo@...nel.org>
> Signed-off-by: Mel Gorman <mgorman@...e.de>
> Reviewed-by: Rik van Riel <riel@...hat.com>
> ---
>  include/linux/mm_types.h |    3 +++
>  include/linux/sched.h    |    1 +
>  kernel/sched/fair.c      |   61 ++++++++++++++++++++++++++++++++++++----------
>  kernel/sysctl.c          |    7 ++++++
>  4 files changed, 59 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d82accb..b40f4ef 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -406,6 +406,9 @@ struct mm_struct {
>  	 */
>  	unsigned long numa_next_scan;
>  
> +	/* Restart point for scanning and setting pte_numa */
> +	unsigned long numa_scan_offset;
> +
>  	/* numa_scan_seq prevents two threads setting pte_numa */
>  	int numa_scan_seq;
>  #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 241e4f7..6b8a14f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
>  
>  extern unsigned int sysctl_balance_numa_scan_period_min;
>  extern unsigned int sysctl_balance_numa_scan_period_max;
> +extern unsigned int sysctl_balance_numa_scan_size;
>  extern unsigned int sysctl_balance_numa_settle_count;
>  
>  #ifdef CONFIG_SCHED_DEBUG
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9ea13e9..6df5620 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  
>  #ifdef CONFIG_BALANCE_NUMA
>  /*
> - * numa task sample period in ms: 5s
> + * numa task sample period in ms
>   */
> -unsigned int sysctl_balance_numa_scan_period_min = 5000;
> -unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
> +unsigned int sysctl_balance_numa_scan_period_min = 2000;
> +unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
> +
> +/* Portion of address space to scan in MB */
> +unsigned int sysctl_balance_numa_scan_size = 256;
>  
>  static void task_numa_placement(struct task_struct *p)
>  {
> @@ -822,6 +825,9 @@ void task_numa_work(struct callback_head *work)
>  	unsigned long migrate, next_scan, now = jiffies;
>  	struct task_struct *p = current;
>  	struct mm_struct *mm = p->mm;
> +	struct vm_area_struct *vma;
> +	unsigned long offset, end;
> +	long length;
>  
>  	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>  
> @@ -851,18 +857,47 @@ void task_numa_work(struct callback_head *work)
>  	if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
>  		return;
>  
> -	ACCESS_ONCE(mm->numa_scan_seq)++;
> -	{
> -		struct vm_area_struct *vma;
> +	offset = mm->numa_scan_offset;
> +	length = sysctl_balance_numa_scan_size;
> +	length <<= 20;
>  
> -		down_read(&mm->mmap_sem);
> -		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -			if (!vma_migratable(vma))
> -				continue;
> -			change_prot_numa(vma, vma->vm_start, vma->vm_end);
> -		}
> -		up_read(&mm->mmap_sem);
> +	down_read(&mm->mmap_sem);
> +	vma = find_vma(mm, offset);
> +	if (!vma) {
> +		ACCESS_ONCE(mm->numa_scan_seq)++;
> +		offset = 0;
> +		vma = mm->mmap;
> +	}
> +	for (; vma && length > 0; vma = vma->vm_next) {
> +		if (!vma_migratable(vma))
> +			continue;
> +
> +		/* Skip small VMAs. They are not likely to be of relevance */
> +		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
> +			continue;
> +
> +		offset = max(offset, vma->vm_start);
> +		end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
> +		length -= end - offset;
> +
> +		change_prot_numa(vma, offset, end);
> +
> +		offset = end;
> +	}
> +
> +	/*
> +	 * It is possible to reach the end of the VMA list but the last few VMAs are
> +	 * not guaranteed to be vma_migratable. If they are not, we would find the
> +	 * !migratable VMA on the next scan but not reset the scanner to the start,
> +	 * so check it now.
> +	 */
> +	if (!vma) {
> +		ACCESS_ONCE(mm->numa_scan_seq)++;
> +		offset = 0;
> +		vma = mm->mmap;
>  	}
> +	mm->numa_scan_offset = offset;
> +	up_read(&mm->mmap_sem);
>  }
>  
>  /*
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 1359f51..d191203 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname	= "balance_numa_scan_size_mb",
> +		.data		= &sysctl_balance_numa_scan_size,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
>  #endif /* CONFIG_BALANCE_NUMA */
>  #endif /* CONFIG_SCHED_DEBUG */
>  	{
> -- 
> 1.7.9.2
> 