Message-ID: <20150911150544.GL25655@suse.de>
Date:	Fri, 11 Sep 2015 16:05:44 +0100
From:	Mel Gorman <mgorman@...e.de>
To:	Rik van Riel <riel@...hat.com>
Cc:	linux-kernel@...r.kernel.org, peterz@...radead.org,
	mingo@...nel.org, Andrea Arcangeli <aarcange@...hat.com>,
	Jan Stancek <jstancek@...hat.com>
Subject: Re: [PATCH] sched,numa: limit amount of virtual memory scanned in
 task_numa_work

On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
> of memory per invocation, but only counts memory areas that have at
> least one PTE that is still present and not marked for numa hint faulting.
> 
> It will skip over arbitrarily large amounts of memory that are either
> unused, full of swap ptes, or full of PTEs that were already marked
> for NUMA hint faults but have not been faulted on yet.
> 

This was deliberate and intended to cover the case whereby, for a process
sparsely using its address space, the scanner would quickly skip over the
sparse portions and reach the active portions. Obviously you've found
that this is not always a great idea.

> This can cause excessive amounts of CPU use, due to there being
> essentially no upper limit on the scan rate of very large processes
> that are not yet in a phase where they are actively accessing old
> memory pages (eg. they are still initializing their data).
> 
> Avoid that problem by placing an upper limit on the amount of virtual
> memory that task_numa_work scans in each invocation. This can be a
> higher limit than "pages", to ensure the task still skips over unused
> areas fairly quickly.
> 
> While we are here, also fix the "nr_pte_updates" logic, so it only
> counts page ranges with ptes in them.
> 
> Signed-off-by: Rik van Riel <riel@...hat.com>
> Reported-by: Andrea Arcangeli <aarcange@...hat.com>
> Reported-by: Jan Stancek <jstancek@...hat.com>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..ff51b559ccaf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
>  	struct vm_area_struct *vma;
>  	unsigned long start, end;
>  	unsigned long nr_pte_updates = 0;
> -	long pages;
> +	long pages, virtpages;
>  
>  	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>  
> @@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
>  	start = mm->numa_scan_offset;
>  	pages = sysctl_numa_balancing_scan_size;
>  	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
> +	virtpages = pages * 8;	   /* Scan up to this much virtual space */
>  	if (!pages)
>  		return;
>  
> +
>  	down_read(&mm->mmap_sem);
>  	vma = find_vma(mm, start);
>  	if (!vma) {
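
For concreteness: with the default sysctl_numa_balancing_scan_size of
256 (MB) and 4K pages (PAGE_SHIFT == 12, as on x86-64) this works out
to

	pages     = 256 << (20 - 12) = 65536 pages   (256MB)
	virtpages = 65536 * 8        = 524288 pages  (2GB)

so an invocation still updates at most 256MB worth of present PTEs, but
never walks more than 2GB of virtual address space to find them.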


> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
> -			nr_pte_updates += change_prot_numa(vma, start, end);
> +			nr_pte_updates = change_prot_numa(vma, start, end);
>  

Are you *sure* about this particular change?

The intent is that sparse space be skipped until the first updated PTE
is found, and that sysctl_numa_balancing_scan_size pages be scanned
after that. With this change, if we find a single populated PTE in the
middle of a sparse space, then the next empty range resets
nr_pte_updates and we stop decrementing "pages" in the check below. The
virtpages check protects you from a lot of scanning, but this part of
the fix does not seem necessary. It has an odd side-effect whereby we
possibly scan more with this patch in some cases.
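
To make that concrete, here is a minimal userspace model of the two
counting variants. It is an illustration only, not the kernel loop:
upd[] stands in for what change_prot_numa() would report per scanned
range, and the budgets are shrunk so the trace stays short.

#include <stdio.h>

static int ranges_scanned(const long *upd, int n, long pages,
			  long virtpages, int accumulate)
{
	long nr_pte_updates = 0;
	int i, scanned = 0;

	for (i = 0; i < n; i++) {
		scanned++;
		if (accumulate)
			nr_pte_updates += upd[i];	/* old "+=" behaviour */
		else
			nr_pte_updates = upd[i];	/* patched "=" behaviour */
		if (nr_pte_updates)
			pages--;		/* one range of scan budget */
		virtpages--;
		if (pages <= 0 || virtpages <= 0)
			break;
	}
	return scanned;
}

int main(void)
{
	/* one populated range in the middle of otherwise sparse space */
	const long upd[] = { 0, 0, 512, 0, 0, 0, 0, 0, 0, 0 };
	int n = sizeof(upd) / sizeof(upd[0]);

	printf("+=: scanned %d ranges\n", ranges_scanned(upd, n, 2, 8, 1));
	printf("= : scanned %d ranges\n", ranges_scanned(upd, n, 2, 8, 0));
	return 0;
}

This prints 4 ranges for "+=" and 8 for "=": once the populated range
is seen, "+=" keeps nr_pte_updates non-zero so every later range drains
the pages budget and the walk stops early, while "=" lets the empty
ranges reset it, so pages stops draining and the walk only ends when
virtpages runs out. That is the extra scanning referred to above.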

>  			/*
> -			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * at least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped.
> +			 * Try to scan sysctl_numa_balancing_scan_size worth of
> +			 * hpages that have at least one present PTE that
> +			 * is not already pte-numa. If the VMA contains
> +			 * areas that are unused or already full of prot_numa
> +			 * PTEs, scan up to virtpages, to skip through those
> +			 * areas faster.
>  			 */
>  			if (nr_pte_updates)
>  				pages -= (end - start) >> PAGE_SHIFT;
> +			virtpages -= (end - start) >> PAGE_SHIFT;
>  

It's a pity there will potentially be a lot of useless dead scanning on
those processes but caching start addresses is both outside the scope of
this patch and has its own problems.

>  			start = end;
> -			if (pages <= 0)
> +			if (pages <= 0 || virtpages <= 0)
>  				goto out;
>  
>  			cond_resched();
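
So, with the defaults worked out above, a pass now ends on whichever
budget drains first: 64K pages actually updated (pages <= 0) or 512K
pages of virtual space walked (virtpages <= 0), with cond_resched()
still giving up the CPU between ranges.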

-- 
Mel Gorman
SUSE Labs
