Date:	Mon, 10 Dec 2012 21:29:33 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Mel Gorman <mgorman@...e.de>
Cc:	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Paul Turner <pjt@...gle.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Christoph Lameter <cl@...ux.com>,
	Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Johannes Weiner <hannes@...xchg.org>,
	Hugh Dickins <hughd@...gle.com>,
	Arnaldo Carvalho de Melo <acme@...hat.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Mike Galbraith <efault@....de>
Subject: Re: NUMA performance comparison between three NUMA kernels and
 mainline. [Mid-size NUMA system edition.]


* Mel Gorman <mgorman@...e.de> wrote:

> > NUMA convergence latency measurements
> > -------------------------------------
> > 
> > 'NUMA convergence' latency is the number of seconds a 
> > workload takes to reach 'perfectly NUMA balanced' state. 
> > This is measured on the CPU placement side: once that has 
> > converged, memory typically follows within a couple of 
> > seconds.
> 
> This is a sort-of misleading metric, so be wary of it: the 
> speed at which a workload converges is not necessarily useful. 
> It only makes a difference for short-lived workloads or during 
> phase changes. If the workload is short-lived, it's not 
> interesting anyway. If the workload is rapidly changing phases 
> then the migration costs can be a major factor and rapidly 
> converging might actually be slower overall.
> 
> The speed at which the workload converges will depend very heavily on 
> when the PTEs are marked pte_numa and when the faults are 
> incurred. If this is happening very rapidly then a workload 
> will converge quickly *but* this can incur a high system CPU 
> cost (PTE scanning, fault trapping etc).  This metric can be 
> gamed by always scanning rapidly but the overall performance 
> may be worse.
> 
> I'm not saying that this metric is not useful - it is. Just be 
> careful of optimising for it. numacore's system CPU usage has 
> been really high in a number of benchmarks and it may be 
> because you are optimising to minimise time to convergence.

You are missing a big part of the NUMA balancing picture here: 
the primary use of 'latency of convergence' is to determine 
whether a workload converges *at all*.

For example if you look at the 4-process / 8-threads-per-process 
latency results:

                            [ Lower numbers are better. ]
 
  [test unit]            :   v3.7 |balancenuma-v10|  AutoNUMA-v28 |   numa-u-v3   |
 ------------------------------------------------------------------------------------------
  4x8-convergence        :  101.1 |         101.3 |           3.4 |           3.9 |  secs

You'll see that balancenuma does not converge this workload. 

Where does such a workload matter? For example in the 4x JVM 
SPECjbb tests that Thomas Gleixner has reported today:

    http://lkml.org/lkml/2012/12/10/437

There balancenuma does worse than AutoNUMA and the -v3 tree 
exactly because it does not NUMA-converge as well (or at all).

> I'm trying to understand what you're measuring a bit better.  
> Take 1x4 for example -- one process, 4 threads. If I'm reading 
> this description correctly then all 4 threads use the same 
> memory. Is this correct? If so, this is basically a variation 
> of numa01, which is an adverse workload. [...]

No, 1x4 and 1x8 are like the SPECjbb JVM tests you have been 
performing - not an 'adverse' workload. The threads of the JVM 
are sharing memory significantly enough to justify moving them 
onto the same node.
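
To make the memory sharing concrete, the 1xN tests are essentially of 
this shape (a minimal sketch, not the actual test code; thread count, 
buffer size and run time are made-up values):

/*
 * Minimal sketch of a 1xN 'shared memory' worker along the lines of
 * the 1x4/1x8 tests: one process, N threads, all of them reading and
 * writing the same (larger than 2MB) buffer.  Illustrative only -
 * not the benchmark code itself.
 */
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS      4
#define BUF_SIZE        (64UL * 1024 * 1024)    /* 64 MB, well above 2 MB */

static unsigned char *shared_buf;
static volatile int stop;

static void *worker(void *arg)
{
        unsigned long sum = 0;
        size_t i;

        /* Every thread touches the whole shared buffer, so the
         * threads genuinely relate to each other via this memory: */
        while (!stop) {
                for (i = 0; i < BUF_SIZE; i += 64) {
                        sum += shared_buf[i];
                        shared_buf[i] = (unsigned char)sum;
                }
        }
        return (void *)sum;
}

int main(void)
{
        pthread_t threads[NR_THREADS];
        int i;

        shared_buf = calloc(1, BUF_SIZE);
        if (!shared_buf)
                return 1;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);

        sleep(20);      /* fixed run time; the bandwidth runs use 20 secs */
        stop = 1;

        for (i = 0; i < NR_THREADS; i++)
                pthread_join(threads[i], NULL);

        free(shared_buf);
        return 0;
}

When balancing does the right thing, all NR_THREADS of these end up on 
one node with the buffer migrated next to them; when it does not, the 
workload never converges.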

> [...]  balancenuma will not migrate memory in this case as 
> it'll never get past the two-stage filter. If there are few 
> threads, it might never get scheduled on a new node in which 
> case it'll also do nothing.
> 
> The correct action in this case is to interleave memory and 
> spread the tasks between nodes but it lacks the information to 
> do that. [...]

No, the correct action is to move related threads close to each 
other.

> [...] This was deliberate as I was expecting numacore or 
> autonuma to be rebased on top and I didn't want to collide.
> 
> Does the memory requirement of all threads fit in a single 
> node? This is related to my second question -- how do you 
> define convergence?

NUMA convergence means achieving the ideal CPU and memory 
placement of the tasks.
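
If it helps to make that measurable, here is a rough userspace sketch 
of the per-thread check this implies: the buffer a thread works on is 
backed by the node the thread currently runs on. Full convergence 
additionally requires all related threads to end up on the same node. 
The buffer_is_local() helper is hypothetical, error handling and the 
numa_available() check are omitted, and this is not how the numbers 
above were produced - it needs libnuma (-lnuma):

#define _GNU_SOURCE
#include <numa.h>               /* numa_node_of_cpu()              */
#include <numaif.h>             /* get_mempolicy(), MPOL_F_* flags */
#include <sched.h>              /* sched_getcpu()                  */
#include <stddef.h>

/* Return 1 if every page of 'buf' lives on the node this thread runs on. */
static int buffer_is_local(char *buf, size_t size, size_t page_size)
{
        int cpu_node = numa_node_of_cpu(sched_getcpu());
        size_t off;

        for (off = 0; off < size; off += page_size) {
                int page_node = -1;

                /* MPOL_F_NODE | MPOL_F_ADDR: ask the kernel which
                 * node backs the page containing this address. */
                if (get_mempolicy(&page_node, NULL, 0, buf + off,
                                  MPOL_F_NODE | MPOL_F_ADDR) < 0)
                        return 0;
                if (page_node != cpu_node)
                        return 0;       /* at least one remote page */
        }
        return 1;               /* CPU and buffer are on the same node */
}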

> > The 'balancenuma' kernel does not converge any of the 
> > workloads where worker threads or processes relate to each 
> > other.
> 
> I'd like to know if it is because the workload fits on one 
> node. If the buffers are all really small, balancenuma would 
> have skipped them entirely, for example due to this check:
> 
>         /* Skip small VMAs. They are not likely to be of relevance */
>         if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR)
>                 continue;

No, the memory areas are larger than 2MB.
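
(With x86-64 defaults that cutoff works out to:

    HPAGE_PMD_NR = HPAGE_PMD_SIZE / PAGE_SIZE = 2 MB / 4 KB = 512 pages

 i.e. only VMAs smaller than 2 MB get skipped by that check, and these 
 areas are all above that.)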

> Another possible explanation is that in the 4x4 case the 
> processes' threads are getting scheduled on separate nodes. As 
> each thread is sharing data it would not get past the 
> two-stage filter.
> 
> How realistic is it that threads are accessing the same data? 

In practice? Very ...

> That looks like it would be a bad idea even from a caching 
> perspective if the data is being updated. I would expect that 
> the majority of HPC workloads would have each thread accessing 
> mostly private data until the final stages where the results 
> are aggregated together.

You tested such a workload many times in the past: the 4x JVM 
SPECjbb test ...

> > NUMA workload bandwidth measurements
> > ------------------------------------
> > 
> > The other set of numbers I've collected are workload 
> > bandwidth measurements, run over 20 seconds. Using 20 
> > seconds gives a healthy mix of pre-convergence and 
> > post-convergence bandwidth,
> 
> 20 seconds is *really* short. That might not even be enough 
> time for autonuma's knumad thread to find the process and 
> update it, as IIRC it starts pretty slowly.

If you check the convergence latency tables you'll see that 
AutoNUMA is able to converge within 20 seconds.
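
(To put a rough number on what the 20-second window captures: if a 
kernel converges after t seconds, then to a first approximation - my 
simplification, not a measured split - the reported figure is:

    bandwidth ~= (t * bw_pre + (20 - t) * bw_post) / 20

so with AutoNUMA's ~3.4 seconds of 4x8 convergence latency most of the 
window runs in the converged state, while a kernel that never converges 
only ever reports its pre-convergence bandwidth.)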

Thanks,

	Ingo