linux-kernel - Re: [RFC] AutoNUMA alpha6

Message-ID: <20120321075349.GB24997@gmail.com>
Date:	Wed, 21 Mar 2012 08:53:49 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Dan Smith <danms@...ibm.com>
Cc:	Andrea Arcangeli <aarcange@...hat.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...e.hu>, Paul Turner <pjt@...gle.com>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	Mike Galbraith <efault@....de>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Bharata B Rao <bharata.rao@...il.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC] AutoNUMA alpha6


* Dan Smith <danms@...ibm.com> wrote:

> On your numa01 test:
> 
>   Autonuma is 22% faster than mainline
>   Numasched is 42% faster than mainline
> 
> On Peter's modified stream_d test:
> 
>   Autonuma is 35% *slower* than mainline
>   Numasched is 55% faster than mainline
> 
> I know that the "real" performance guys here are going to be 
> posting some numbers from more interesting benchmarks soon, 
> but since nobody had answered Andrea's question, I figured I'd 
> do it.

It would also be nice to find and run *real* HPC workloads that 
were not written by Andrea or Peter and which compute something 
non-trivial and real - and then compare the various methods.

Ideally we'd like to measure the two conceptual working set 
corner cases:

  - global working set HPC with a large shared working set:

      - Many types of Monte-Carlo optimizations tend to be
        like this - they have a large shared time series and
        threads compute on those with comparatively little
        private state.

      - 3D rendering with physical modelling: a large, complex
        3D scene set with private worker threads. (Much of this 
        tends to be done on GPUs these days, though.)

  - private working set HPC with little shared/global working 
    set and lots of per process/thread private memory 
    allocations:

      - Quantum chemistry optimization runs tend to be like this
        with their matrices that are often gigabytes in size.

      - Gas, fluid, solid state and gravitational particle
        simulations - most ab initio methods tend to have very
        little global shared state, each thread iterates its own
        version of the universe.

      - More complex runs of ray tracing as well IIRC.

My impression is that while threading is on the rise due to its 
ease of use, many threaded HPC workloads still fall into the 
second category.

In fact they are often explicitly *turned* into the second 
category at the application level by duplicating shared global 
data explicitly and turning it into per thread local data.
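As a rough illustration of that application-level transformation, 
here is a minimal userspace sketch using pthreads. All names in it 
(shared_series, worker, run_private_copies, N_ELEMS, MAX_THREADS) 
are hypothetical and not taken from any workload or patch in this 
thread. It assumes the default local-allocation (first-touch) policy, 
under which memory first written by a thread tends to be placed on 
the node that thread is running on:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define N_ELEMS		1024	/* hypothetical size of the shared data */
#define MAX_THREADS	64	/* hypothetical upper bound on workers */

/* Shared, read-mostly input that every worker thread needs. */
static double shared_series[N_ELEMS];

struct worker_arg {
	double sum;		/* per-thread result */
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;
	double *local;
	size_t i;

	/*
	 * Allocate and fill the private copy from inside the thread.
	 * Assumption: with first-touch placement the copy is likely
	 * to end up on the node this thread currently runs on.
	 */
	local = malloc(sizeof(shared_series));
	if (!local)
		return NULL;
	memcpy(local, shared_series, sizeof(shared_series));

	/* From here on the thread only reads its own copy. */
	arg->sum = 0.0;
	for (i = 0; i < N_ELEMS; i++)
		arg->sum += local[i];

	free(local);
	return arg;
}

/* Returns 0 on success, -1 on failure; sums[i] is thread i's result. */
int run_private_copies(int nthreads, double *sums)
{
	pthread_t tid[MAX_THREADS];
	struct worker_arg args[MAX_THREADS];
	int i, started = 0, ret = 0;

	assert(sums != NULL);
	if (nthreads < 1 || nthreads > MAX_THREADS)
		return -1;

	/* Initialise the shared data before any worker starts. */
	for (i = 0; i < N_ELEMS; i++)
		shared_series[i] = 1.0;

	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[i], NULL, worker, &args[i])) {
			ret = -1;
			break;
		}
		started++;
	}

	for (i = 0; i < started; i++) {
		void *res;

		pthread_join(tid[i], &res);
		if (!res)
			ret = -1;
		else
			sums[i] = args[i].sum;
	}
	return ret;
}
```

The cost is one copy of the data per thread, so this only pays off 
when the shared data is read-mostly and small compared to the 
per-thread working set.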

So we need to cover these major HPC usecases - we won't merge 
any of this based on just synthetic benchmarks.

And to default-enable any of this on stock kernels we'd need 
even more testing and widespread, feel-good speedups in almost 
every key Linux workload... I don't see that happening though, 
so the best we can get are probably some easy and flexible knobs 
for HPC.

Thanks,

	Ingo