Message-ID: <20120321075349.GB24997@gmail.com>
Date: Wed, 21 Mar 2012 08:53:49 +0100
From: Ingo Molnar <mingo@...nel.org>
To: Dan Smith <danms@...ibm.com>
Cc: Andrea Arcangeli <aarcange@...hat.com>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...e.hu>, Paul Turner <pjt@...gle.com>,
Suresh Siddha <suresh.b.siddha@...el.com>,
Mike Galbraith <efault@....de>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Lai Jiangshan <laijs@...fujitsu.com>,
Bharata B Rao <bharata.rao@...il.com>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Rik van Riel <riel@...hat.com>,
Johannes Weiner <hannes@...xchg.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [RFC] AutoNUMA alpha6
* Dan Smith <danms@...ibm.com> wrote:
> On your numa01 test:
>
> Autonuma is 22% faster than mainline
> Numasched is 42% faster than mainline
>
> On Peter's modified stream_d test:
>
> Autonuma is 35% *slower* than mainline
> Numasched is 55% faster than mainline
>
> I know that the "real" performance guys here are going to be
> posting some numbers from more interesting benchmarks soon,
> but since nobody had answered Andrea's question, I figured I'd
> do it.
It would also be nice to find and run *real* HPC workloads that
were not written by Andrea or Peter and which compute something
non-trivial and real - and then compare the various methods.
Ideally we'd like to measure the two conceptual working set
corner cases (a minimal sketch of both access patterns follows
the list below):

 - global working set HPC with a large shared working set:

     - Many types of Monte-Carlo optimizations tend to be
       like this - they have a large shared time series and
       threads compute on it with comparatively little
       private state.

     - 3D rendering with physical modelling: a large, complex
       3D scene set with private worker threads. (Much of this
       tends to be done on GPUs these days though.)

 - private working set HPC with little shared/global working
   set and lots of per process/thread private memory
   allocations:

     - Quantum chemistry optimization runs tend to be like
       this, with matrices that are often gigabytes in size.

     - Gas, fluid, solid state and gravitational particle
       simulations - most ab initio methods tend to have very
       little global shared state; each thread iterates its own
       version of the universe.

     - More complex runs of ray tracing as well, IIRC.
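
(Illustration only, not from any patch set in this thread: a
minimal pthreads sketch of the two corner cases - one phase where
all threads scan a single shared array, one where each thread only
touches its own private allocation. All names and sizes below are
made up.)

#include <pthread.h>
#include <stdlib.h>

#define NR_THREADS      8
#define WSET_BYTES      (256UL << 20)   /* 256 MB per working set */
#define WSET_DOUBLES    (WSET_BYTES / sizeof(double))

static double *shared_wset;     /* corner case 1: one big shared array */

/* Corner case 1: every thread scans the same global working set. */
static void *global_worker(void *arg)
{
        volatile double sum = 0.0;
        unsigned long i;

        for (i = 0; i < WSET_DOUBLES; i++)
                sum += shared_wset[i];  /* remote accesses if placed badly */
        return NULL;
}

/* Corner case 2: each thread touches only its own private allocation. */
static void *private_worker(void *arg)
{
        double *wset = calloc(WSET_DOUBLES, sizeof(double));
        unsigned long i;

        for (i = 0; i < WSET_DOUBLES; i++)
                wset[i] += 1.0;         /* purely thread-local traffic */
        free(wset);
        return NULL;
}

static void run_phase(void *(*fn)(void *))
{
        pthread_t tid[NR_THREADS];
        int i;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, fn, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);
}

int main(void)
{
        shared_wset = calloc(WSET_DOUBLES, sizeof(double));

        run_phase(global_worker);       /* global working set phase */
        run_phase(private_worker);      /* private working set phase */

        free(shared_wset);
        return 0;
}

(Build with gcc -O2 -pthread; watching the two phases under
numastat or perf should show the first phase generating cross-node
traffic unless the placement scheme gets it right, while the
second stays node-local under plain first-touch.)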
My impression is that while threading is on the rise due to its
ease of use, many threaded HPC workloads still fall into the
second category.
In fact they are often *turned* into the second category at the
application level, by explicitly duplicating shared global data
and turning it into per-thread local data.
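
(Again just a hypothetical sketch of that transformation: each
worker memcpy()s the shared table into a local buffer at startup,
so under the kernel's default first-touch placement the copy ends
up on the node the thread is running on - assuming the thread does
not get migrated across nodes. All names are made up.)

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS      4
#define TABLE_BYTES     (64UL << 20)    /* 64 MB shared lookup table */
#define TABLE_DOUBLES   (TABLE_BYTES / sizeof(double))

static double *global_table;            /* filled once by the main thread */

static void *worker(void *arg)
{
        double *local_table = malloc(TABLE_BYTES);
        volatile double sum = 0.0;
        unsigned long i;

        /*
         * Duplicate the shared table up front: the copy is faulted
         * in by this thread, so under first-touch placement it
         * lands on the node this thread is running on.
         */
        memcpy(local_table, global_table, TABLE_BYTES);

        for (i = 0; i < TABLE_DOUBLES; i++)
                sum += local_table[i];  /* node-local from here on */

        free(local_table);
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_THREADS];
        int i;

        global_table = calloc(TABLE_DOUBLES, sizeof(double));

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);

        free(global_table);
        return 0;
}

(The cost is NR_THREADS times the memory footprint - which is
exactly the tradeoff that automatic placement schemes like the
ones discussed here are trying to make unnecessary.)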
So we need to cover these major HPC use cases - we won't merge
any of this based on just synthetic benchmarks.
And to default-enable any of this on stock kernels we'd need
even more testing and widespread, feel-good speedups in almost
every key Linux workload... I don't see that happening though,
so the best we can get is probably some easy and flexible knobs
for HPC.
Thanks,
Ingo