linux-kernel - Re: AutoNUMA15

Message-ID: <20120626120325.GA25956@redhat.com>
Date:	Tue, 26 Jun 2012 14:03:25 +0200
From:	Andrea Arcangeli <aarcange@...hat.com>
To:	Alex Shi <alex.shi@...el.com>
Cc:	Alex Shi <lkml.alex@...il.com>, Petr Holasek <pholasek@...hat.com>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Hillf Danton <dhillf@...il.com>, Dan Smith <danms@...ibm.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...e.hu>, Paul Turner <pjt@...gle.com>,
	Suresh Siddha <suresh.b.siddha@...el.com>,
	Mike Galbraith <efault@....de>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Lai Jiangshan <laijs@...fujitsu.com>,
	Bharata B Rao <bharata.rao@...il.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Srivatsa Vaddagiri <vatsa@...ux.vnet.ibm.com>,
	Christoph Lameter <cl@...ux.com>,
	"Chen, Tim C" <tim.c.chen@...el.com>
Subject: Re: AutoNUMA15

On Tue, Jun 26, 2012 at 03:52:26PM +0800, Alex Shi wrote:
> Could you like to give a url for the benchmarks?

I posted them to lkml a few months ago; I'm attaching them here again.
There is a more polished version around, but I haven't had time to
test it yet, so for now I'm attaching the old version that I'm still
using to verify the regressions.

If you edit the .c files so that the hard/inverse binds match your
machine, and then build first with -DHARD_BIND and later with
-DINVERSE_BIND, you can measure the raw NUMA effects of your hardware.
numactl --hardware shows the topology, so you can check whether the
binds in the code fit your hardware.
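
As a rough sketch of the three builds this implies (plain, -DHARD_BIND,
-DINVERSE_BIND), the helper below only prints one gcc command per
variant instead of running it. The function name, the output file
names and the -O2 -lnuma -lpthread flags are assumptions, not taken
from the attachments; adjust them to whatever the .c files need.

```sh
# Print (do not run) one gcc command per build variant of a benchmark.
# ASSUMPTION: -O2 and the -lnuma -lpthread link flags are guesses.
print_builds() {
    src=$1
    base=${src%.c}
    for macro in "" HARD_BIND INVERSE_BIND; do
        if [ -z "$macro" ]; then
            # Plain build: no binds compiled in.
            echo "gcc -O2 -o $base $src -lnuma -lpthread"
        else
            # Variant build: the macro selects the bind code path.
            echo "gcc -O2 -D$macro -o ${base}_$macro $src -lnuma -lpthread"
        fi
    done
}

# Usage: print_builds numa01.c
```

Review the printed commands, then pipe them to sh if they look right.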

> memory). find the openjdk has about 2% regression, while jrockit has no

The 2% regression is, in the worst case, the cost of the NUMA hinting
page faults (or, in the best case, a measurement error). It is what
you see when you get no benefit from the vastly increased NUMA
affinity.

You can reduce that overhead to below 1% by multiplying the values of
/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_millisecs and
/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs by 2 or
3. The latter especially: setting it to 15000 will reduce the overhead
by 1%.
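
A minimal sketch of that tuning, assuming the two sysfs files hold
plain integers: the function reads each value, multiplies it by the
given factor and writes it back. The function name and the optional
directory argument are made up for illustration (the argument is there
only so it can be tried on a scratch directory); writing to the real
files needs root.

```sh
# Multiply both knuma_scand sleep tunables by an integer factor.
# $1 = factor (e.g. 2 or 3)
# $2 = directory holding the tunables (hypothetical override for testing)
scale_scan_sleep() {
    factor=$1
    dir=${2:-/sys/kernel/mm/autonuma/knuma_scand}
    for f in scan_sleep_millisecs scan_sleep_pass_millisecs; do
        cur=$(cat "$dir/$f") || return 1
        new=$((cur * factor))
        echo "$new" > "$dir/$f" || return 1
        # Report old and new value so the change can be undone by hand.
        echo "$f: $cur -> $new"
    done
}

# Usage (as root): scale_scan_sleep 2
```

The values are not persistent, so they have to be set again after a
reboot.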

The current AutoNUMA defaults are hyper aggressive. With benchmarks
running for several minutes you can easily reduce AutoNUMA's
aggressiveness and pay a lower fixed cost in NUMA hinting page faults
without reducing overall performance.

The boost when you use AutoNUMA is >20%, sometimes as high as 100%, so
the 2% is lost in the noise, but over time we should reduce it
(especially with a hypervisor-tuned profile for those cloud nodes
which only run virtual machines, which in turn have quite constant
loads, so there's no need to react that fast).

> the testing user 2 instances, each of them are pinned to a node. some
> setting is here:

Ok, the problem is that you must not pin anything. If you hard pin,
AutoNUMA won't do anything on those processes.
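
To double check that a process really is unpinned, one option (a
sketch, not part of AutoNUMA) is to print the CPU and memory-node
lists the kernel reports in /proc/PID/status and compare them with the
full topology from numactl --hardware. The function name and the
optional proc-root argument are hypothetical.

```sh
# Show which CPUs and memory nodes a task is allowed to use.
# $1 = pid
# $2 = proc root (hypothetical override so the function can be tested)
show_binding() {
    pid=$1
    proc=${2:-/proc}
    grep -E '^(Cpus|Mems)_allowed_list:' "$proc/$pid/status"
}

# Usage: show_binding 1234    # 1234 is a placeholder pid
```

If either list is narrower than what the machine has, the task is
still bound to a subset of CPUs or nodes.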

It is impossible to run faster than raw hard pinning: AutoNUMA also
has to migrate memory, while hard pinning avoids all memory
migrations.

AutoNUMA aims to get as close to hard pinning performance as possible
without having to use hard pinning; that's the whole point.

So this explains why you measure a 2% regression or no difference:
with hard pins used at all times, only the AutoNUMA worst case
overhead can be measured (and I explained above how it can be
reduced).

A plan I can suggest for this benchmark is this:

1) "upstream default"
  - no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
  - no hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=n
  - CONFIG_TRANSPARENT_HUGEPAGE=y

2) "autonuma"
  - no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
  - no hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=y
  - CONFIG_AUTONUMA_DEFAULT_ENABLED=y
  - CONFIG_TRANSPARENT_HUGEPAGE=y

3) "autonuma lower numa hinting page fault overhead"
  - no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
  - no hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=y
  - CONFIG_AUTONUMA_DEFAULT_ENABLED=y
  - CONFIG_TRANSPARENT_HUGEPAGE=y
  - echo 15000 >/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs

4) "upstream hard pinning and transparent hugepage"
  - hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=n
  - CONFIG_TRANSPARENT_HUGEPAGE=y

5) "upstream hard pinning and hugetlbfs"
  - hugetlbfs
  - hard pinning of CPUs or memory to nodes
  - CONFIG_AUTONUMA=n
  - CONFIG_TRANSPARENT_HUGEPAGE=y (y/n won't matter if you use hugetlbfs)

Then you can compare 1/2/3/4/5.
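
One way to line the runs up, sketched under the assumption that each
run yields a single higher-is-better score: feed "label score" lines
on stdin with configuration 1 first, and the function prints every
run's percentage difference against that first line. The function
name, the input format and the results.txt file are hypothetical.

```sh
# Read "label score" lines; print each label with its percentage
# difference relative to the first line (the baseline).
compare_to_first() {
    awk 'NR == 1 { base = $2 }
         { printf "%s %+.1f%%\n", $1, ($2 - base) * 100 / base }'
}

# Usage: compare_to_first < results.txt
# where results.txt holds one line per configuration, baseline first.
```

For a lower-is-better metric such as elapsed time the sign reads the
other way round.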

The minimum to make a meaningful comparison is 1 vs 2. The next best
comparison is 1 vs 2 vs 4 (4 is a very useful reference too, because
the closer AutoNUMA gets to 4 the better! Beating 1 is trivial;
getting very close to 4 is less easy because 4 isn't migrating any
memory).

Running 3 and 5 is optional; I mentioned 5 only because you wanted to
run with hugetlbfs and not just THP.

> jrockit use hugetlb and its options:

hugetlbfs should be disabled when AutoNUMA is enabled, because
AutoNUMA won't try to migrate hugetlbfs memory (not that it makes any
difference if the memory is hard pinned). THP should deliver the same
performance as hugetlbfs for the JVM, and THP memory can be migrated
by AutoNUMA (as can mmapped not-shared pagecache, not just anon
memory).
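
Before an AutoNUMA run it may help to confirm which kind of huge pages
the machine is set up for. The sketch below prints the THP mode and
the hugetlbfs pool counters; the function name and the two path
arguments are hypothetical (the arguments exist only so it can be
pointed at test files).

```sh
# Print the THP mode and the hugetlbfs pool counters.
# $1 = THP "enabled" file, $2 = meminfo file (overrides for testing)
hugepage_state() {
    thp=${1:-/sys/kernel/mm/transparent_hugepage/enabled}
    meminfo=${2:-/proc/meminfo}
    # The active THP mode is the word shown in [brackets].
    echo "THP: $(cat "$thp")"
    # A non-zero HugePages_Total means a hugetlbfs pool is reserved.
    grep -E '^HugePages_(Total|Free):' "$meminfo"
}

# Usage: hugepage_state
```

A reserved pool alone does not mean the JVM uses it; that still
depends on the JVM options.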

Thanks a lot, and I'm looking forward to seeing how things go when you
remove the hard pins.

Andrea

View attachment "numa01.c" of type "text/x-c" (3046 bytes)

View attachment "numa02.c" of type "text/x-c" (2139 bytes)