Date:	Mon, 9 Sep 2013 11:48:23 -0500
From:	Alex Thorlton <athorlton@....com>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Robin Holt <robinmholt@...il.com>,
	"Kirill A. Shutemov" <kirill@...temov.name>,
	Dave Hansen <dave.hansen@...el.com>,
	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>,
	"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	"Eric W . Biederman" <ebiederm@...ssion.com>,
	Sedat Dilek <sedat.dilek@...il.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Dave Jones <davej@...hat.com>,
	Michael Kerrisk <mtk.manpages@...il.com>,
	"Paul E . McKenney" <paulmck@...ux.vnet.ibm.com>,
	David Howells <dhowells@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Al Viro <viro@...iv.linux.org.uk>,
	Oleg Nesterov <oleg@...hat.com>,
	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Kees Cook <keescook@...omium.org>
Subject: Re: [PATCH 1/8] THP: Use real address for NUMA policy

On Thu, Sep 05, 2013 at 01:15:10PM +0200, Ingo Molnar wrote:
> 
> * Alex Thorlton <athorlton@....com> wrote:
> 
> > > Robin,
> > > 
> > > I tweaked one of our other tests to behave pretty much exactly as I described:
> > > - malloc a large array
> > > - Spawn a specified number of threads
> > > - Have each thread touch small, evenly spaced chunks of the array (e.g.
> > >   for 128 threads, the array is divided into 128 chunks, and each thread
> > >   touches 1/128th of each chunk, dividing the array into 16,384 pieces)
> > 
> > Forgot to mention that the threads don't touch their chunks of memory
> > concurrently, i.e. thread 2 has to wait for thread 1 to finish first.
> > This is important to note, since the pages won't all get stuck on the
> > first node without this behavior.
> 
> Could you post the testcase please?
> 
> Thanks,
> 
> 	Ingo

Sorry for the delay here, had to make sure that everything in my tests
was okay to push out to the public.  Here's a pointer to the test I
wrote:

ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz

Everything to compile the test should be there (just run make in the
thp_pthread directory).  To run the test use something like:

time ./thp_pthread -C 0 -m 0 -c <max_cores> -b <memory>
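
For anyone who doesn't want to grab the tarball, here's a rough
standalone sketch of the access pattern described above.  This is *not*
the actual thp_pthread source -- the names, sizes, and option handling
below are made up purely for illustration:

/*
 * Toy version of the access pattern: malloc one big array, spawn N
 * threads, and have each thread touch a small, evenly spaced piece of
 * every chunk -- one thread at a time, since (per the note above) the
 * pages don't all end up stuck on the first node without that.
 *
 * Build:  gcc -O2 -pthread sketch.c -o sketch
 * Run:    ./sketch <nthreads> <bytes>
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *array;
static size_t array_size;
static long nthreads;
static long turn;                 /* which thread may touch memory next */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    size_t chunk = array_size / nthreads;   /* one chunk per thread */
    size_t stride = chunk / nthreads;       /* 1/N of each chunk    */

    /* Thread N waits for thread N-1, mirroring the "not concurrent"
     * detail above. */
    pthread_mutex_lock(&lock);
    while (turn != id)
        pthread_cond_wait(&cond, &lock);

    /* Touch one small, evenly spaced piece inside every chunk. */
    for (long c = 0; c < nthreads; c++)
        memset(array + c * chunk + id * stride, 1, stride);

    turn++;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <nthreads> <bytes>\n", argv[0]);
        return 1;
    }
    nthreads = atol(argv[1]);
    array_size = strtoull(argv[2], NULL, 0);
    if (nthreads <= 0 || array_size == 0)
        return 1;

    array = malloc(array_size);
    if (!array) {
        perror("malloc");
        return 1;
    }

    pthread_t *tids = calloc(nthreads, sizeof(*tids));
    for (long i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    free(array);
    return 0;
}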

I ran:

time ./thp_pthread -C 0 -m 0 -c 128 -b 128g

on a 256-core machine with ~500GB of memory, and got these results:

THP off:

real	0m57.797s
user	46m22.156s
sys	6m14.220s

THP on:

real	1m36.906s
user	0m2.612s
sys	143m13.764s

I snagged some code from another test we use, so I can't vouch for the
usefulness/accuracy of all the output (actually, I know some of it is
wrong).  I've mainly been looking at the total run time.

Don't want to bloat this e-mail up with too many test results, but I
found this one really interesting.  Same machine, using all the cores,
with the same amount of memory.  This means that each cpu is actually
doing *less* work, since the chunk we reserve gets divided up evenly
amongst the cpus (128g over 256 cpus is ~512mb per cpu, versus ~1gb per
cpu in the 128-core run):

time ./thp_pthread -C 0 -m 0 -c 256 -b 128g

THP off:

real	1m1.028s
user	104m58.448s
sys	8m52.908s

THP on:

real	2m26.072s
user	60m39.404s
sys	337m10.072s

Seems that the test scales really well in the THP off case, but, once
again, with THP on, we really see the performance start to degrade.
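
(As an aside: a quick way to sanity-check where a buffer's pages end up
is move_pages(2) with a NULL nodes argument, which doesn't migrate
anything and just reports the node each page currently lives on.
Illustrative snippet only, not part of the test -- link with -lnuma:)

/* Report the NUMA node of each page in a freshly touched buffer.
 * Build:  gcc -O2 whereis.c -o whereis -lnuma */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t page = sysconf(_SC_PAGESIZE);
    size_t npages = 16;
    char *buf = malloc(npages * page);
    void **pages = calloc(npages, sizeof(*pages));
    int *status = calloc(npages, sizeof(*status));

    if (!buf || !pages || !status)
        return 1;

    memset(buf, 1, npages * page);          /* fault the pages in */
    for (size_t i = 0; i < npages; i++)
        pages[i] = buf + i * page;

    /* nodes == NULL: don't move anything, just fill status[] with the
     * node each page currently resides on. */
    if (move_pages(0, npages, pages, NULL, status, 0) != 0) {
        perror("move_pages");
        return 1;
    }
    for (size_t i = 0; i < npages; i++)
        printf("page %zu -> node %d\n", i, status[i]);
    return 0;
}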

I'm planning to start investigating possible ways to split up THPs, if
we detect that the majority of the references to a THP are off-node.
I've heard some horror stories about migrating pages in this situation
(i.e., process switches cpu and then all the pages follow it), but I
think we might be able to get some better results if we can cleverly
determine an appropriate time to split up pages.  I've heard a bit of
talk about doing something similar to this from a few people, but
haven't seen any code/test results.  If anybody has any input on that
topic, it would be greatly appreciated.
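
Just to make "majority of the references are off-node" concrete, the
check itself is trivial -- something like the toy sketch below (plain
userspace C, emphatically not kernel code; where the per-node reference
counts would actually come from is the open question):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-THP reference counters. */
struct thp_refs {
    unsigned long local;    /* references from the THP's home node */
    unsigned long remote;   /* references from any other node      */
};

/* Split once off-node references dominate by some threshold. */
static bool should_split_thp(const struct thp_refs *r, double threshold)
{
    unsigned long total = r->local + r->remote;

    return total && (double)r->remote / total > threshold;
}

int main(void)
{
    struct thp_refs r = { .local = 40, .remote = 160 };

    printf("split? %s\n", should_split_thp(&r, 0.5) ? "yes" : "no");
    return 0;
}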

- Alex
