Date:	Thu, 27 Jun 2013 11:35:19 +0800
From:	Daniel J Blueman <daniel@...ascale-asia.com>
To:	Andrew Morton <akpm@...ux-foundation.org>
CC:	Mike Travis <travis@....com>, "H. Peter Anvin" <hpa@...or.com>,
	Nathan Zimmer <nzimmer@....com>, holt@....com, rob@...dley.net,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, yinghai@...nel.org,
	Greg KH <gregkh@...uxfoundation.org>, x86@...nel.org,
	linux-doc@...r.kernel.org,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Steffen Persvold <sp@...ascale.com>
Subject: Re: [RFC] Transparent on-demand memory setup initialization embedded
 in the (GFP) buddy allocator

On Wednesday, June 26, 2013 9:30:02 PM UTC+8, Andrew Morton wrote:
 >
 > On Wed, 26 Jun 2013 11:22:48 +0200 Ingo Molnar <mi...@...nel.org> wrote:
 >
 > > except that on 32 TB
 > > systems we don't spend ~2 hours initializing 8,589,934,592 page heads.
 >
 > That's about a million a second which is crazy slow - even my
 > prehistoric desktop is 100x faster than that.
 >
 > Where's all this time actually being spent?

The directory-lookup architecture needed to make the (intrinsically 
unscalable) cache-coherency protocol scale adds complexity, and gives 
you a ~1us roundtrip to remote NUMA nodes.

Probably a lot of time is spent in memsets and in the RMW cycles that 
set page bits. These are intrinsically synchronous, so the 
initialising core can't get to 12 or so outstanding memory transactions.
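
As a rough sanity check on the figures quoted above, here is a back-of-the-envelope sketch in C. It is not a measurement: the helper names are made up for illustration, 4 KiB pages are assumed, and the model assumes every page costs a fixed number of fully serialised remote accesses with nothing overlapped.

```c
#include <assert.h>
#include <stdint.h>

/* 32 TB of memory in 4 KiB pages: 2^45 / 2^12 = 2^33 page heads. */
static const uint64_t NR_PAGES = 8589934592ULL;

/* Average initialisation rate if all pages are set up in `seconds`. */
static double pages_per_second(uint64_t pages, double seconds)
{
    assert(seconds > 0.0);
    return (double)pages / seconds;
}

/*
 * Hours needed if each page costs `accesses` remote accesses of
 * `latency_us` microseconds each, issued strictly one after another
 * (hypothetical model: no overlap, no local caching).
 */
static double serialised_hours(uint64_t pages, double latency_us,
                               double accesses)
{
    assert(latency_us >= 0.0 && accesses >= 0.0);
    return (double)pages * latency_us * accesses / 1e6 / 3600.0;
}
```

pages_per_second(NR_PAGES, 2 * 3600.0) comes to roughly 1.2 million pages a second, matching the "about a million a second" above. serialised_hours(NR_PAGES, 1.0, 1.0) comes to roughly 2.4 hours, so a single fully synchronous ~1us remote roundtrip per page head is already enough to land in the ~2 hour range.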

Since EFI memory ranges have a flag to state if they are zeroed (which 
may be a fair assumption for memory on non-bootstrap processor NUMA 
nodes), we can probably collapse the RMWs to just writes.

A normal write will require a coherency cycle, then a fetch and a 
writeback when it's evicted from the cache. For this purpose, 
non-temporal writes would eliminate the cache line fetch and give a 
massive increase in bandwidth. We wouldn't even need a store-fence as 
the initialising core is the only one online.
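
To illustrate what a non-temporal write looks like, here is a minimal userspace sketch using the SSE2 intrinsics from <emmintrin.h>. It is not the kernel's page initialisation code: fill_nontemporal() is a hypothetical helper, it assumes an x86 target with SSE2, and it assumes a 16-byte-aligned buffer whose size is a multiple of 16. It keeps a store fence because, unlike the single-core boot case described above, other threads may be running.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi32, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/*
 * Fill `bytes` bytes at `dst` with a repeated 32-bit pattern using
 * non-temporal (streaming) stores, which go through write-combining
 * buffers instead of first fetching each cache line.
 * Assumes: dst is 16-byte aligned and bytes is a multiple of 16.
 */
static void fill_nontemporal(void *dst, uint32_t pattern, size_t bytes)
{
    __m128i value = _mm_set1_epi32((int)pattern);
    __m128i *p = (__m128i *)dst;
    size_t i, n = bytes / sizeof(__m128i);

    assert(((uintptr_t)dst % 16) == 0);
    assert((bytes % 16) == 0);

    for (i = 0; i < n; i++)
        _mm_stream_si128(&p[i], value);  /* MOVNTDQ: bypasses the cache */

    /* Order the streaming stores before any later stores become visible. */
    _mm_sfence();
}
```

With only the initialising core online, the trailing _mm_sfence() is the part that could be dropped, as argued above.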

Daniel
-- 
Daniel J Blueman
Principal Software Engineer, Numascale Asia