Date:	Sun, 22 Feb 2009 20:38:17 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Tejun Heo <tj@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	rusty@...tcorp.com.au, tglx@...utronix.de, x86@...nel.org,
	linux-kernel@...r.kernel.org, hpa@...or.com, jeremy@...p.org,
	cpw@....com
Subject: Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator


* Tejun Heo <tj@...nel.org> wrote:

> Tejun Heo wrote:
> > I can remove the TLB problem from the non-NUMA case, but for
> > NUMA I still don't have a good idea.  Maybe we need to accept
> > the overhead for NUMA?  I don't know.
> 
> Hmmmm... one thing we can do on NUMA is to remap and free the 
> remapped address and make __pa() and __va() handle that area 
> specially.  It's a bit convoluted but the added overhead 
> should be minimal.  It'll only be a simple range check in 
> __pa()/__va() and it's not like they are super hot paths 
> anyway.  I'll give it a shot.
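
Concretely, the proposed special-casing amounts to something 
like this - a minimal standalone sketch (assumes a 64-bit 
build), where every name and constant (PCPU_REMAP_*, my_pa() 
etc.) is made up for illustration and taken from no actual 
patch:

	#include <stdio.h>

	/* Illustrative constants only - not real kernel values: */
	#define PAGE_OFFSET      0xffff880000000000UL /* linear mapping base */
	#define PCPU_REMAP_START 0xffffe90000000000UL /* remapped percpu area */
	#define PCPU_REMAP_END   (PCPU_REMAP_START + (2UL << 20))
	#define PCPU_REMAP_PHYS  0x0000000100000000UL /* its physical backing */

	/* __pa() with the extra range check for the remapped area: */
	static unsigned long my_pa(unsigned long vaddr)
	{
		if (vaddr >= PCPU_REMAP_START && vaddr < PCPU_REMAP_END)
			return vaddr - PCPU_REMAP_START + PCPU_REMAP_PHYS;
		return vaddr - PAGE_OFFSET;	/* the normal, branch-free case */
	}

	int main(void)
	{
		printf("%lx\n", my_pa(PAGE_OFFSET + 0x1000));    /* 1000 */
		printf("%lx\n", my_pa(PCPU_REMAP_START + 0x10)); /* 100000010 */
		return 0;
	}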

Heck no. It is absolutely crazy to complicate __pa()/__va() in 
_any_ way just to 'save' one more 2MB dTLB entry.

We'll use that TLB entry because that is what TLBs are for: to 
handle mapped pages. Yes, in the percpu scheme we are working on 
we'll have a 'dual' mapping for the static percpu area (on 
64-bit), but mapping aliases have been one of the most basic CPU 
features for the past 15 years ...
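
The same trick is visible from userspace, too: map one piece of 
physical memory at two virtual addresses and you have an alias. 
A small illustrative sketch (nothing kernel-specific in it; 
memfd_create() needs glibc 2.27+, error handling omitted):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* One physical backing, two virtual mappings: an alias. */
		int fd = memfd_create("alias-demo", 0);

		ftruncate(fd, 4096);
		char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		strcpy(a, "stored via the first mapping");
		printf("read via the second: %s\n", b);
		return 0;
	}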

Even a single NOP in the __pa()/__va() path is _more_ expensive 
than that TLB entry, believe me.

Look at last year's cheap quad-core CPU:

 Data TLB: 4MB pages, 4-way associative, 32 entries

Those large-page entries cover the 2MB pages the kernel maps 
with just as well, so that's 32x2MB = 64MB of data reach. Our 
access patterns in the kernel tend to be pretty focused as well, 
so 32 entries are more than enough in practice.

Especially if the pte is cached, a TLB fill is very cheap on 
Intel CPUs. So even if we were thrashing those 32 entries (which 
we are generally not), a dTLB entry for the percpu area is a 
TLB entry well spent.

So let's just do the simplest, most straightforward mapping 
approach which i suggested: it takes advantage of everything 
the hardware gives us and is very close to the best possible 
performance in the cached case - and don't worry about hardware 
resources.

The moment you start worrying about hardware resources on that 
level and start 'optimizing' it in software, you've already 
lost. It leads down the path of soft-TLB handlers and other 
silliness. There's no way you can win such a race against 
hardware fundamentals - at least at today's speed of advance in 
the hw space.

	Ingo