Date:	Sun, 22 Feb 2009 20:38:17 +0100
From:	Ingo Molnar <mingo@...e.hu>
To:	Tejun Heo <tj@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	rusty@...tcorp.com.au, tglx@...utronix.de, x86@...nel.org,
	linux-kernel@...r.kernel.org, hpa@...or.com, jeremy@...p.org,
	cpw@....com
Subject: Re: [PATCHSET x86/core/percpu] implement dynamic percpu allocator


* Tejun Heo <tj@...nel.org> wrote:

> Tejun Heo wrote:
> > I can remove the TLB problem from the non-NUMA case, but for
> > NUMA I still don't have a good idea.  Maybe we need to accept
> > the overhead for NUMA?  I don't know.
> 
> Hmmmm... one thing we can do on NUMA is to remap and free the 
> remapped address and make __pa() and __va() handle that area 
> specially.  It's a bit convoluted but the added overhead 
> should be minimal.  It'll only be a simple range check in 
> __pa()/__va() and it's not like they are super hot paths 
> anyway.  I'll give it a shot.
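
Concretely, the proposed special-casing amounts to something 
like this - a minimal standalone sketch (assumes a 64-bit 
build), where every name and constant (PCPU_REMAP_*, my_pa() 
etc.) is made up for illustration and taken from no actual 
patch:

	#include <stdio.h>

	/* Illustrative constants only - not real kernel values: */
	#define PAGE_OFFSET      0xffff880000000000UL /* linear mapping base */
	#define PCPU_REMAP_START 0xffffe90000000000UL /* remapped percpu area */
	#define PCPU_REMAP_END   (PCPU_REMAP_START + (2UL << 20))
	#define PCPU_REMAP_PHYS  0x0000000100000000UL /* its physical backing */

	/* __pa() with the extra range check for the remapped area: */
	static unsigned long my_pa(unsigned long vaddr)
	{
		if (vaddr >= PCPU_REMAP_START && vaddr < PCPU_REMAP_END)
			return vaddr - PCPU_REMAP_START + PCPU_REMAP_PHYS;
		return vaddr - PAGE_OFFSET;	/* the normal, branch-free case */
	}

	int main(void)
	{
		printf("%lx\n", my_pa(PAGE_OFFSET + 0x1000));    /* 1000 */
		printf("%lx\n", my_pa(PCPU_REMAP_START + 0x10)); /* 100000010 */
		return 0;
	}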

Heck no. It is absolutely crazy to complicate __pa()/__va() in 
_any_ way just to 'save' one more 2MB dTLB entry.

We'll use that TLB entry because that is what TLBs are for: to 
handle mapped pages. Yes, in the percpu scheme we are working on 
we'll have a 'dual' mapping for the static percpu area (on 
64-bit), but mapping aliases have been one of the most basic CPU 
features for the past 15 years ...
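
The same trick is visible from userspace, too: map one piece of 
physical memory at two virtual addresses and you have an alias. 
A small illustrative sketch (nothing kernel-specific in it; 
memfd_create() needs glibc 2.27+, error handling omitted):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* One physical backing, two virtual mappings: an alias. */
		int fd = memfd_create("alias-demo", 0);

		ftruncate(fd, 4096);
		char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		strcpy(a, "stored via the first mapping");
		printf("read via the second: %s\n", b);
		return 0;
	}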

Even a single NOP in the __pa()/__va() path is _more_ expensive 
than that TLB entry, believe me.

Look at last year's cheap quad-core CPU:

 Data TLB: 4MB pages, 4-way associative, 32 entries

Those large-page entries cover the 2MB pages the kernel maps 
with just as well, so that's 32x2MB = 64MB of data reach. Our 
access patterns in the kernel tend to be pretty focused as well, 
so 32 entries are more than enough in practice.

Especially if the pte is cached, a TLB fill is very cheap on 
Intel CPUs. So even if we were thrashing those 32 entries (which 
we are generally not), a dTLB entry for the percpu area is a 
TLB entry well spent.

So let's just do the simplest, most straightforward mapping 
approach which i suggested: it takes advantage of everything 
the hardware gives us and is very close to the best possible 
performance in the cached case - and don't worry about hardware 
resources.

The moment you start worrying about hardware resources on that 
level and start 'optimizing' it in software, you've already 
lost. It leads down the path of soft-TLB handlers and other 
silliness. There's no way you can win such a race against 
hardware fundamentals - at least at today's speed of advance in 
the hw space.

	Ingo