Date:	Fri, 16 Apr 2010 09:58:46 -0400
From:	jamal <hadi@...erus.ca>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	Changli Gao <xiaosuo@...il.com>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Rick Jones <rick.jones2@...com>,
	David Miller <davem@...emloft.net>, therbert@...gle.com,
	netdev@...r.kernel.org, robert@...julf.net
Subject: Re: rps performance WAS (Re: rps: question)

On Fri, 2010-04-16 at 15:37 +0200, Andi Kleen wrote:
> On Fri, Apr 16, 2010 at 09:27:35AM -0400, jamal wrote:

> > So you are saying that the old implementation of IPI (likely what I
> > tried pre-napi and as recently as 2-3 years ago) was bad because of a
> > single lock?
> 
> Yes.

> The old implementation of smp_call_function. Also in the really old
> days there was no smp_call_function_single() so you tended to broadcast.
> 
> Jens did a lot of work on this for his block device work IPI implementation.

Nice - thanks for that info! So not only has the hardware improved, but
the implementation as well.
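
(For anyone following along, here is roughly what the two call paths
being discussed look like in a current kernel - a minimal sketch from
memory, so treat the exact signatures as my assumption:)

	#include <linux/smp.h>

	static void remote_work(void *info)
	{
		/* runs on the target CPU, in IPI/interrupt context */
	}

	static void kick_cpus(int target_cpu)
	{
		/* old style: broadcast to all other CPUs, wait == 1 */
		smp_call_function(remote_work, NULL, 1);

		/* newer style: target exactly one CPU instead */
		smp_call_function_single(target_cpu, remote_work, NULL, 1);
	}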

> > On IPIs:
> > Is anyone familiar with what is going on with Nehalem? Why is it this
> > good? I expect things will get a lot nastier with other hardware like
> > Xeon-based systems or even Nehalem with rps going across QPI.
> 
> Nehalem is just fast. I don't know why it's fast in your specific
> case. It might be simply because it has lots of bandwidth everywhere.
> Atomic operations are also faster than on previous Intel CPUs.

Well, the cache architecture is nicer. The on-die memory controller is
nice - no more shared memory controller hub/FSB. The three memory
channels are nice. Intel finally beating AMD ;-> Someone did a
measurement of the memory timings (L1, L2, L3, main memory) and the
results were impressive; I have the numbers somewhere.

> 
> > Here's why I think IPIs are bad, please correct me if I am wrong:
> > - they are synchronous, i.e. an IPI issuer has to wait for an ACK (which
> > is in the form of an IPI).
> 
> In the hardware there's no ack, but in the Linux implementation there
> usually is one (because the caller needs to know when to free the stack
> state used to pass information)
>
> However there's also now support for queued IPI
> with a special API (I believe Tom is using that)
> 

Which is the non-queued-IPI call?
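
(My rough guess at what the queued path Tom is using looks like - a
sketch from memory, so the helper and structure names are assumptions
on my part; correct me if the API is different:)

	#include <linux/smp.h>
	#include <linux/percpu.h>

	static void rps_trigger(void *info)
	{
		/* e.g. raise NET_RX_SOFTIRQ on the remote CPU */
	}

	static DEFINE_PER_CPU(struct call_single_data, my_csd);

	static void queue_ipi_to(int cpu)
	{
		struct call_single_data *csd = &per_cpu(my_csd, cpu);

		csd->func = rps_trigger;
		csd->info = NULL;
		/* wait == 0: queue the request and return immediately,
		 * without waiting for the remote CPU to ack */
		__smp_call_function_single(cpu, csd, 0);
	}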

> > - data cache has to be synced to main memory
> > - the instruction pipeline is flushed
> 
> At least on Nehalem the data transfer can often go through the cache.

I thought you had to go all the way to main memory in the case of IPIs.

> IPIs involve APIC accesses which are not very fast (so overall
> it's far more than a pipeline's worth of work), but it's still
> not an incredibly expensive operation.
> 
> There's also X2APIC now which should be slightly faster, but it's 
> likely not in your Nehalem (this is only in the high-end Xeon versions)
> 

Ok, true - forgot about the APIC as well...

> > Do you know any specs i could read up which will tell me a little more?
> 
> If you're just interested in IPI and cache line transfer performance it's
> probably best to just measure it.

There are tools like benchit which would give me the L1/L2/L3/main
memory measurements; for IPIs, the ping + rps test I did may be
sufficient.
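
(If I get time I may also try a direct micro-measurement; something
like this toy module - a sketch only, assuming the target CPU number
is a remote, online CPU:)

	#include <linux/module.h>
	#include <linux/smp.h>
	#include <linux/ktime.h>

	static void nop_ipi(void *info)
	{
		/* do nothing; we only want the round-trip cost */
	}

	static int __init ipi_bench_init(void)
	{
		int cpu = 1;		/* assumed remote and online */
		int i, iters = 10000;
		ktime_t t0, t1;

		t0 = ktime_get();
		for (i = 0; i < iters; i++)
			smp_call_function_single(cpu, nop_ipi, NULL, 1);
		t1 = ktime_get();

		printk(KERN_INFO "%d synchronous IPI round trips: %lld ns total\n",
		       iters, ktime_to_ns(ktime_sub(t1, t0)));
		return 0;
	}
	module_init(ipi_bench_init);
	MODULE_LICENSE("GPL");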

> Some general information is always in the Intel optimization guide.

Thanks Andi!

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
