Date:	Sun, 18 Apr 2010 11:39:33 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	hadi@...erus.ca
Cc:	Changli Gao <xiaosuo@...il.com>, Rick Jones <rick.jones2@...com>,
	David Miller <davem@...emloft.net>, therbert@...gle.com,
	netdev@...r.kernel.org, robert@...julf.net, andi@...stfloor.org
Subject: Re: rps performance WAS(Re: rps: question)

On Saturday, 17 April 2010 at 13:31 -0400, jamal wrote:
> On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:
> 
> > I did some tests on a dual quad core machine (E5450  @ 3.00GHz), not
> > nehalem. So a 3-4 years old design.
> 
> Eric, I thank you kind sir for going out of your way to do this - it is
> certainly a good processor to compare against 
> 
> > For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
> > 192.168.0.2". Yes, ping is not very good, but it's available ;)
> 
> It is a reasonably quick test, no fancy setup required ;->
> 
> > Note: I make sure all 8 cpus of the target are busy, eating cpu cycles in
> > user land.
> 
> I didn't keep the cpus busy. I should re-run with such a setup; any
> specific app that you used to keep them busy? Keeping them busy could
> have consequences; I am speculating you probably ended up having a greater
> than one packet/IPI ratio, i.e. an amortization benefit...

No, only one packet per IPI: since I set my tg3 coalescing parameter
to the minimum value, I received one packet per interrupt.
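
For reference, tg3 coalescing is normally adjusted with ethtool -C; the
values below are only an illustration of a "minimum coalescing" setting,
not necessarily the exact ones used for this test:

# ask for an interrupt per received frame, with no coalescing delay
ethtool -C eth3 rx-usecs 0 rx-frames 1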

The specific app is:

# start 8 shell busy loops, one per cpu, to keep every core spinning in user land
for f in `seq 1 8`; do while :; do :; done& done
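
To stop them again afterwards, from the same shell that started them:

kill $(jobs -p)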


>   
> > I don't want to tweak acpi or whatever smart power-saving
> > mechanisms.
> 
> I should mention I turned off acpi as well in the BIOS; it was consuming
> more cpu cycles than net processing and was interfering with my tests.
> 
> > When RPS off
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
> > 
> > RPS on, but directed on the cpu0 handling device interrupts (tg3, napi)
> > (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
> > 
> > So the cost of queueing the packet onto our own queue (netif_receive_skb
> > -> enqueue_to_backlog) is about 0.74 us ((4234 - 4160) ms / 100000 packets)
> > 
> 
> Excellent analysis.
> 
> > I personally think we should process the packet instead of queueing it, but
> > Tom disagrees with me.
> 
> Sorry - I am gonna have to turn on some pedagogy and offer my
> Canadian 2 cents ;->
> I would lean toward agreeing with Tom, but maybe go one step further (sans
> packet reordering): we should never process packets up to the socket layer
> on the demuxing cpu.
> Enqueue everything you receive on a different cpu - so somehow the receiving
> cpu becomes part of the hashing decision ...
> 
> The reason is derived from queueing theory - of which I know dangerously
> little - but I refer you to mr. little his-self[1] (pun fully
> intended ;->):
> i.e. a fixed service time provides more predictable results, as opposed to
> an occasional spike whenever you receive packets destined to "our cpu".
> Queueing packets and later allocating cycles to processing them adds
> variability, but is not as bad as processing to completion up to the socket
> layer.
> 
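(For reference, the law being alluded to is presumably Little's law, which in
its usual form reads

	L = lambda * W

with L the average number of packets in the system, lambda the arrival rate,
and W the average time a packet spends queued plus being served. For a given
arrival rate, a more variable service time pushes W up, and with it the
backlog - which is the variability argument being made above.)
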
> > RPS on, directed on cpu1 (other socket)
> > (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> > 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
> 
> Good test - it should be the worst-case scenario. But there are two other
> scenarios which will give different results in my opinion.
> On your setup I think each socket has two dies, each with two cores. So
> my feeling is you will get different numbers if you go within the same die
> and across dies within the same socket. If I am not mistaken, the mapping
> would be something like socket0/die0{core0/2}, socket0/die1{core4/6},
> socket1/die0{core1/3}, socket1/die1{core5/7}.
> If you have cycles, can you try the same-socket+die-but-different-cores
> and same-socket-but-different-die tests?

Sure, let's redo a full test, taking the lowest time of three ping runs for
each mask (a scripted version of the sweep is sketched after the results
below).


echo 00 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4151ms

echo 01 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4254ms

echo 02 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 04 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4458ms

echo 08 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4563ms

echo 10 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4327ms

echo 20 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4571ms

echo 40 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4472ms

echo 80 >/sys/class/net/eth3/queues/rx-0/rps_cpus
100000 packets transmitted, 100000 received, 0% packet loss, time 4568ms
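
For anyone who wants to redo this, here is a sketch of the sweep as a small
script; the interface name (eth3), the target address and the mask list are
just the values from this setup, and the time is scraped from the ping
summary line:

#!/bin/sh
# For each rps_cpus mask, run the flood ping three times and keep the
# best (lowest) total time reported by ping, as in the results above.
for mask in 00 01 02 04 08 10 20 40 80; do
	echo $mask > /sys/class/net/eth3/queues/rx-0/rps_cpus
	best=
	for run in 1 2 3; do
		t=`ping -f -q -c 100000 192.168.0.2 | sed -n 's/.*time \([0-9]*\)ms.*/\1/p'`
		if [ -z "$best" ] || [ "$t" -lt "$best" ]; then
			best=$t
		fi
	done
	echo "mask $mask: best time ${best}ms"
done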


# egrep "physical id|core|apicid" /proc/cpuinfo 
physical id	: 0
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0

physical id	: 1
core id		: 0
cpu cores	: 4
apicid		: 4
initial apicid	: 4

physical id	: 0
core id		: 2
cpu cores	: 4
apicid		: 2
initial apicid	: 2

physical id	: 1
core id		: 2
cpu cores	: 4
apicid		: 6
initial apicid	: 6

physical id	: 0
core id		: 1
cpu cores	: 4
apicid		: 1
initial apicid	: 1

physical id	: 1
core id		: 1
cpu cores	: 4
apicid		: 5
initial apicid	: 5

physical id	: 0
core id		: 3
cpu cores	: 4
apicid		: 3
initial apicid	: 3

physical id	: 1
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
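
To decode that a bit more directly (the egrep above drops the processor
numbers, so here is a small awk over the full /proc/cpuinfo instead; going
by the order of the blocks above, cpus 0/2/4/6 sit on socket 0 and cpus
1/3/5/7 on socket 1):

awk -F': *' '/^processor/   { cpu  = $2 }
             /^physical id/ { sock = $2 }
             /^core id/     { core = $2 }
             /^apicid/      { printf "cpu%s: socket %s, core %s, apicid %s\n",
                                     cpu, sock, core, $2 }' /proc/cpuinfo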



