[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <1271876480.7895.3106.camel@edumazet-laptop>
Date:	Wed, 21 Apr 2010 21:01:20 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	hadi@...erus.ca
Cc:	Changli Gao <xiaosuo@...il.com>, Rick Jones <rick.jones2@...com>,
	David Miller <davem@...emloft.net>, therbert@...gle.com,
	netdev@...r.kernel.org, robert@...julf.net, andi@...stfloor.org
Subject: Re: rps perfomance WAS(Re: rps: question
Le mercredi 21 avril 2010 à 08:39 -0400, jamal a écrit :
> On Tue, 2010-04-20 at 15:13 +0200, Eric Dumazet wrote:
> 
> 
> > I think your tests are very interesting, maybe could you publish them
> > somehow ? (I forgot to thank you about the previous report and nice
> > graph)
> > perf reports would be good too to help to spot hot points.
> 
> Ok ;->
> Let me explain my test setup (which some app types may gasp at;->):
> 
> SUT(system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core). 
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
> which forks at most a process per detected cpu and binds to a different
> udp port on each processor.
> Traffic generator sent to SUT upto 750Kpps of udp packets round-robbin
> and varied the destination port to select a different flow on each of
> the outgoing packets. I could further increment the number of flows by
> varying the source address and source port number but in the end i 
> settled down to fixed srcip/srcport/destinationip and just varied the
> port number in order to simplify results collection.
> For rps i selected mask "ee" and bound interrupt to cpu0. ee leaves
> out cpu0 and cpu4 from the set of target cpus. Because Nehalem has SMT
> threads, cpu0 and cpu4 are SMT threads that reside on core0 and they
> steal execution cycles from each other - so i didnt want that to happen
> and instead tried to have as many of those cycles as possible for
> demuxing incoming packets.
> 
> Overall, in best case scenario rps had 5-7% better throughput than
> nonrps setup. It had upto 10% more cpu use and about 2-5% more latency.
> I am attaching some visualization of the way 8 flows were distributed
> around the different cpus. The diagrams show some samples - but what you
> see there was a good reflection of what i saw in many runs of the tests.
> Essentially, for localization is better with rps which gets better if
> you can somehow map the target cpus as selected by rps to what the app
> binds to.
> Ive also attached a small annotated perf output - sorry i didnt have
> time to dig deeper into the code; maybe later this week. I think my
> biggest problem in this setup was the sky2 driver or hardware poor
> ability to handle lots of traffic.
> 
> 
> cheers,
> jamal
> 
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf
Thanks a lot Jamal, this is really useful
Drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on SUT. This might explain some glitches
you noticed (ip_route_input + ip_rcv at high level on slave/application
cpus)
Also note your test is one way. If some data was replied we would see
much use of the 'flows'
I notice epoll_ctl() used a lot, are you re-arming epoll each time you
receive a datagram ?
I see slave/application cpus hit _raw_spin_lock_irqsave() and  
_raw_spin_unlock_irqrestore().
Maybe a ring buffer could help (instead of a double linked queue) for
backlog, or the double queue trick, if Changli wants to respin his
patch.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists
 
