Date:	Fri, 29 Jan 2010 00:02:19 -0800
From:	Andrew Dickinson <andrew@...dna.net>
To:	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>
Cc:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: receive-side performance issue (ixgbe, core-i7, softirq cpu%)

I might have misspoken about HPET.

The 4.6Mpps is with 2.6.32.4 vanilla, HPET on.

Either way, I'm happy now ;-P

-A

On Thu, Jan 28, 2010 at 10:06 PM, Andrew Dickinson <andrew@...dna.net> wrote:
> Short response: CONFIG_HPET was the dirty little bastard!
>
> Answering your questions below in case somebody else stumbles across
> this thread...
>
> On Thu, Jan 28, 2010 at 4:18 PM, Brandeburg, Jesse
> <jesse.brandeburg@...el.com> wrote:
>>
>>
>> On Thu, 28 Jan 2010, Andrew Dickinson wrote:
>>> I'm running into some unexpected performance issues.  I say
>>> "unexpected" because I was running the same tests on this same box 5
>>> months ago and getting very different (and much better) results.
>>
>>
>> can you try turning off cpuspeed service, C-States in BIOS, and GV3 (aka
>> speedstep) support in BIOS?
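>> Roughly, that means something like the following (the service name and
>> sysfs paths vary by distro and kernel, so treat this as a sketch):
>>
>>     # stop the userspace frequency-scaling service
>>     service cpuspeed stop
>>     chkconfig cpuspeed off
>>
>>     # check the governor, and force "performance" on every core
>>     cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>>     for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>>         echo performance > $g
>>     done
>>
>> C-states and GV3/SpeedStep themselves are switched off in the BIOS setup.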
>
> Yup, everything's on "maximum performance" in my BIOS's vernacular (HP
> GL360g6): no C-states, etc.
>
>> Have you upgraded your BIOS since before?
>
> Not that I'm aware of, but our provisioning folks might have done
> something crazy.
>
>> I agree you should be able to see better numbers, I suspect that you are
>> getting cross-cpu traffic that is limiting your throughput.
>
> That's what I would have suspected as well.
>
>> How many flows are you pushing?
>
> I'm pushing two streams of traffic, one in each direction.  Each
> stream is defined as follows:
>    North-bound:
>        L2: a0a0a0a0a0a0 -> b0b0b0b0b0b0
>        L3: RAND(10.0.0.0/16) -> RAND(100.0.0.0/16)
>        L4: UDP with random data
>    South-bound is the reverse.
>
>    where "RAND(CIDR)" is a random address within that CIDR (I'm using
> a hardware traffic generator).
>
>> Another idea is to compile the "perf" tool in the tools/perf directory of
>> the kernel and run "perf record -a -- sleep 10" while running at steady
>> state.  then show output of perf report to get an idea of which functions
>> are eating all the cpu time.
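>> For reference, that is roughly (run from the top of the kernel source
>> tree; build dependencies differ a bit between kernel versions):
>>
>>     cd tools/perf && make            # build the perf tool
>>     ./perf record -a -- sleep 10     # sample all CPUs for 10 seconds
>>     ./perf report                    # see which functions eat the time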
>>
>> did you change to the "tickless" kernel?  We've also found that routing
>> performance improves dramatically by disabling the tickless and
>> preemptible kernel options and setting HZ=100.  What about CONFIG_HPET?
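>> A quick way to check how a .config stands on those knobs (option names
>> are from 2.6.3x-era x86 configs, so adjust as needed):
>>
>>     grep -E 'CONFIG_NO_HZ|CONFIG_HZ=|CONFIG_PREEMPT_|CONFIG_HPET' .config
>>
>> For the tuning described above you would expect roughly:
>>
>>     # CONFIG_NO_HZ is not set     <- tickless off
>>     CONFIG_PREEMPT_NONE=y         <- no kernel preemption
>>     CONFIG_HZ=100
>>     # CONFIG_HPET is not set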
>
> yes, yes, yes, and no...
>
> changed CONFIG_HPET to n, rebooted and retested....
>
> ta-da!
>
>> You should try the kernel that the scheduler fixes went into (maybe 31?)
>> or at least try 2.6.32.6 so you've tried something fully up to date.
>
> I'll give it a whirl :D
>
>>> === Background ===
>>>
>>> The box is a dual Core i7 box with a pair of Intel 82598EB's.  I'm
>>> running 2.6.30 with the in-kernel ixgbe driver.  My tests 5 months ago
>>> were using 2.6.30-rc3 (with a tiny patch from David Miller as seen
>>> here: http://kerneltrap.org/mailarchive/linux-netdev/2009/4/30/5605924).
>>>  The box is configured with both NICs in a bridge; normally I'm doing
>>> some packet processing using ebtables, but for the sake of keeping
>>> things simple, I'm not doing anything special... just straight bridging
>>> (no ebtables rules, etc).  I'm not running irqbalance and instead
>>> pinning my interrupts, one per core.  I've re-read and double checked
>>> various settings based on Intel's README (i.e. gso off, tso off, etc).
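>>>
>>> For completeness, the setup amounts to roughly the following (interface
>>> names and the IRQ number are illustrative; the pinning is repeated once
>>> per queue vector):
>>>
>>>     brctl addbr br0
>>>     brctl addif br0 eth0
>>>     brctl addif br0 eth1
>>>     ip link set br0 up
>>>
>>>     ethtool -K eth0 tso off gso off     # likewise for eth1
>>>     service irqbalance stop
>>>     echo 1 > /proc/irq/65/smp_affinity  # pin one queue's IRQ to core 0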
>>>
>>> In my previous tests, I was able to pass 3+Mpps regardless of how that
>>> was divided across the two NICs (i.e. 3Mpps all in one direction,
>>> 1.5Mpps in each direction simultaneously, etc).  Now, I'm hardly able
>>> to exceed about 750kpps x 2 (i.e. 750k in both directions), and I
>>> can't do more than 750kpps in one direction even when the other
>>> direction has no traffic.
>>>
>>> Unfortunately, I didn't take very good notes when I did this last time
>>> so I don't have my previous .config and I'm not 100% positive I've got
>>> identical ethtool settings, etc.  That being said, I've worked through
>>> seemingly every combination of factors that I can think of and I'm
>>> still unable to see the old performance (NUMA on/off, Hyperthreading
>>> on/off, various irq coalescing settings, etc).
>>>
>>> I have two identical boxes and they both see the same thing, so a
>>> hardware issue seems unlikely.  My next step is to grab 2.6.30-rc3 and
>>> see if I can repro the good performance with that kernel again and
>>> determine if there was a regression between 2.6.30-rc3 and 2.6.30...
>>> but I'm skeptical that that's the issue since I'm sure other people
>>> would have noticed this as well.
>>>
>>>
>>> === What I'm seeing ===
>>>
>>> CPU% (almost entirely softirq time, which is expected) ramps extremely
>>> quickly as the packet rate increases.  In the following table the left
>>> column is the packet rate ("150 x 2" means 150kpps in each direction
>>> simultaneously) and the right column is the CPU utilization (as
>>> measured by %si in top).
>>>
>>> 150 x 2:   4%
>>> 300 x 2:   8%
>>> 450 x 2:  18%
>>> 483 x 2:  50%
>>> 525 x 2:  66%
>>> 600 x 2:  85%
>>> 750 x 2: 100% (and dropping frames)
>>>
>>> I _am_ seeing interrupts getting spread nicely across cores, so in the
>>> "150 x 2" case that's about 4% soft-interrupt time on each of the 16
>>> cores.  The CPUs are otherwise idle, bar a small amount of hardware
>>> interrupt time (less than 1%).
>>>
>>>
>>> === Where it gets weird... ===
>>>
>>> Trying to isolate the problem, I added an ebtables rule to drop
>>> everything on the forward chain.  I was expecting to see the CPU
>>> utilization drop since I'd no longer be dealing with the TX-side... no
>>> change.
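>>>
>>> (For clarity, that rule is just a blanket drop on the bridge forward
>>> path, roughly:
>>>
>>>     ebtables -A FORWARD -j DROP
>>>
>>> so nothing should ever reach the TX side.)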
>>>
>>> I then decided to switch from a bridge to a route-based solution.  I
>>> tore down the bridge, enabled ip_forward, set up some IPs and route
>>> entries, etc.  Nothing changed.  CPU performance is identical to
>>> what's shown above.  Additionally, if I add an iptables drop on
>>> FORWARD, the CPU utilization remains unchanged (just like in the
>>> bridging case above).
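>>>
>>> As a rough sketch of what "route-based" means here (addresses are
>>> illustrative, picked to match the test streams above):
>>>
>>>     echo 1 > /proc/sys/net/ipv4/ip_forward
>>>     ip addr add 10.0.1.1/16 dev eth0
>>>     ip addr add 100.0.1.1/16 dev eth1
>>>     iptables -A FORWARD -j DROP      # the drop test mentioned above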
>>>
>>> The point that [I think] I'm driving to is that there's something
>>> fishy going on with the receive-side of the packets.  I wish I could
>>> point to something more specific or a section of code, but I haven't
>>> been able to pare this down to anything more granular in my testing.
>>>
>>>
>>> === Questions ===
>>>
>>> Has anybody seen this before?  If so, what was wrong?
>>> Do you have any recommendations on things to try (either as guesses
>>> or, even better, to help eliminate possibilities)?
>>> And along those lines... can anybody think of any possible reasons for this?
>>
>> hope the above helped.
>>
>>> This is so frustrating since I _know_ this hardware is capable of so
>>> much more.  It's relatively painless for me to re-run tests in my lab,
>>> so feel free to throw something at me that you think will stick :D
>>
>> Last I checked, I recall that with 82599 I was pushing ~4.5 million
>> 64-byte packets a second (bidirectional, no drop), after disabling
>> irqbalance and with 16 tx/rx queues pinned using the set_irq_affinity.sh
>> script (available in our ixgbe-foo.tar.gz from sourceforge).  82598
>> should be a bit lower, but probably can get close to that number.
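>>
>> A quick sanity check for that setup is to confirm the per-queue vectors
>> really are spread out (interface name is illustrative, and the affinity
>> script's invocation may differ between driver releases):
>>
>>     service irqbalance stop
>>     ./set_irq_affinity.sh eth0     # pin each queue's vector to its own core
>>     grep eth0 /proc/interrupts     # counts should grow on different CPUs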
>>
>> I haven't run the test lately though, but at that point I was likely on
>> 2.6.30-ish.
>>
>> Jesse
>>
>
> Thank you so much... I wish I'd sent this email out a week ago ;-P
>
> -A
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
