Message-ID: <49FAA55D.7070406@cosmosbay.com>
Date: Fri, 01 May 2009 09:31:41 +0200
From: Eric Dumazet <dada1@...mosbay.com>
To: Andrew Dickinson <andrew@...dna.net>
CC: David Miller <davem@...emloft.net>, jelaas@...il.com,
netdev@...r.kernel.org
Subject: Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)
Andrew Dickinson wrote:
> On Thu, Apr 30, 2009 at 11:40 PM, Eric Dumazet <dada1@...mosbay.com> wrote:
>> Andrew Dickinson wrote:
>>> On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@...mosbay.com> wrote:
>>>> Andrew Dickinson wrote:
>>>>> OK... I've got some more data on it...
>>>>>
>>>>> I passed a small number of packets through the system and added a ton
>>>>> of printks to it ;-P
>>>>>
>>>>> Here's the distribution of values as seen by
>>>>> skb_rx_queue_recorded()... count on the left, value on the right:
>>>>> 37 0
>>>>> 31 1
>>>>> 31 2
>>>>> 39 3
>>>>> 37 4
>>>>> 31 5
>>>>> 42 6
>>>>> 39 7
>>>>>
>>>>> That's nice and even.... Here's what's getting returned from the
>>>>> skb_tx_hash(). Again, count on the left, value on the right:
>>>>> 31 0
>>>>> 81 1
>>>>> 37 2
>>>>> 70 3
>>>>> 37 4
>>>>> 31 6
>>>>>
>>>>> Note that we're entirely missing 5 and 7 and that those interrupts
>>>>> seem to have gotten munged onto 1 and 3.
>>>>>
>>>>> I think the voodoo lies within:
>>>>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>>>>
>>>>> David, I made the change that you suggested:
>>>>> //hash = skb_get_rx_queue(skb);
>>>>> return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>
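
For reference, the "return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32)" line quoted
above maps a 32-bit hash onto the TX queue range by scaling rather than by a modulo. A minimal
userspace sketch (function names are illustrative, not the kernel's) contrasts the two mappings
and shows why the scaling form only makes sense after a mixing step such as jhash:

#include <stdint.h>
#include <stdio.h>

/* Scaling map used after jhash: spreads a well-mixed 32-bit hash
 * over n buckets without a divide. */
static uint16_t scale_map(uint32_t hash, uint16_t n)
{
	return (uint16_t)(((uint64_t)hash * n) >> 32);
}

/* Direct map suggested above for recorded RX queues: the input is
 * already a small queue index, so just wrap it into the TX range. */
static uint16_t modulo_map(uint32_t rx_queue, uint16_t n)
{
	return (uint16_t)(rx_queue % n);
}

int main(void)
{
	uint16_t n = 8;
	uint32_t q;

	for (q = 0; q < 8; q++)
		printf("rx queue %u -> scale %u, modulo %u\n",
		       (unsigned)q, (unsigned)scale_map(q, n),
		       (unsigned)modulo_map(q, n));
	/* scale_map() of a raw small index is always 0 here; in the stock
	 * code the index is first passed through jhash_1word(), which
	 * spreads it, but with only 8 static inputs not evenly over 8
	 * queues. */
	return 0;
}
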
>>>>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>>>>
>>>>> However, my problem's not solved entirely... here's what top is showing me:
>>>>> top - 23:37:49 up 9 min, 1 user, load average: 3.93, 2.68, 1.21
>>>>> Tasks: 119 total, 5 running, 114 sleeping, 0 stopped, 0 zombie
>>>>> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
>>>>> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st
>>>>> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
>>>>> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 4.3%hi, 95.7%si, 0.0%st
>>>>> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.3%hi, 0.3%si, 0.0%st
>>>>> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 2.0%id, 0.0%wa, 4.0%hi, 94.0%si, 0.0%st
>>>>> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
>>>>> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 5.6%id, 0.0%wa, 2.3%hi, 92.1%si, 0.0%st
>>>>> Mem: 16403476k total, 335884k used, 16067592k free, 10108k buffers
>>>>> Swap: 2096472k total, 0k used, 2096472k free, 146364k cached
>>>>>
>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
>>>>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
>>>>>    19 root      15  -5     0    0    0 R  97.8  0.0   5:34.52 ksoftirqd/5
>>>>>    25 root      15  -5     0    0    0 R  94.5  0.0   5:13.56 ksoftirqd/7
>>>>> 3905 root 20 0 12612 1084 820 R 0.3 0.0 0:00.14 top
>>>>> <snip>
>>>>>
>>>>>
>>>>> It appears that only the odd CPUs are actually handling the
>>>>> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>>>>>            CPU0     CPU1     CPU2     CPU3     CPU4     CPU5     CPU6     CPU7
>>>>>  66:    2970565        0        0        0        0        0        0        0   PCI-MSI-edge  eth2-rx-0
>>>>>  67:         28   821122        0        0        0        0        0        0   PCI-MSI-edge  eth2-rx-1
>>>>>  68:         28        0  2943299        0        0        0        0        0   PCI-MSI-edge  eth2-rx-2
>>>>>  69:         28        0        0   817776        0        0        0        0   PCI-MSI-edge  eth2-rx-3
>>>>>  70:         28        0        0        0  2963924        0        0        0   PCI-MSI-edge  eth2-rx-4
>>>>>  71:         28        0        0        0        0   821032        0        0   PCI-MSI-edge  eth2-rx-5
>>>>>  72:         28        0        0        0        0        0  2979987        0   PCI-MSI-edge  eth2-rx-6
>>>>>  73:         28        0        0        0        0        0        0   845422   PCI-MSI-edge  eth2-rx-7
>>>>>  74:    4664732        0        0        0        0        0        0        0   PCI-MSI-edge  eth2-tx-0
>>>>>  75:         34  4679312        0        0        0        0        0        0   PCI-MSI-edge  eth2-tx-1
>>>>>  76:         28        0  4665014        0        0        0        0        0   PCI-MSI-edge  eth2-tx-2
>>>>>  77:         28        0        0  4681531        0        0        0        0   PCI-MSI-edge  eth2-tx-3
>>>>>  78:         28        0        0        0  4665793        0        0        0   PCI-MSI-edge  eth2-tx-4
>>>>>  79:         28        0        0        0        0  4671596        0        0   PCI-MSI-edge  eth2-tx-5
>>>>>  80:         28        0        0        0        0        0  4665279        0   PCI-MSI-edge  eth2-tx-6
>>>>>  81:         28        0        0        0        0        0        0  4664504   PCI-MSI-edge  eth2-tx-7
>>>>>  82:          2        0        0        0        0        0        0        0   PCI-MSI-edge  eth2:lsc
>>>>>
>>>>>
>>>>> Why would ksoftirqd only run on half of the cores (and only the odd
>>>>> ones to boot)? The one commonality that strikes me is that all the
>>>>> odd CPU numbers are on the same physical processor:
>>>>>
>>>>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>>>>> processor : 0
>>>>> physical id : 0
>>>>> processor : 1
>>>>> physical id : 1
>>>>> processor : 2
>>>>> physical id : 0
>>>>> processor : 3
>>>>> physical id : 1
>>>>> processor : 4
>>>>> physical id : 0
>>>>> processor : 5
>>>>> physical id : 1
>>>>> processor : 6
>>>>> physical id : 0
>>>>> processor : 7
>>>>> physical id : 1
>>>>>
>>>>> I did compile the kernel with NUMA support... am I being bitten by
>>>>> something there? Other thoughts on where I should look?
>>>>>
>>>>> Also... is there an incantation to get NAPI to work in the torvalds
>>>>> kernel? As you can see, I'm generating quite a few interrupts.
>>>>>
>>>>> -A
>>>>>
>>>>>
>>>>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@...emloft.net> wrote:
>>>>>> From: Andrew Dickinson <andrew@...dna.net>
>>>>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>>>>
>>>>>>> I'll do some debugging around skb_tx_hash() and see if I can make
>>>>>>> sense of it. I'll let you know what I find. My hypothesis is that
>>>>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>>>>> before I start making claims. ;-P
>>>>>> That's one possibility.
>>>>>>
>>>>>> Another is that the hashing isn't working out. One way to
>>>>>> play with that is to simply replace the:
>>>>>>
>>>>>> hash = skb_get_rx_queue(skb);
>>>>>>
>>>>>> in skb_tx_hash() with something like:
>>>>>>
>>>>>> return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>>>>
>>>>>> and see if that improves the situation.
>>>>>>
>>>> Hi Andrew
>>>>
>>>> Please try the following patch (I don't have a multi-queue NIC, sorry).
>>>>
>>>> I will do the follow-up patch if this one corrects the distribution problem
>>>> you noticed.
>>>>
>>>> Thanks very much for all your findings.
>>>>
>>>> [PATCH] net: skb_tx_hash() improvements
>>>>
>>>> When skb_rx_queue_recorded() is true, we don't want to use the jhash
>>>> distribution, because the device driver told us exactly which queue was
>>>> selected at RX time. jhash makes a statistical shuffle, but this won't
>>>> work with only 8 static inputs.
>>>>
>>>> A later improvement would be to compute the reciprocal value of
>>>> real_num_tx_queues to avoid the divide here. But that computation should
>>>> be done once, when real_num_tx_queues is set. It needs a separate patch
>>>> and a new field in struct net_device.
>>>>
>>>> Reported-by: Andrew Dickinson <andrew@...dna.net>
>>>> Signed-off-by: Eric Dumazet <dada1@...mosbay.com>
>>>>
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index 308a7d0..e2e9e4a 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>>>> {
>>>> u32 hash;
>>>>
>>>> - if (skb_rx_queue_recorded(skb)) {
>>>> - hash = skb_get_rx_queue(skb);
>>>> - } else if (skb->sk && skb->sk->sk_hash) {
>>>> + if (skb_rx_queue_recorded(skb))
>>>> + return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>> +
>>>> + if (skb->sk && skb->sk->sk_hash)
>>>> hash = skb->sk->sk_hash;
>>>> - } else
>>>> + else
>>>> hash = skb->protocol;
>>>>
>>>> hash = jhash_1word(hash, skb_tx_hashrnd);
>>>>
>>>>
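The commit message above suggests precomputing a reciprocal of real_num_tx_queues so the
per-packet path avoids a divide. Here is a standalone sketch of that idea (the kernel has its
own reciprocal_div helpers; the names and the demo below are mine):

#include <stdint.h>
#include <stdio.h>

/* Precompute ceil(2^32 / divisor); done once, when the queue count is set. */
static uint32_t reciprocal_of(uint32_t divisor)
{
	return (uint32_t)((((uint64_t)1 << 32) + divisor - 1) / divisor);
}

/* Per-packet path: one multiply and a shift approximate value / divisor. */
static uint32_t div_by_reciprocal(uint32_t value, uint32_t reciprocal)
{
	return (uint32_t)(((uint64_t)value * reciprocal) >> 32);
}

int main(void)
{
	uint32_t num_tx_queues = 8;
	uint32_t recip = reciprocal_of(num_tx_queues);
	uint32_t rxq;

	for (rxq = 0; rxq < 8; rxq++) {
		uint32_t quot = div_by_reciprocal(rxq, recip);
		/* rxq - quot * num_tx_queues == rxq % num_tx_queues */
		printf("rx queue %u -> tx queue %u\n",
		       (unsigned)rxq, (unsigned)(rxq - quot * num_tx_queues));
	}
	return 0;
}

For small inputs like an RX queue index this is exact; as the commit message notes, the
precomputed value would live in a new struct net_device field set alongside real_num_tx_queues.
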
>>> Eric,
>>>
>>> That's exactly what I did! It solved the problem of hot-spots on some
>>> interrupts. However, I now have a new problem (which is documented in
>>> my previous posts). The short of it is that I'm only seeing 4 (out of
>>> 8) ksoftirqds busy under heavy load... the other 4 seem idle. The
>>> busy 4 are always on one physical package (though not always the same
>>> package; it'll change on reboot or when I change some parameters via
>>> ethtool), but never both. This is despite /proc/interrupts showing me
>>> that all 8 interrupts are being hit evenly. There are more details in
>>> my last mail. ;-D
>>>
>> Well, I was reacting to your 'voodoo' comment about
>>
>> return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>
>> That line is not the problem. The problem comes from jhash(), which
>> shuffles the input, while in your case we want to select the same output
>> queue because of CPU affinities. No shuffle required.
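
Eric's point about 8 static inputs can be quantified: hashing 8 fixed values and mapping
them onto 8 queues behaves like throwing 8 balls into 8 bins, so some queues are expected
to stay empty. A quick back-of-the-envelope check (my own illustration, not from the thread):

#include <stdio.h>

int main(void)
{
	double n = 8.0;       /* queues, and also the number of static inputs */
	double miss = 1.0;    /* probability that a given queue is hit by none */
	int i;

	for (i = 0; i < 8; i++)
		miss *= (n - 1.0) / n;

	/* Expected distinct queues: n * (1 - (1 - 1/n)^n) ~= 5.25, in line
	 * with the 6 distinct skb_tx_hash() values observed earlier in the
	 * thread (queues 5 and 7 missing). */
	printf("expected distinct queues: %.2f of %.0f\n",
	       n * (1.0 - miss), n);
	return 0;
}
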
>
> Agreed. I don't want to jhash(), and I'm not.
>
>> (assuming cpu0 is handling tx-queue-0 and rx-queue-0,
>> cpu1 is handling tx-queue-1 and rx-queue-1, and so on...)
>
> That's a correct assumption. :D
>
>> Then /proc/interrupts shows your rx interrupts are not evenly distributed.
>>
>> Or ksoftirqd is triggered only on one physical CPU, while on the other
>> CPU the softirqs are not run from ksoftirqd. It's only a matter of load.
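
To expand on "only a matter of load": softirqs raised by an interrupt are normally drained
right after the hard IRQ handler returns, and only when they keep re-raising themselves past
a restart budget does the kernel hand the backlog to the per-CPU ksoftirqd thread. Below is a
simplified, self-contained model of that behaviour (constants and names are illustrative, not
the actual kernel code):

#include <stdbool.h>
#include <stdio.h>

#define MAX_SOFTIRQ_RESTART 10   /* restart budget before deferring */

static int pending_packets;

static void handle_pending_softirqs(int budget)
{
	pending_packets -= budget;
	if (pending_packets < 0)
		pending_packets = 0;
}

static bool softirqs_pending(void)
{
	return pending_packets > 0;
}

static void run_softirq(int arriving)
{
	int restart = MAX_SOFTIRQ_RESTART;

	pending_packets += arriving;
	do {
		handle_pending_softirqs(64);    /* e.g. one NAPI-sized batch */
	} while (softirqs_pending() && --restart);

	if (softirqs_pending())
		printf("backlog of %d left, wake ksoftirqd on this CPU\n",
		       pending_packets);
	else
		printf("drained inline, ksoftirqd stays idle\n");
}

int main(void)
{
	run_softirq(300);     /* light load: drained on IRQ return */
	run_softirq(2000);    /* heavy load: spills into ksoftirqd */
	return 0;
}

That is why ksoftirqd/N only accumulates CPU time on the saturated cores, even though all
eight queues are taking interrupts.
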
>
> Hrmm... more fuel for the fire...
>
> The NIC seems to be doing a good job of hashing the incoming data and
> the kernel is now finding the right TX queue:
> -bash-3.2# ethtool -S eth2 | grep -vw 0 | grep packets
> rx_packets: 1286009099
> tx_packets: 1287853570
> tx_queue_0_packets: 162469405
> tx_queue_1_packets: 162452446
> tx_queue_2_packets: 162481160
> tx_queue_3_packets: 162441839
> tx_queue_4_packets: 162484930
> tx_queue_5_packets: 162478402
> tx_queue_6_packets: 162492530
> tx_queue_7_packets: 162477162
> rx_queue_0_packets: 162469449
> rx_queue_1_packets: 162452440
> rx_queue_2_packets: 162481186
> rx_queue_3_packets: 162441885
> rx_queue_4_packets: 162484949
> rx_queue_5_packets: 162478427
>
> Here's where it gets juicy. If I reduce the rate at which I'm pushing
> traffic to a 0-loss level (in this case about 2.2 Mpps), then top looks
> as follows:
> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu7 : 0.0%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
>
> And if I watch /proc/interrupts, I see that all of the tx and rx
> queues are handling a fairly similar number of interrupts (ballpark,
> 7-8k/sec on rx, 10k on tx).
>
> OK... now let me double the packet rate (to about 4.4Mpps), top looks like this:
>
> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 1.9%id, 0.0%wa, 5.5%hi, 92.5%si, 0.0%st
> Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 0.0%hi, 1.3%si, 0.0%st
> Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 2.3%id, 0.0%wa, 4.9%hi, 92.9%si, 0.0%st
> Cpu3 : 0.0%us, 0.3%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.0%hi, 1.9%si, 0.0%st
> Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 5.2%id, 0.0%wa, 5.2%hi, 89.6%si, 0.0%st
> Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 97.7%id, 0.0%wa, 0.3%hi, 1.9%si, 0.0%st
> Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 0.3%id, 0.0%wa, 4.9%hi, 94.8%si, 0.0%st
> Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
>
> And if I watch /proc/interrupts again, I see that the even-CPU (i.e.
> 0, 2, 4, and 6) RX queues are receiving relatively few interrupts
> (5-ish/sec; not 5k, just 5) and the odd-CPU RX queues are
> receiving about 2-3k/sec. What's extra strange is that the TX queues
> are still handling about 10k/sec each.
>
> So, below some magic threshold (approx 2.3 Mpps), the box is basically
> idle and happily routing all the packets (I can confirm that my
> network test device, an Ixia, is showing 0 loss). Above the magic
> threshold, the box starts acting as described above and I'm unable to
> push it beyond that threshold. While I understand that there are
> limits to how fast I can route packets (obviously), it seems very
> strange that I'm seeing this physical-CPU affinity on the ksoftirqd
> "processes".
>
The box is not idle; you hit a kernel bug that I already corrected this week :)
Search for "sched: account system time properly". Without that fix, timer ticks
that land while the idle task is servicing an interrupt or softirq are accounted
as idle time instead of system time, so top under-reports the softirq load.

diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..26efa47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4732,7 +4732,7 @@ void account_process_tick(struct task_struct *p, int user_tick)
if (user_tick)
account_user_time(p, one_jiffy, one_jiffy_scaled);
- else if (p != rq->idle)
+ else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
one_jiffy_scaled);
else
> Here's how fragile this "magic threshold" is... 2.292 Mpps, box looks
> idle, 0 loss. 2.300 Mpps, even-CPU ksoftirqd processes at 50%-ish.
> 2.307 Mpps, even-CPU ksoftirqd processes at 75%. 2.323 Mpps, even-CPU
> ksoftirqd processes at 100%. Never during this did the odd-CPU
> ksoftirqd processes show any utilization at all.
>
> These are 64-byte frames, so I shouldn't be hitting any bandwidth
> issues that I'm aware of: 1.3 Gbps in and 1.3 Gbps out (same NIC; I'm
> just routing packets back out the one NIC).
>
> =/
>