netdev - Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

Message-ID: <606676310904302319u1eacc634qde4b1f70e9936779@mail.gmail.com>
Date:	Thu, 30 Apr 2009 23:19:37 -0700
From:	Andrew Dickinson <andrew@...dna.net>
To:	Eric Dumazet <dada1@...mosbay.com>
Cc:	David Miller <davem@...emloft.net>, jelaas@...il.com,
	netdev@...r.kernel.org
Subject: Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

On Thu, Apr 30, 2009 at 11:14 PM, Eric Dumazet <dada1@...mosbay.com> wrote:
> Andrew Dickinson wrote:
>> OK... I've got some more data on it...
>>
>> I passed a small number of packets through the system and added a ton
>> of printks to it ;-P
>>
>> Here's the distribution of values as seen by
>> skb_rx_queue_recorded()... count on the left, value on the right:
>>      37 0
>>      31 1
>>      31 2
>>      39 3
>>      37 4
>>      31 5
>>      42 6
>>      39 7
>>
>> That's nice and even....  Here's what's getting returned from the
>> skb_tx_hash().  Again, count on the left, value on the right:
>>      31 0
>>      81 1
>>      37 2
>>      70 3
>>      37 4
>>      31 6
>>
>> Note that we're entirely missing 5 and 7 and that those interrupts
>> seem to have gotten munged onto 1 and 3.
>>
>> I think the voodoo lies within:
>>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
>>
>> David,  I made the change that you suggested:
>>         //hash = skb_get_rx_queue(skb);
>>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> And now, I see a nice even mixing of interrupts on the TX side (yay!).
>>
>> However, my problem's not solved entirely... here's what top is showing me:
>> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
>> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
>> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
>> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
>> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
>> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
>> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
>> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
>> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
>> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
>>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
>>    19 root      15  -5     0    0    0 R 97.8  0.0   5:34.52 ksoftirqd/5
>>    25 root      15  -5     0    0    0 R 94.5  0.0   5:13.56 ksoftirqd/7
>>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
>> <snip>
>>
>>
>> It appears that only the odd CPUs are actually handling the
>> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>>   66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
>>   67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
>>   68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
>>   69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge   eth2-rx-3
>>   70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge   eth2-rx-4
>>   71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge   eth2-rx-5
>>   72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge   eth2-rx-6
>>   73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge   eth2-rx-7
>>   74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
>>   75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
>>   76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
>>   77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge   eth2-tx-3
>>   78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge   eth2-tx-4
>>   79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge   eth2-tx-5
>>   80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge   eth2-tx-6
>>   81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge   eth2-tx-7
>>   82:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc
>>
>>
>> Why would ksoftirqd only run on half of the cores (and only the odd
>> ones to boot)?  The one commonality that strikes me is that
>> all the odd CPU#'s are on the same physical processor:
>>
>> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
>> processor     : 0
>> physical id   : 0
>> processor     : 1
>> physical id   : 1
>> processor     : 2
>> physical id   : 0
>> processor     : 3
>> physical id   : 1
>> processor     : 4
>> physical id   : 0
>> processor     : 5
>> physical id   : 1
>> processor     : 6
>> physical id   : 0
>> processor     : 7
>> physical id   : 1
>>
>> I did compile the kernel with NUMA support... am I being bitten by
>> something there?  Any other thoughts on where I should look?
>>
>> Also... is there an incantation to get NAPI to work in the torvalds
>> kernel?  As you can see, I'm generating quite a few interrupts.
>>
>> -A
>>
>>
>> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@...emloft.net> wrote:
>>> From: Andrew Dickinson <andrew@...dna.net>
>>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>>
>>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>>> before I start making claims. ;-P
>>> That's one possibility.
>>>
>>> Another is that the hashing isn't working out.  One way to
>>> play with that is to simply replace the:
>>>
>>>                hash = skb_get_rx_queue(skb);
>>>
>>> in skb_tx_hash() with something like:
>>>
>>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>>
>>> and see if that improves the situation.
>>>
>
> Hi Andrew
>
> Please try the following patch (I don't have a multi-queue NIC, sorry)
>
> I will do the followup patch if this one corrects the distribution problem
> you noticed.
>
> Thanks very much for all your findings.
>
> [PATCH] net: skb_tx_hash() improvements
>
> When skb_rx_queue_recorded() is true, we don't want to use jhash distribution
> as the device driver exactly told us which queue was selected at RX time.
> jhash makes a statistical shuffle, but this won't work with 8 static inputs.
>
> Later improvements would be to compute reciprocal value of real_num_tx_queues
> to avoid a divide here. But this computation should be done once,
> when real_num_tx_queues is set. This needs a separate patch, and a new
> field in struct net_device.
>
> Reported-by: Andrew Dickinson <andrew@...dna.net>
> Signed-off-by: Eric Dumazet <dada1@...mosbay.com>
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 308a7d0..e2e9e4a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
>  {
>        u32 hash;
>
> -       if (skb_rx_queue_recorded(skb)) {
> -               hash = skb_get_rx_queue(skb);
> -       } else if (skb->sk && skb->sk->sk_hash) {
> +       if (skb_rx_queue_recorded(skb))
> +               return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +
> +       if (skb->sk && skb->sk->sk_hash)
>                hash = skb->sk->sk_hash;
> -       } else
> +       else
>                hash = skb->protocol;
>
>        hash = jhash_1word(hash, skb_tx_hashrnd);
>
>
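
For readers following along, here is a minimal user-space sketch of why
hashing 8 fixed rx queue numbers and then scaling the result into 8 tx
queues can leave some tx queues unused, while a plain modulo cannot.  It is
an illustration only: mix32() is a stand-in mixer and NOT the kernel's
jhash_1word(), and every function name below is made up for this example.
If the hash output is assumed to be uniform, the expected number of empty
tx queues is 8 * (7/8)^8, roughly 2.7, which is in line with the two
missing queues (5 and 7) in the counts quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 64

/* Stand-in 32-bit mixer (assumption: NOT jhash_1word, illustration only). */
static uint32_t mix32(uint32_t x, uint32_t seed)
{
	x ^= seed;
	x ^= x >> 16;
	x *= 0x7feb352dU;
	x ^= x >> 15;
	x *= 0x846ca68bU;
	x ^= x >> 16;
	return x;
}

/* Scale a 32-bit hash into [0, n): the multiply-shift quoted in the thread. */
static uint16_t scale_hash(uint32_t hash, uint32_t n)
{
	return (uint16_t)(((uint64_t)hash * n) >> 32);
}

/* Direct mapping of the recorded rx queue, as in the patch. */
static uint16_t map_modulo(uint16_t rxq, uint32_t n)
{
	assert(n > 0);
	return (uint16_t)(rxq % n);
}

/* Count tx queues never chosen when rx queues 0..n-1 are hashed, then scaled. */
static unsigned int unused_after_hash(uint32_t n, uint32_t seed)
{
	unsigned int hits[MAX_QUEUES] = { 0 };
	unsigned int unused = 0;
	uint32_t q;

	assert(n > 0 && n <= MAX_QUEUES);
	for (q = 0; q < n; q++)
		hits[scale_hash(mix32(q, seed), n)]++;
	for (q = 0; q < n; q++)
		if (hits[q] == 0)
			unused++;
	return unused;
}

/* Same count for the modulo mapping: a bijection when rx and tx counts match. */
static unsigned int unused_after_modulo(uint32_t n)
{
	unsigned int hits[MAX_QUEUES] = { 0 };
	unsigned int unused = 0;
	uint32_t q;

	assert(n > 0 && n <= MAX_QUEUES);
	for (q = 0; q < n; q++)
		hits[map_modulo((uint16_t)q, n)]++;
	for (q = 0; q < n; q++)
		if (hits[q] == 0)
			unused++;
	return unused;
}
```

unused_after_modulo(8) is 0 by construction.  unused_after_hash(8, seed)
depends on the seed and on the mixer, so no particular value is claimed
here.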

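The patch description mentions precomputing a reciprocal of
real_num_tx_queues to avoid the divide.  A rough sketch of that idea follows.
It is not the kernel's reciprocal_div helpers and not the followup patch
itself; recip_of() and mod_by_recip() are hypothetical names.  The sketch
assumes the rx queue number fits in 16 bits and the queue count is at most
65535, in which case the multiply-shift quotient is exact.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Precompute ceil(2^32 / n).  This would be done once, whenever the
 * number of tx queues is set.  Kept in 64 bits because n == 1 gives 2^32.
 */
static uint64_t recip_of(uint32_t n)
{
	assert(n > 0 && n <= 65535);
	return ((UINT64_C(1) << 32) + n - 1) / n;
}

/* Compute a % n with a multiply and a shift instead of a divide. */
static uint16_t mod_by_recip(uint16_t a, uint32_t n, uint64_t recip)
{
	/* q == a / n for every 16-bit a under the assumptions above */
	uint32_t q = (uint32_t)(((uint64_t)a * recip) >> 32);

	return (uint16_t)(a - q * n);
}
```

With n = 8, recip_of(8) is 2^29, so mod_by_recip(13, 8, recip_of(8)) yields
13 - 1 * 8 = 5, the same as 13 % 8.
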
Eric,

That's exactly what I did!  It solved the problem of hot-spots on some
interrupts.  However, I now have a new problem (which is documented in
my previous posts).  The short of it is that I'm only seeing 4 (out of
8) ksoftirqd's busy under heavy load... the other 4 seem idle.  The
busy 4 are always on one physical package, never both (though not
always the same package: it'll change on reboot or when I change some
parameters via ethtool).  This is despite /proc/interrupts showing me
that all 8 interrupts are being hit evenly.  There are more details in
my last mail. ;-D
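
To make the "one physical package" observation easy to check, here is a
small sketch that maps each processor number to its physical id from
/proc/cpuinfo-style text.  parse_packages() is a hypothetical helper written
for this note, and it assumes each "physical id" line follows the
"processor" line it belongs to, as in the output quoted earlier.

```c
#include <stdio.h>
#include <string.h>

/*
 * Fill pkg_of_cpu[cpu] with the "physical id" reported for each
 * "processor" entry in text.  Returns the number of CPUs recorded.
 */
static int parse_packages(const char *text, int *pkg_of_cpu, int max_cpus)
{
	const char *line = text;
	int cpu = -1;
	int count = 0;

	while (line && *line) {
		int value;

		if (sscanf(line, "processor : %d", &value) == 1) {
			cpu = value;
		} else if (sscanf(line, "physical id : %d", &value) == 1) {
			if (cpu >= 0 && cpu < max_cpus) {
				pkg_of_cpu[cpu] = value;
				count++;
			}
		}
		/* advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return count;
}
```

Fed the cpuinfo output from my earlier mail, this puts CPUs 1, 3, 5 and 7 on
package 1 and CPUs 0, 2, 4 and 6 on package 0, which matches the set of busy
ksoftirqd threads in that run.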

-Andrew