netdev - Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

Message-ID: <49FA932B.4030405@cosmosbay.com>
Date:	Fri, 01 May 2009 08:14:03 +0200
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Andrew Dickinson <andrew@...dna.net>
CC:	David Miller <davem@...emloft.net>, jelaas@...il.com,
	netdev@...r.kernel.org
Subject: Re: tx queue hashing hot-spots and poor performance (multiq, ixgbe)

Andrew Dickinson wrote:
> OK... I've got some more data on it...
> 
> I passed a small number of packets through the system and added a ton
> of printks to it ;-P
> 
> Here's the distribution of values as seen by
> skb_rx_queue_recorded()... count on the left, value on the right:
>      37 0
>      31 1
>      31 2
>      39 3
>      37 4
>      31 5
>      42 6
>      39 7
> 
> That's nice and even....  Here's what's getting returned from the
> skb_tx_hash().  Again, count on the left, value on the right:
>      31 0
>      81 1
>      37 2
>      70 3
>      37 4
>      31 6
> 
> Note that we're entirely missing 5 and 7 and that those interrupts
> seem to have gotten munged onto 1 and 3.
> 
> I think the voodoo lies within:
>     return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
> 
> David,  I made the change that you suggested:
>         //hash = skb_get_rx_queue(skb);
>         return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> 
> And now, I see a nice even mixing of interrupts on the TX side (yay!).
> 
> However, my problem's not solved entirely... here's what top is showing me:
> top - 23:37:49 up 9 min,  1 user,  load average: 3.93, 2.68, 1.21
> Tasks: 119 total,   5 running, 114 sleeping,   0 stopped,   0 zombie
> Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
> Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  4.3%hi, 95.7%si,  0.0%st
> Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
> Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,  2.0%id,  0.0%wa,  4.0%hi, 94.0%si,  0.0%st
> Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,  5.6%id,  0.0%wa,  2.3%hi, 92.1%si,  0.0%st
> Mem:  16403476k total,   335884k used, 16067592k free,    10108k buffers
> Swap:  2096472k total,        0k used,  2096472k free,   146364k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>     7 root      15  -5     0    0    0 R 100.2  0.0   5:35.24 ksoftirqd/1
>    13 root      15  -5     0    0    0 R 100.2  0.0   5:36.98 ksoftirqd/3
>    19 root      15  -5     0    0    0 R  97.8  0.0   5:34.52 ksoftirqd/5
>    25 root      15  -5     0    0    0 R  94.5  0.0   5:13.56 ksoftirqd/7
>  3905 root      20   0 12612 1084  820 R  0.3  0.0   0:00.14 top
> <snip>
> 
> 
> It appears that only the odd CPUs are actually handling the
> interrupts, which doesn't jibe with what /proc/interrupts shows me:
>             CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   66:    2970565          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-0
>   67:         28     821122          0          0          0          0          0          0   PCI-MSI-edge   eth2-rx-1
>   68:         28          0    2943299          0          0          0          0          0   PCI-MSI-edge   eth2-rx-2
>   69:         28          0          0     817776          0          0          0          0   PCI-MSI-edge   eth2-rx-3
>   70:         28          0          0          0    2963924          0          0          0   PCI-MSI-edge   eth2-rx-4
>   71:         28          0          0          0          0     821032          0          0   PCI-MSI-edge   eth2-rx-5
>   72:         28          0          0          0          0          0    2979987          0   PCI-MSI-edge   eth2-rx-6
>   73:         28          0          0          0          0          0          0     845422   PCI-MSI-edge   eth2-rx-7
>   74:    4664732          0          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-0
>   75:         34    4679312          0          0          0          0          0          0   PCI-MSI-edge   eth2-tx-1
>   76:         28          0    4665014          0          0          0          0          0   PCI-MSI-edge   eth2-tx-2
>   77:         28          0          0    4681531          0          0          0          0   PCI-MSI-edge   eth2-tx-3
>   78:         28          0          0          0    4665793          0          0          0   PCI-MSI-edge   eth2-tx-4
>   79:         28          0          0          0          0    4671596          0          0   PCI-MSI-edge   eth2-tx-5
>   80:         28          0          0          0          0          0    4665279          0   PCI-MSI-edge   eth2-tx-6
>   81:         28          0          0          0          0          0          0    4664504   PCI-MSI-edge   eth2-tx-7
>   82:          2          0          0          0          0          0          0          0   PCI-MSI-edge   eth2:lsc
> 
> 
> Why would ksoftirqd only run on half of the cores (and only the odd
> ones to boot)?  The one commonality that's striking me is that
> all the odd CPU#'s are on the same physical processor:
> 
> -bash-3.2# cat /proc/cpuinfo | grep -E '(physical|processor)' | grep -v virtual
> processor	: 0
> physical id	: 0
> processor	: 1
> physical id	: 1
> processor	: 2
> physical id	: 0
> processor	: 3
> physical id	: 1
> processor	: 4
> physical id	: 0
> processor	: 5
> physical id	: 1
> processor	: 6
> physical id	: 0
> processor	: 7
> physical id	: 1
> 
> I did compile the kernel with NUMA support... am I being bitten by
> something there?  Any other thoughts on where I should look?
> 
> Also... is there an incantation to get NAPI to work in the torvalds
> kernel?  As you can see, I'm generating quite a few interrupts.
> 
> -A
> 
> 
> On Thu, Apr 30, 2009 at 7:08 AM, David Miller <davem@...emloft.net> wrote:
>> From: Andrew Dickinson <andrew@...dna.net>
>> Date: Thu, 30 Apr 2009 07:04:33 -0700
>>
>>>  I'll do some debugging around skb_tx_hash() and see if I can make
>>> sense of it.  I'll let you know what I find.  My hypothesis is that
>>> skb_record_rx_queue() isn't being called, but I should dig into it
>>> before I start making claims. ;-P
>> That's one possibility.
>>
>> Another is that the hashing isn't working out.  One way to
>> play with that is to simply replace the:
>>
>>                hash = skb_get_rx_queue(skb);
>>
>> in skb_tx_hash() with something like:
>>
>>                return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
>>
>> and see if that improves the situation.
>>

Hi Andrew

Please try the following patch (I don't have a multiqueue NIC, sorry).

I will do the follow-up patch if this one corrects the distribution problem
you noticed.

Thanks very much for all your findings.

[PATCH] net: skb_tx_hash() improvements

When skb_rx_queue_recorded() is true, we don't want to use the jhash
distribution, as the device driver told us exactly which queue was selected
at RX time. jhash makes a statistical shuffle, but this won't work with
8 static inputs.

A later improvement would be to compute the reciprocal value of
real_num_tx_queues to avoid a divide here. But this computation should be
done once, when real_num_tx_queues is set. This needs a separate patch and
a new field in struct net_device.

Reported-by: Andrew Dickinson <andrew@...dna.net>
Signed-off-by: Eric Dumazet <dada1@...mosbay.com>

diff --git a/net/core/dev.c b/net/core/dev.c
index 308a7d0..e2e9e4a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1735,11 +1735,12 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
 
-	if (skb_rx_queue_recorded(skb)) {
-		hash = skb_get_rx_queue(skb);
-	} else if (skb->sk && skb->sk->sk_hash) {
+	if (skb_rx_queue_recorded(skb))
+		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
-	} else
+	else
 		hash = skb->protocol;
 
 	hash = jhash_1word(hash, skb_tx_hashrnd);
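
To illustrate why hashing a recorded RX queue can leave TX queues unused,
here is a minimal userspace sketch. It assumes mix32() as a hypothetical
stand-in for jhash_1word() (it is not the kernel's hash), and queues_hit()
is an illustrative helper, not kernel code. Only scale_to_queue() mirrors
the expression quoted from skb_tx_hash(). If a hash behaved like a uniformly
random function, 8 distinct inputs would land on 8 distinct queues with
probability 8!/8^8, roughly 0.24%, so collisions are the expected outcome.
The modulo mapping used by the patch always covers every queue when the
number of RX and TX queues match.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for jhash_1word(): any well-mixing 32-bit hash
 * shows the same effect. This is NOT the kernel's jhash. */
static uint32_t mix32(uint32_t x, uint32_t seed)
{
	x ^= seed;
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

/* Scaling step quoted from skb_tx_hash(): maps a 32-bit hash to [0, n). */
static uint16_t scale_to_queue(uint32_t hash, uint16_t n)
{
	return (uint16_t)(((uint64_t)hash * n) >> 32);
}

/* Mapping used by the patch when the RX queue was recorded. */
static uint16_t modulo_queue(uint16_t rx_queue, uint16_t n)
{
	assert(n != 0);
	return rx_queue % n;
}

/* Count how many distinct TX queues are selected for RX queues
 * 0..n_rx-1. use_hash != 0 selects hash + scale, 0 selects modulo.
 * Illustrative helper; the bitmap limits it to n_tx <= 64. */
static unsigned int queues_hit(uint16_t n_rx, uint16_t n_tx,
			       uint32_t seed, int use_hash)
{
	uint64_t seen = 0;
	unsigned int count = 0;
	uint16_t rx;

	assert(n_tx != 0 && n_tx <= 64);
	for (rx = 0; rx < n_rx; rx++) {
		uint16_t q = use_hash ? scale_to_queue(mix32(rx, seed), n_tx)
				      : modulo_queue(rx, n_tx);
		seen |= 1ULL << q;
	}
	for (; seen; seen &= seen - 1)	/* clear lowest set bit each pass */
		count++;
	return count;
}
```

Calling queues_hit(8, 8, seed, 1) for a handful of seeds and comparing the
result with queues_hit(8, 8, seed, 0) shows how often the hashed path
misses queues, while the modulo path reports all 8 every time.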

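The reciprocal idea from the changelog can be sketched in userspace C as
follows. This is only an illustration under stated assumptions, not the
follow-up patch: queue_reciprocal() and mod_by_reciprocal() are hypothetical
names, and the reciprocal is kept in 64 bits so that a queue count of 1 does
not overflow. The reciprocal would be computed once when the queue count is
set, and the per-packet path then needs only a multiply, a shift and a
subtract. For 16-bit inputs the result matches the % operator exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: precompute ceil(2^32 / n) once, at the time the
 * number of TX queues is set. */
static uint64_t queue_reciprocal(uint16_t n)
{
	assert(n != 0);
	return ((1ULL << 32) + n - 1) / n;
}

/* Hypothetical helper: rx_queue % n without a divide instruction.
 * quot equals rx_queue / n because both operands fit in 16 bits, so
 * the rounding error of the reciprocal never reaches the integer part. */
static uint16_t mod_by_reciprocal(uint16_t rx_queue, uint16_t n,
				  uint64_t recip)
{
	uint32_t quot = (uint32_t)(((uint64_t)rx_queue * recip) >> 32);

	return (uint16_t)(rx_queue - quot * n);
}
```

A caller would store the value returned by queue_reciprocal() next to the
queue count and pass both to mod_by_reciprocal() for each packet.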