Date:	Wed, 06 May 2009 05:36:08 +0200
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Vladimir Ivashchenko <hazard@...ncoudi.com>
CC:	netdev@...r.kernel.org
Subject: Re: bond + tc regression ?

Vladimir Ivashchenko wrote:
> On Tue, May 05, 2009 at 08:50:26PM +0200, Eric Dumazet wrote:
> 
>>> I have tried with IRQs bound to one CPU per NIC. Same result.
>> Did you check with "grep eth /proc/interrupts" that your affinity settings
>> were indeed taken into account?
>>
>> You should use the same CPU for eth0 and eth2 (bond0),
>>
>> and another CPU for eth1 and eth3 (bond1).
> 
> Ok, the best result is when I assign all IRQs to the same CPU. Zero drops.
> 
> When I bind the slaves of the bond interfaces to the same CPU, I start to get
> some drops, but far fewer than before. I didn't play with combinations.
> 
> My problem is that after applying your accounting patch below, one of my
> HTB servers reports only 30-40% CPU idle on one of the cores. That won't
> last me very long; load balancing across cores is needed.
> 
> Is there at least a way to balance individual NICs on a per-core basis?
> 
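
To answer the literal question first: yes, you can pin each NIC's IRQ to a
chosen core by writing a CPU bitmask to /proc/irq/<N>/smp_affinity. A minimal
sketch in C (the IRQ numbers below are hypothetical, take the real ones from
"grep eth /proc/interrupts"; needs root):

#include <stdio.h>

static int set_irq_affinity(int irq, unsigned int mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", mask);	/* hex mask: 1 = CPU0, 2 = CPU1, ... */
	return fclose(f);
}

int main(void)
{
	set_irq_affinity(24, 0x1);	/* eth0 -> CPU0 */
	set_irq_affinity(26, 0x1);	/* eth2 -> CPU0 (bond0 slaves together) */
	set_irq_affinity(25, 0x2);	/* eth1 -> CPU1 */
	set_irq_affinity(27, 0x2);	/* eth3 -> CPU1 (bond1 slaves together) */
	return 0;
}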

The problem with this setup, though, is that you have four NICs but only two
logical devices (bond0 & bond1) plus a central HTB qdisc on top. This
essentially makes all flows go through the same locks (some rwlocks guarding
the bonding driver, and others guarding the HTB structures).

Also, when a CPU receives a frame on ethX, it has to forward it on ethY, and
another lock guards access to the TX queue of the ethY device. If another CPU
receives a frame on ethZ and wants to forward it to ethY, that CPU needs the
same locks, and everything slows down.
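
For reference, this is the lock acquisition every forwarding CPU fights over,
roughly as it appears in include/linux/netdevice.h of this era; note that it
touches exactly the fields the patch below moves to a separate cache line:

static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
{
	spin_lock(&txq->_xmit_lock);	/* contended across CPUs */
	txq->xmit_lock_owner = cpu;	/* dirties the same cache line */
}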

I am pretty sure you could get good results by choosing two CPUs that share
the same L2 cache; the L2 on your CPU is 6MB. Another point would be to
carefully choose the size of the RX rings on the ethX devices. You could try
to *reduce* them so that the number of in-flight skbs is small enough that
everything fits in this 6MB cache.
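
If you want to script the ring resizing, the programmatic equivalent of
"ethtool -g eth0" / "ethtool -G eth0 rx 256" is the SIOCETHTOOL ioctl. A
minimal sketch (256 is only a starting value to experiment with):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&ring;

	if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GRINGPARAM");
		return 1;
	}
	printf("eth0 RX ring: %u (max %u)\n",
	       ring.rx_pending, ring.rx_max_pending);

	ring.cmd = ETHTOOL_SRINGPARAM;
	ring.rx_pending = 256;	/* fewer in-flight skbs -> better cache fit */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_SRINGPARAM");
	close(fd);
	return 0;
}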

The problem is not really CPU power, but RAM bandwidth. Having two cores
instead of one attached to the one central memory bank won't increase RAM
bandwidth; it reduces it.

And making several cores compete for locks in this RAM only slows down processing.

The only real fix we have is to change bonding so that the driver uses RCU
instead of rwlocks, but that is probably a complex task. Multiple CPUs
accessing the bonding structures could then share them without dirtying
memory and ping-ponging cache lines.
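
A minimal sketch of the shape of that conversion, for an xmit path picking
the active slave (not actual bonding source; curr_slave_lock and
curr_active_slave are the real field names, the rest is illustrative):

/*
 * Read side today: every packet write-dirties the rwlock cache line.
 *
 *	read_lock(&bond->curr_slave_lock);
 *	if (bond->curr_active_slave)
 *		res = bond_dev_queue_xmit(bond, skb,
 *					  bond->curr_active_slave->dev);
 *	read_unlock(&bond->curr_slave_lock);
 *
 * Read side with RCU: shared memory stays untouched.
 *
 *	rcu_read_lock();
 *	slave = rcu_dereference(bond->curr_active_slave);
 *	if (slave)
 *		res = bond_dev_queue_xmit(bond, skb, slave->dev);
 *	rcu_read_unlock();
 *
 * The writer side would rcu_assign_pointer() the new slave and
 * synchronize_rcu() before freeing the old one.
 */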

Ah, I forgot about one patch that could help your setup too (if you use more
than one CPU for NIC IRQs, of course), queued for 2.6.31:

(commit 6a321cb370ad3db4ba6e405e638b3a42c41089b0)

You could post oprofile results to help us find other hot spots.


[PATCH] net: netif_tx_queue_stopped too expensive

netif_tx_queue_stopped(txq) is false most of the time.

Yet it is very expensive on SMP.

static inline int netif_tx_queue_stopped(const struct netdev_queue *dev_queue)
{
	return test_bit(__QUEUE_STATE_XOFF, &dev_queue->state);
}

I saw this while hunting with oprofile, in the bnx2 driver's bnx2_tx_int().

We probably should split "struct netdev_queue" into two parts, one
being read-mostly.

__netif_tx_lock() touches _xmit_lock & xmit_lock_owner; these
deserve a separate cache line.

Signed-off-by: Eric Dumazet <dada1@...mosbay.com>


diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e7783f..1caaebb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -447,12 +447,18 @@ enum netdev_queue_state_t
 };
 
 struct netdev_queue {
+/*
+ * read mostly part
+ */
 	struct net_device	*dev;
 	struct Qdisc		*qdisc;
 	unsigned long		state;
-	spinlock_t		_xmit_lock;
-	int			xmit_lock_owner;
 	struct Qdisc		*qdisc_sleeping;
+/*
+ * write mostly part
+ */
+	spinlock_t		_xmit_lock ____cacheline_aligned_in_smp;
+	int			xmit_lock_owner;
 } ____cacheline_aligned_in_smp;
 
 
