[NET_SCHED]: explicitly hold dev tx lock

For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
to the same ethernet device, the device's queue lock is contended by
all N CPUs constantly. The TX lock, by contrast, is contended by at
most 2 CPUs.

In the current mode of qdisc operation, when all N CPUs contend for
the dequeue region and one of them (after all that work) enters the
dequeue region, we may end up aborting the path if we are unable to
get the tx lock, and go back to contending for the queue lock. As N
goes up, this gets worse.

The changes in this patch result in a small increase in performance
on a 4-CPU (2x dual-core) machine with no IRQ binding. My tests are
UDP based and keep all 4 CPUs busy for the duration of the test. Both
e1000 and tg3 showed similar behavior. I expect higher gains with
more CPUs.

The summary below shows the resulting pps (packets per second) for
different UDP packet sizes. Note that at around 200 bytes the two
don't seem that much different, and we are approaching wire speed
(with plenty of CPU available; e.g. at 512B the app is sitting at 80%
idle in both cases).

+----------+--------+--------+--------+--------+--------+
| pktsize  |  64B   |  128B  |  256B  |  512B  | 1024B  |
+----------+--------+--------+--------+--------+--------+
| Original | 467482 | 463061 | 388267 | 216308 | 114704 |
| txlock   | 468922 | 464060 | 388298 | 216316 | 114709 |
+----------+--------+--------+--------+--------+--------+

Signed-off-by: Jamal Hadi Salim

---
commit b0e36991c5850dfe930f80ee508b08fdcabc18d1
tree b1787bba26f80a325298f89d1ec882cc5ab524ae
parent 42765047105fdd496976bc1784d22eec1cd9b9aa
author Jamal Hadi Salim Sun, 23 Sep 2007 09:09:17 -0400
committer Jamal Hadi Salim Sun, 23 Sep 2007 09:09:17 -0400

 net/sched/sch_generic.c |   19 ++-----------------
 1 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..95ae119 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
 	int ret;

 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;

-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}

 	/* And release queue */
 	spin_unlock(&dev->queue_lock);

+	HARD_TX_LOCK(dev, smp_processor_id());
 	ret = dev_hard_start_xmit(skb, dev);
-
-	if (!lockless)
-		netif_tx_unlock(dev);
+	HARD_TX_UNLOCK(dev);

 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;
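
For reference, HARD_TX_LOCK()/HARD_TX_UNLOCK() are not introduced by
this patch; they already exist in net/core/dev.c, where dev_queue_xmit()
uses them. A rough sketch of their behavior in kernels of this era
(paraphrased, not quoted verbatim; exact field and helper names varied
across versions):

/*
 * Sketch of the pre-existing macros from net/core/dev.c.
 * Only take the device tx lock when the driver has NOT set
 * NETIF_F_LLTX (i.e. when it does not do its own locking in
 * start_xmit). netif_tx_lock() records the owning CPU; the cpu
 * argument is historical and in some versions is used to set
 * dev->xmit_lock_owner directly.
 */
#define HARD_TX_LOCK(dev, cpu) {			\
	if ((dev->features & NETIF_F_LLTX) == 0) {	\
		netif_tx_lock(dev);			\
	}						\
}

#define HARD_TX_UNLOCK(dev) {				\
	if ((dev->features & NETIF_F_LLTX) == 0) {	\
		netif_tx_unlock(dev);			\
	}						\
}

The relevant difference from the removed code is that netif_tx_lock()
spins until it acquires the lock, whereas the old netif_tx_trylock()
path aborted on contention and requeued the packet, sending the CPU
back to fight for the queue lock.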