Here's the beginning of a howto for driver authors. The intended
audience for this howto is people already familiar with netdevices.

1.0 Netdevice Prerequisites
---------------------------

For hardware-based netdevices, you must have hardware that is capable
of doing DMA with many descriptors; i.e., hardware with a queue length
of 3 (as in some fscked ethernet hardware) is not very useful in this
case.

2.0 What is new in the driver API
---------------------------------

There is one new method and one new variable that the driver author
needs to be aware of. These are:

1) dev->hard_end_xmit()
2) dev->xmit_win

2.1 Using core driver changes
-----------------------------

To provide context, let's look at a typical driver abstraction for
dev->hard_start_xmit(). It has four parts:

a) packet formatting (example: vlan, mss, descriptor counting, etc.)
b) chip-specific formatting
c) enqueueing the packet onto a DMA ring
d) IO operations to complete the packet transmit: telling the DMA
   engine to chew on the ring, tx completion interrupts, etc.

[For the sake of code cleanliness/readability, and regardless of this
work, one should break dev->hard_start_xmit() into those four
functional blocks anyway.]

A driver which has all four parts and needs to support batching is
advised to split its dev->hard_start_xmit() in the following manner:

1) use its dev->hard_end_xmit() method to achieve #d
2) use dev->xmit_win to tell the core how much space it has

#b and #c can stay in ->hard_start_xmit() (or be split out whichever
way you want to do this). Section 3 shows more details on the
suggested usage.

2.1.1 Theory of operation
-------------------------

1. The core dequeues up to dev->xmit_win packets from the qdiscs.
   Fragmented and GSO packets are accounted for as well.
2. The core grabs the device's TX_LOCK.
3. The core loops over all skbs, invoking the driver's
   dev->hard_start_xmit() for each.
4.
Core invokes the driver's dev->hard_end_xmit() if any packets were
transmitted.

2.1.1.1 The slippery LLTX
-------------------------

Since these types of drivers are being phased out and they require
extra code, they will no longer be supported. As of Oct 2007, the
code that supported them has been removed.

2.1.1.2 xmit_win
----------------

The dev->xmit_win variable is set by the driver to tell the core how
much space it has in its rings/queues. This detail is then used to
figure out how many packets are retrieved from the qdisc queues (in
order to send to the driver). dev->xmit_win is introduced to ensure
that when we pass the driver a list of packets it will swallow all of
them -- which is useful because we don't requeue to the qdisc (and so
avoid burning unnecessary CPU cycles or introducing any strange
re-ordering). Essentially, the driver signals to us how much
descriptor space it has by setting this variable.

2.1.1.2.1 Setting xmit_win
--------------------------

This variable should be set during xmit path shutdown (netif_stop),
wakeup (netif_wake) and in ->hard_end_xmit(). In the first case the
value is set to 1; in the other two it is set to whatever the driver
deems to be the available space on the ring.

3.0 Driver Essentials
---------------------

The typical driver tx state machine is:

----
-1-> +Core sends packets
     +--> Driver puts packet onto hardware queue
     +    if hardware queue is full, netif_stop_queue(dev)
     +
-2-> +Core stops sending because of netif_stop_queue(dev)
     ..
     .. time passes ...
     ..
-3-> +--> Driver has transmitted packets, opens up the tx path by
          invoking netif_wake_queue(dev)
-1-> +Cycle repeats and core sends more packets (step 1).
----

3.1 Driver prerequisite
-----------------------

This is _a very important_ requirement in making batching useful: the
driver should provide a low threshold at which it opens up the tx
path. Drivers such as tg3 and e1000 already do this.
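The stop/wake cycle with a low wake threshold can be modeled in plain
C. This is an illustrative userspace sketch, not kernel code: struct
model_dev, RING_SIZE and LOW_THRESH are invented stand-ins for the
driver's ring state, and the stopped flag and xmit_win field stand in
for netif_{stop,wake}_queue() and dev->xmit_win.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of a tx ring with a low wake threshold. */
#define RING_SIZE  16
#define LOW_THRESH  4	/* reopen tx path only at this much free space */

struct model_dev {
	int inflight;	/* descriptors currently occupied */
	bool stopped;	/* models netif_queue_stopped() */
	int xmit_win;	/* models dev->xmit_win */
};

static int ring_space(const struct model_dev *d)
{
	return RING_SIZE - d->inflight;
}

/* Xmit path: enqueue one packet; stop the queue when the ring fills,
 * dropping the advertised window back to the default of 1. */
static void xmit_one(struct model_dev *d)
{
	d->inflight++;
	if (ring_space(d) == 0) {
		d->stopped = true;	/* models netif_stop_queue() */
		d->xmit_win = 1;
	}
}

/* Tx-completion path: reclaim descriptors, and wake the queue only
 * once a decent amount of space is free, advertising the real space. */
static void tx_complete(struct model_dev *d, int reclaimed)
{
	d->inflight -= reclaimed;
	if (d->stopped && ring_space(d) >= LOW_THRESH) {
		d->xmit_win = ring_space(d);
		d->stopped = false;	/* models netif_wake_queue() */
	}
}
```

Note how reclaiming fewer than LOW_THRESH descriptors leaves the
queue stopped; waking on every single freed descriptor would defeat
batching by handing the core a window of one packet at a time.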
Before you invoke netif_wake_queue(dev), check whether a threshold of
free space has been reached for inserting new packets. Here's an
example of how I added it to the tun driver. Observe the setting of
dev->xmit_win.

----
#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
u32 t = skb_queue_len(&tun->readq);
if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
	tun->dev->xmit_win = tun->dev->tx_queue_len;
	netif_wake_queue(tun->dev);
}
----

Here's how the batching e1000 driver does it:

----
if (unlikely(cleaned && netif_carrier_ok(netdev) &&
	     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
	if (netif_queue_stopped(netdev)) {
		int rspace = E1000_DESC_UNUSED(tx_ring) -
			     (MAX_SKB_FRAGS + 2);
		netdev->xmit_win = rspace;
		netif_wake_queue(netdev);
	}
}
----

The equivalent in the tg3 code (with no batching changes) looks like:

----
if (netif_queue_stopped(tp->dev) &&
    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
	netif_wake_queue(tp->dev);
----

3.2 Driver Setup
----------------

*) On initialization (before netdev registration):

1) set NETIF_F_BTX in dev->features,
   i.e., dev->features |= NETIF_F_BTX
   This makes the core do the proper initialization.
2) set dev->xmit_win to something reasonable, e.g., half the tx DMA
   ring size.
3) set up a proper pointer to the ->hard_end_xmit() method.

3.3 Annotation on the different methods
---------------------------------------

This section shows examples and offers suggestions on how the
different methods and the variable could be used.

3.3.1 dev->hard_start_xmit()
----------------------------

Here's an example of a tx routine similar to the one I added to the
current tun driver. The bxmit suffix is kept so that you can turn off
batching if needed via an ethtool interface and call the already
existing interface.

----
static int xxx_net_bxmit(struct net_device *dev)
{
	....
	....
	/* enqueue onto hardware ring */
	if (hardware ring full) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
	}
	.......
	..
	.
}
----

All return codes like NETDEV_TX_OK etc. still apply.
In addition, a new code, NETDEV_TX_DROPPED, should be returned if the
packet is dropped. This helps the core layer account for transmitted
packets and invoke dev->hard_end_xmit() at the end of a batch when one
or more packets have been transmitted.

3.3.2 The tx complete, dev->hard_end_xmit()
-------------------------------------------

In this method, if there are any IO operations that apply to a set of
packets, such as kicking the DMA engine or setting interrupt
thresholds, leave them to the end and apply them once after you have
successfully enqueued. This saves a lot of CPU cycles, since IO is
cycle-expensive. Here is a simplified tg3 dev->hard_end_xmit():

----
void tg3_complete_xmit(struct net_device *dev)
{
	struct tg3 *tp = netdev_priv(dev);
	u32 entry = tp->tx_prod;

	/* Packets are ready; update Tx producer idx, local and on card. */
	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW),
		     entry);

	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
			tg3_set_win(tp);
			netif_wake_queue(dev);
		}
	} else {
		tg3_set_win(tp);
	}

	mmiowb();
	dev->trans_start = jiffies;
}
----

3.3.3 Setting dev->xmit_win
---------------------------

As mentioned earlier, this variable provides hints on how much data to
send from the core to the driver. Here are the obvious ways to set it:

a) On doing a netif_stop, set it to 1. By default all drivers have
   this value set to 1, to emulate the old behavior where a driver
   only receives one packet at a time.
b) On netif_wake_queue, set it to the maximum available space. You
   have to be careful if your hardware does scatter-gather, since the
   core will pass you scatter-gatherable skbs, so you want to leave at
   least enough space for the maximum allowed number of fragments.
   Look at tg3 and e1000 to see how this is implemented.

The variable is important because it keeps the core from sending the
driver any more than it can handle, thereby avoiding any need to muck
with the packet scheduling mechanisms.
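The wake-time computation in (b) can be sketched in plain C. This is
an illustrative userspace model, not driver code: MAX_SKB_FRAGS is
hard-coded to a typical value here, the "+ 2" headroom mirrors the
e1000 rspace computation shown earlier, and the clamp-to-1 fallback
is an assumption of this sketch (echoing the default xmit_win of 1)
rather than something the real drivers do.

```c
#include <assert.h>

#define MAX_SKB_FRAGS 17	/* illustrative; the real value is
				 * arch/page-size dependent */

/* On netif_wake_queue: advertise the unused descriptors, minus room
 * for one maximally fragmented scatter-gather packet, so that a batch
 * sized from this window can never overrun the ring. */
static int wake_xmit_win(int unused_descriptors)
{
	int win = unused_descriptors - (MAX_SKB_FRAGS + 2);

	return win > 0 ? win : 1;	/* sketch: never advertise < 1 */
}
```

With a 64-descriptor ring fully reclaimed, this advertises a window of
45 packets; with only 10 descriptors free it falls back to 1, i.e. the
single-packet behavior of a non-batching driver.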
Appendix 1: History
-------------------

June 11/2007: Initial revision
June 11/2007: Fixed typo on e1000 netif_wake description
..
Aug 08/2007: Added info on VLAN and the skb->cb[] danger
..
Sep 24/2007: Revised and cleaned up
Sep 25/2007: Cleanups from Randy Dunlap
Oct 08/2007: Removed references to LLTX and packet formatting
Oct 09/2007: Added reference to NETDEV_TX_DROPPED