The code for this work can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The purpose of this doc is not to describe the batching work, although
benefits of that approach can be gleaned from it (for example, section 1.x
should give the implicit value proposition). There are more details on the
rest of the batching work in the driver howto posted on netdev, as well as
in a couple of presentations I have given in the past (refer to the last
few slides of my netconf 2006 slides, for example).

The purpose of this doc is to describe the evolution of the work, which
leads to two important APIs for the driver writer. The first is the
introduction of a method called dev->hard_prep_xmit(); the second is the
introduction of the variable dev->xmit_win.

For the sake of clarity I will be using non-LLTX because it is easier to
explain.
Note: the e1000, although LLTX, has shown considerable improvement with
this approach as well, as verified by experiments. (For a lot of other
reasons the e1000 needs to be converted to be non-LLTX.)

1.0 Classical approach
------------------------

Let's start with the classical approach of what a random driver will do:

loop:
 1--core spins for qlock
 1a---dequeue packet
 1b---release qlock
 2--core spins for xmit_lock
 3--enter hardware xmit routine
 3a----format packet:
 --------e.g. vlan, mss, shinfo, csum, descriptor count, etc
 -----------if there is something wrong, free skb and return OK
 3b----do per chip specific checks
 -----------if there is something wrong, free skb and return OK
 3c----all is good, now stash the packet into the DMA ring
 4-- release tx lock
 5-- if all ok (there are still packets, netif not stopped, etc) continue,
     else break
end_loop

In between #1 and #1b, another CPU contends for the qlock.
In between #3 and #4, another CPU contends for the tx lock.

1.1 Challenge to classical approach:
-------------------------------------

The cost of grabbing/setting up a lock is not cheap. Another observation:
spinning CPUs are expensive because the utilization of the compute cycles
goes down. We assume the cost of dequeueing from the qdisc is not as
expensive as enqueueing to the DMA ring.

1.2 Addressing the challenge to classical approach:
---------------------------------------------------

So we start with a simple premise to resolve the challenge above. We try
to amortize the cost by:
a) grabbing "as many" packets as we can between #1 and #1b, with very
   little processing, so we don't hold that lock for long.
b) then sending "as many" as we can in between #3 and #4.

Let's start with a simple approach (which is what the batch code did in
its earlier versions):

loop:
 1--core spins for qlock
 loop1: // "as many packets"
 1a---dequeue and enqueue on dev->blist
 end_loop1
 1b---release qlock
 2--core spins for xmit_lock
 3--enter hardware xmit routine
 loop2: // "for as many packets" or "no more ring space"
 3a----format packet
 --------e.g. vlan, mss, shinfo, csum, descriptor count, etc
 -----------if there is something wrong, free skb and return OK (**)
 3b----do per chip specific checks
 -----------if there is something wrong, free skb and return OK
 3c----all is good, now stash the packet into the DMA ring
 end_loop2
 3d -- if you enqueued packets, tell the tx DMA to chew on them ...
 4-- release tx lock
 5-- if all ok (there are still packets, netif not stopped, etc) continue
end_loop

loop2 has the added side-benefit of improving instruction cache warmness,
i.e. if we can do things in a loop we keep the instruction cache warm for
more packets (and hence improve the hit rate/CPU utilization). However,
the length of this loop may affect the hit rate (very clear to see as you
use machines with bigger caches, with qdisc_restart showing up in
profiles).
Note also: we have amortized the cost of bus IO in step #3d because we
make only one call after exiting the loop.
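To make the batched sequence above a bit more concrete, here is a rough C
sketch of what the core-side send path could look like. This is only an
illustration of the idea, not code from the git tree above: the
batch_qdisc_run() name, the fixed budget and the dev->hard_batch_xmit()
entry point (which would implement loop2 plus the single DMA kick of #3d)
are assumptions made for this sketch; dev->blist is the staging list
mentioned in loop1.

/* Illustrative sketch only -- not the actual batching patch code.
 * Kernel context assumed (linux/netdevice.h, linux/skbuff.h).
 * dev->blist is an sk_buff_head staging list; dev->hard_batch_xmit()
 * is a hypothetical driver entry point that runs loop2 and then
 * kicks the DMA once (#3d).
 */
static void batch_qdisc_run(struct net_device *dev, struct Qdisc *q)
{
        struct sk_buff *skb;
        int budget = 16;        /* "as many packets"; arbitrary here */

        /* #1 .. #1b: hold the qdisc lock only long enough to dequeue */
        spin_lock(&dev->queue_lock);
        while (budget-- > 0 && (skb = q->dequeue(q)) != NULL)
                __skb_queue_tail(&dev->blist, skb);     /* loop1 */
        spin_unlock(&dev->queue_lock);

        if (skb_queue_empty(&dev->blist))
                return;

        /* #2 .. #4: one tx lock grab covers the whole batch */
        netif_tx_lock(dev);
        if (!netif_queue_stopped(dev))
                dev->hard_batch_xmit(&dev->blist, dev); /* loop2 + #3d */
        netif_tx_unlock(dev);
}

The point of the sketch is that each lock is grabbed once per batch rather
than once per packet, which is exactly the amortization described in a)
and b) above.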
2.0 New challenge
-----------------

A new challenge is introduced: we could hold the tx lock for a "long
period", depending on how unclean the driver path is, i.e. imagine holding
this lock for 100 packets. So we need to find a balance.
We observe that within loop2, #3a does not need to be done under the xmit
lock. All it does is muck with an skb and may in fact end up dropping it.

2.1 Addressing new challenge
-----------------------------

To address the new challenge, I introduced the dev->hard_prep_xmit() API.
Essentially, dev->hard_prep_xmit() moves the packet formatting into #1.
So now we change the code to be something along the lines of (a rough
driver-side sketch of this split is in Appendix 2 below):

loop:
 1--core spins for qlock
 loop1: // "as many packets"
 1a---dequeue and enqueue on dev->blist
 1b--if driver has hard_prep_xmit, then format packet
 end_loop1
 1c---release qlock
 2--core spins for xmit_lock
 3--enter hardware xmit routine
 loop2: // "for as many packets" or "no more ring space"
 3a----do per chip specific checks
 -----------if there is something wrong, free skb and return OK
 3b----all is good, now stash the packet into the DMA ring
 end_loop2
 3c -- if you enqueued packets, tell the tx DMA to chew on them ...
 4-- release tx lock
 5-- if all ok (there are still packets, netif not stopped, etc) continue
end_loop

3.0 TBA:
--------

talk here about further optimizations added starting with xmit_win..

Appendix 1: HISTORY
---------------------
Aug 08/2007 - initial revision
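Appendix 2: Driver-side sketch for dev->hard_prep_xmit()
---------------------------------------------------------
The following is a rough, hypothetical driver-side sketch of the split
described in section 2.1. It is not taken from any real driver in the git
tree: the mydrv_* names and helpers (mydrv_tso_setup, mydrv_ring_full,
mydrv_stash_in_ring, mydrv_kick_dma), as well as the batch xmit entry
point, are assumptions used only to show which work runs outside versus
inside the tx lock.

/* Hypothetical driver sketch; kernel context assumed. */

/* Called by the core in step #1b, under the qdisc lock but with no
 * tx lock held: do the formatting (vlan, mss, csum, descriptor
 * count) here, and drop the skb here if it is broken.
 */
static int mydrv_hard_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
        if (skb_is_gso(skb) && mydrv_tso_setup(dev, skb) < 0) {
                dev_kfree_skb_any(skb);
                return -1;              /* core skips this packet */
        }
        /* fill in csum offload fields, vlan tag, descriptor count ... */
        return 0;
}

/* Called under netif_tx_lock (steps #3 .. #4): only per-chip checks
 * and stashing into the DMA ring are left, so the per-packet hold
 * time on the tx lock stays short.
 */
static int mydrv_hard_batch_xmit(struct sk_buff_head *blist,
                                 struct net_device *dev)
{
        struct sk_buff *skb;
        int sent = 0;

        while ((skb = __skb_dequeue(blist)) != NULL) {
                if (mydrv_ring_full(dev)) {             /* #3a */
                        __skb_queue_head(blist, skb);   /* put it back */
                        break;
                }
                mydrv_stash_in_ring(dev, skb);          /* #3b */
                sent++;
        }
        if (sent)
                mydrv_kick_dma(dev);                    /* #3c: one IO write */
        return sent;
}

Because mydrv_hard_prep_xmit() runs before the tx lock is taken, a slow or
failing formatting step no longer stretches the tx lock hold time; only
the cheap ring operations do.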