From: Wenji Wu
Subject: Potential performance bottleneck for Linux TCP (2.6 Desktop, Low-latency Desktop)

- Why the kernel needed patching

In Linux TCP, when a network application makes a system call to move data from the socket's receive buffer to user space, tcp_recvmsg() is called and the socket is locked. While the socket is locked, all incoming packets for that TCP socket go to the backlog queue without being TCP processed. Because a process in Linux 2.6 can be interrupted mid-task, the network application's time slice can expire while it holds the socket lock. If the application is then moved to the expired array, none of the packets in the backlog queue are TCP processed until the application resumes execution. If the system is heavily loaded, TCP can easily hit a retransmission timeout (RTO) on the sender side.

- The overall design approach in the patch

The underlying idea is that when packets are waiting on the prequeue or backlog queue, the data-receiving process should not be allowed to release the CPU for long.

- Implementation details

We have modified the Linux process scheduling policy and tcp_recvmsg(). To summarize, the solution works as follows: an expired data-receiving process with packets waiting on the backlog queue or prequeue is moved to the active array instead of the expired array as usual. More often than not, the expired data-receiving process will simply continue to run. Even if it doesn’t, the wait time before it resumes execution is greatly reduced. However, this gives the process extra runs compared to other processes in the runqueue. For the sake of fairness, the process is labeled with the extra_run_flag. Also consider two facts: (1) the resumed process continues its execution within tcp_recvmsg(); (2) tcp_recvmsg() does not return to user space until the prequeue and backlog queue are drained.
For the sake of fairness, we modified tcp_recvmsg() as follows: after the prequeue and backlog queue are drained and before tcp_recvmsg() returns to user space, any process labeled with the extra_run_flag calls yield() to explicitly yield the CPU to other processes in the runqueue. yield() works by removing the process from the active array (where it currently is, because it is running) and inserting it into the expired array. Also, to prevent processes in the expired array from starving, a special rule is applied in process scheduling (the same rule used for interactive processes): if processes in the expired array are starved, an expired process is moved to the expired array regardless of its status.

Changed files:

/kernel/sched.c
/kernel/fork.c
/include/linux/sched.h
/net/ipv4/tcp.c

- Testing results

The proposed solution trades off a small amount of fairness to resolve the TCP performance bottleneck. It does not cause serious fairness issues.

The patch is for Linux kernel 2.6.14, Desktop and Low-latency Desktop.
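
For readers who want the decision logic from "Implementation details" in one place, here is a minimal user-space sketch. It is a simplified model under stated assumptions, not the patch itself: only extra_run_flag is a name taken from the description above. The struct task_model type and the functions on_timeslice_expired() and finish_receive() are hypothetical, and clearing the flag after the yield is an assumption the description does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not the kernel's types. */
enum array_id { ACTIVE_ARRAY, EXPIRED_ARRAY };

struct task_model {
    size_t pending_packets;  /* packets waiting on the prequeue or backlog queue */
    bool extra_run_flag;     /* set when the task was granted an extra run */
    enum array_id array;     /* priority array the task currently sits in */
};

/*
 * Models the scheduling rule applied when a time slice expires.
 * A receiver with packets still queued stays in the active array and is
 * flagged, unless the expired array is starving, in which case the task
 * goes to the expired array regardless of its status.
 */
enum array_id on_timeslice_expired(struct task_model *t, bool expired_starving)
{
    assert(t != NULL);
    if (t->pending_packets > 0 && !expired_starving) {
        t->extra_run_flag = true;
        t->array = ACTIVE_ARRAY;
    } else {
        t->array = EXPIRED_ARRAY;
    }
    return t->array;
}

/*
 * Models the tail of the receive path: the queues are drained first, then
 * a flagged task gives the CPU back before returning to user space.
 * Returns true if the task yielded.
 */
bool finish_receive(struct task_model *t)
{
    assert(t != NULL);
    t->pending_packets = 0;       /* prequeue and backlog queue drained */
    if (!t->extra_run_flag)
        return false;
    t->extra_run_flag = false;    /* assumption: flag is cleared once repaid */
    t->array = EXPIRED_ARRAY;     /* effect of yield() as described above */
    return true;
}
```

In this model a receiver with queued packets gets one extra run and then repays it by yielding, while a receiver with nothing queued, or one that expires while the expired array is starving, follows the usual path to the expired array.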