Message-ID: <1267817699.30393.8.camel@c-dwalke-linux.qualcomm.com>
Date: Fri, 05 Mar 2010 11:34:59 -0800
From: Daniel Walker <dwalker@...eaurora.org>
To: "David S. Miller" <davem@...emloft.net>
Cc: Jim Harford <c_jharfo@...cinc.com>, netdev@...r.kernel.org
Subject: [PATCH] net: Fix race condition on receive path.
From: Jim Harford <c_jharfo@...cinc.com>
Fixes a race condition on the networking receive path that causes all
received packets to be dropped after 15-60 minutes of heavy network usage.
Function process_backlog() empties the receive queue, re-enables
interrupts, then "completes" the softIRQ. This provides a time window for
netif_rx() to execute (in IRQ context) and enqueue a received packet
without re-scheduling the softIRQ. After this, the receive queue is never
processed and the system eventually begins to drop all received packets.
The fix has netif_rx() call napi_schedule() whether or not the
receive queue is empty. Two questions must be addressed:
1) Is the fix correct? Or, is it safe to always call napi_schedule()?
2) Does the fix result in a performance degradation?
With respect to correctness, the NAPI softIRQ routine must NOT be
scheduled again while it is already pending or running. This is controlled
by the bit flag NAPI_STATE_SCHED in the "state" field of struct
napi_struct: when this flag is set, the softIRQ routine is either already
pending or currently running. When the softIRQ is finished, the bit is
cleared via __napi_complete(), which is called by napi_complete(), which
process_backlog() invokes after re-enabling interrupts and just prior to
returning.
When netif_rx() calls napi_schedule(), that routine checks the
NAPI_STATE_SCHED bit flag by calling napi_schedule_prep(), which calls
test_and_set_bit(); if the flag is already set, napi_schedule() returns
without doing anything. Note that test_and_set_bit() implements an
atomic "test and set" operation that is commonly used as the foundation
of mutual exclusion mechanisms.
So it is safe for netif_rx() to call napi_schedule(), whether or not the
receive queue is empty.
With respect to performance, the following behavior was observed on an
ARM CPU, cross-compiling for ARM with gcc. Removing the test of
whether the receive queue is empty saves 2 assembly instructions. The
addition of calling napi_schedule() when that routine does nothing
(because the NAPI_STATE_SCHED flag is set) adds 11 assembly instructions.
Thus, there is a net addition of 9 assembly instructions. Furthermore,
we observe in our testing that even under heavy network load, the receive
queue is empty on at least 95% of netif_rx() invocations. The 9 assembly
instructions, amortized over all netif_rx() invocations, yield an average
performance penalty of less than 1 assembly instruction per invocation
(9 x 0.05 = 0.45). Finally, the test-and-set operation does not touch
memory shared by multiple CPU cores (which can cause costly cache-line
traffic), because the data structures in question are specific to a
single CPU by design.
So calling napi_schedule() at every invocation of netif_rx() does not
impair performance.
CRs-fixed: 184599
Signed-off-by: Jim Harford <c_jharfo@...cinc.com>
Signed-off-by: Daniel Walker <dwalker@...eaurora.org>
---
net/core/dev.c | 11 +++--------
1 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index be9924f..43161c6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2150,15 +2150,10 @@ int netif_rx(struct sk_buff *skb)
 	__get_cpu_var(netdev_rx_stat).total++;
 	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
-		}
-
+		__skb_queue_tail(&queue->input_pkt_queue, skb);
 		napi_schedule(&queue->backlog);
-		goto enqueue;
+		local_irq_restore(flags);
+		return NET_RX_SUCCESS;
 	}
 
 	__get_cpu_var(netdev_rx_stat).dropped++;
--
1.6.3.3