lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Fri, 08 May 2009 17:12:39 +0200
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Krzysztof Halasa <khc@...waw.pl>
CC:	netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>
Subject: [PATCH] net: reduce number of reference taken on sk_refcnt

Krzysztof Halasa a écrit :
> Hi,
> 
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
> 
> I'm specifically thinking of IXP4xx Ethernet (and possibly WAN), those
> are 266-667 MHz ARM XScale CPUs and the interrupts handling just
> transmitted sk_buffs are a pain. Could I delay those interrupts
> somehow?

This is a pain yes, I agree. But some improvements can be done.
(even on drivers already using NAPI, especially if same interrupt
is used both for RX and RX queues)

For example, we can avoid the dst_release() cache miss if this
is done in start_xmit(), and not later in TX completion while freeing skb.
I tried various patches in the past but unfortunatly it seems
only safe way to do this is in the driver xmit itself, not in core
network stack. This would need many patches, one for each driver.

Then, if your workload is UDP packets, a recent patch
avoided an expensive wakeup at TX completion time
(check commit bf368e4e70cd4e0f880923c44e95a4273d725ab4

http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4

Unfortunatly this patch is only for 2.6.30, since it is based
on a new infrastructure from Davide Libenzi)

Then, additional work should be done to reduce cache lines
misses in sock_wfree()

This function currently touches many different cache lines
(one for sk_wmem_alloc, one for testing sk_flags, one
 to decrement sk_refcnt). And on UDP, extra cachelines because
we call sock_def_write_space(). Yes this sucks.

We could avoid touching sk_refcnt in several cases. We
need a reference count only for the whole packets attached to a socket,
granted we use atomic_add_return() and such functions to detect 0
transitions. That way, if a tcp stream always has some packets in flight,
TX completion would not have to decrement sk_refcnt.

Following patch (based on net-next-2.6) to illustrate my point :

[PATCH] net: reduce number of reference taken on sk_refcnt

Current sk_wmem_alloc schema uses a sk_refcnt taken for each packet
in flight. This hurts some workloads at TX completion time, because
sock_wfree() has three cache lines to touch at least.
(one for sk_wmem_alloc, one for testing sk_flags, one
 to decrement sk_refcnt)

We could use only one reference count, taken only when sk_wmem_alloc
is changed from or to ZERO value (ie one reference count for any number
of in-flight packets)

Not all atomic_add() must be changed to atomic_add_return(), if we
know current sk_wmem_alloc is already not null.

This patch reduces by one number of cache lines dirtied in sock_wfree()
and number of atomic operation in some workloads. 

Signed-off-by: Eric Dumazet <dada1@...mosbay.com>

diff --cc include/net/tcp.h
index ac37228,87d210b..0000000
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..d57e88b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1213,14 +1213,17 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
  *
  * 	Inlined as it's very short and called for pretty much every
  *	packet ever received.
+ *	We take a reference on sock only if this is the first packet
+ *	accounted for. (one reference count for any number of in flight packets)
  */
 
 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
-	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	if (atomic_add_return(skb->truesize, &sk->sk_wmem_alloc) ==
+	    skb->truesize)
+		sock_hold(sk);
 }
 
 static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dbf3ff..bf10084 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1170,12 +1170,17 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;
 
 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
-	sock_put(sk);
+	/*
+	 * No more packets in flight, we must release sock refcnt
+	 */
+	if (res == 0)
+		sock_put(sk);
 }
 
 /*

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ