Date:	Tue, 07 Aug 2012 23:58:42 +0200
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	"Bruce \"Brutus\" Curtis" <brutus@...gle.com>
Cc:	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>, netdev@...r.kernel.org
Subject: Re: [PATCH] net-tcp: TCP/IP stack bypass for loopback connections

On Tue, 2012-08-07 at 14:08 -0700, Bruce "Brutus" Curtis wrote:
> From: "Bruce \"Brutus\" Curtis" <brutus@...gle.com>
> 
> TCP/IP loopback socket pair stack bypass, based on an idea by, and a
> rough upstream patch from, David Miller <davem@...emloft.net> called
> "friends". The data structure modifications and connection scheme are
> reused, with extensive data-path changes.
> 
> A new sysctl, net.ipv4.tcp_friends, is added:
>   0: disable friends and use the stock data path.
>   1: enable friends and bypass the stack data path, the default.
> 
> Note, when friends is enabled any loopback interposer, e.g. tcpdump,
> will only see the TCP/IP packets during connection establishment and
> teardown; all data bypasses the stack and is instead delivered directly
> to the destination socket.
> 
> Testing was done on a 4-socket 2.2GHz "Quad-Core AMD Opteron(tm)
> Processor 8354 CPU" based system. netperf results for a single
> connection show increased TCP_STREAM throughput and increased TCP_RR
> and TCP_CRR transaction rates for most message sizes vs the baseline,
> and comparable to AF_UNIX.
> 
> A significant increase (up to 5x) in aggregate throughput is seen for
> multiple concurrent netperf runs (STREAM 32KB I/O x N).
> 
> Some results:
> 
> Default netperf: netperf
>                  netperf -t STREAM_STREAM
>                  netperf -t STREAM_STREAM -- -s 51882 -m 16384 -M 87380
>                  netperf
> 
>          Baseline  AF_UNIX      AF_UNIX           Friends
>          Mbits/S   Mbits/S      Mbits/S           Mbits/S
>            6860       714   8%    9444 138% 1323%  10576 154% 1481% 112%
> 
> Note, for the AF_UNIX (STREAM_STREAM) test two results are listed: the
> 1st with no options; but as the defaults for AF_UNIX sockets perform
> much worse, a 2nd set of runs is done with socket buffer and send/recv
> buffer sizes equivalent to AF_INET (TCP_STREAM).
> 
> Note, all subsequent AF_UNIX (STREAM_STREAM, STREAM_RR) tests are done
> with "-s 51882" so that the same total effective socket buffering is
> used as for the AF_INET runs' defaults (16384+NNNNN/2).
> 
> STREAM 32KB I/O x N: netperf -l 100 -t TCP_STREAM -- -m 32K -M 32K
>                      netperf -l 100 -t STREAM_STREAM -- -s 51882 -m 32K -M 32K
>                      netperf -l 100 -t TCP_STREAM -- -m 32K -M 32K
> 
>           Baseline  AF_UNIX      Friends
>    N  COC Mbits/S   Mbits/S      Mbits/S
>    1   -    8616      9416 109%   11116 129% 118%
>    2   -   15419     17076 111%   20267 131% 119%
>   16   2   59497    303029 509%  347349 584% 115%
>   32   4   54223    273637 505%  272891 503% 100%
>  256  32   58244     85476 147%  273696 470% 320%
>  512  64   58745     87402 149%  260837 444% 298%
> 1600 200   83161    158915 191%  383947 462% 242%
> 
> COC = Cpu Over Commit ratio (16 core platform)
> 
> STREAM: netperf -l 100 -t TCP_STREAM
>         netperf -l 100 -t STREAM_STREAM -- -s 51882
>         netperf -l 100 -t TCP_STREAM
> 
> netperf  Baseline  AF_UNIX      Friends
> -m/-M N  Mbits/S   Mbits/S      Mbits/S
>   64       1020       445  44%     515  50% 116%
>   1K       4881      4340  89%    5070 104% 117%
>   8K       5933      8387 141%    9770 165% 116%
>  32K       8168      9538 117%   11067 135% 116%
>  64K       9116      9774 107%   11515 126% 118%
> 128K       9053     10044 111%   13082 145% 130%
> 256K       9642     10351 107%   13470 140% 130%
> 512K      10050     10142 101%   13327 133% 131%
>   1M       8640      9843 114%   12201 141% 124%
>  16M       7179      9619 134%   11316 158% 118%
> 
> RR: netperf -l 100 -t TCP_RR
>     netperf -l 100 -t STREAM_RR -- -s 51882 -m 16384 -M 87380
>     netperf -l 100 -t TCP_RR
> 
> netperf  Baseline  AF_UNIX      Friends
> -r N,N   Trans./S  Trans./S     Trans./S
>   64      47913     99681 208%   98225 205%  99%
>   1K      44045     92327 210%   91608 208%  99%
>   8K      26732     33201 124%   33025 124%  99%
>  32K      10903     11972 110%   13574 124% 113%
>  64K       7113      6718  94%    7176 101% 107%
> 128K       4191      3431  82%    3695  88% 108%
> 256K       2324      1937  83%    2147  92% 111%
> 512K        958      1056 110%    1202 125% 114%
>   1M        404       508 126%     497 123%  98%
>  16M       26.1      34.1 131%    32.9 126%  96%
> 
> CRR: netperf -l 100 -t TCP_CRR
>      netperf -l 100 -t TCP_CRR
> 
> netperf  Baseline  AF_UNIX      Friends
>   -r N   Trans./S  Trans./S     Trans./S
>   64      14690         -        18191 124%   -
>   1K      14258         -        17492 123%   -
>   8K      11535         -        14012 121%   -
>  32K       7035         -         8974 128%   -
>  64K       4312         -         5654 131%   -
> 128K       2252         -         3179 141%   -
> 256K       1237         -         2008 162%   -
> 512K       17.5*        -         1079   ?    -
>   1M       4.93*        -          458   ?    -
>  16M       8.29*        -         32.5   ?    -
> 
> Note, "-" denotes test not supported for transport.
>       "*" denotes test results reported without statistical confidence.
>       "?" denotes results not comparable.
> 
> SPLICE 32KB I/O:
> 
> Source
>  Sink   Baseline  Friends
>  FSFS   Mbits/S   Mbits/S
>  ----     8042     10810 134%
>  Z---     7071      9773 138%
>  --N-     8039     10820 135%
>  Z-N-     7902      9796 124%
>  -S--    17326     37496 216%
>  ZS--     9008      9573 106%
>  -SN-    16154     36269 225%
>  ZSN-     9531      9640 101%
>  ---S     8466      9933 117%
>  Z--S     8000      9453 118%
>  --NS    12783     11379  89%
>  Z-NS    11055      9489  86%
>  -S-S    12741     24380 191%
>  ZS-S     8097     10242 126%
>  -SNS    16657     30954 186%
>  ZSNS    12108     12763 105%
> 
> Note, "Z" source File /dev/zero, "-" source user memory
>       "N" sink File /dev/null, "-" sink user memory
>       "S" Splice on, "-" Splice off
> 
> Signed-off-by: Bruce \"Brutus\" Curtis <brutus@...gle.com>
> ---
>  include/linux/skbuff.h          |    2 +
>  include/net/request_sock.h      |    1 +
>  include/net/sock.h              |   32 +++-
>  include/net/tcp.h               |    3 +-
>  net/core/skbuff.c               |    1 +
>  net/core/sock.c                 |    1 +
>  net/core/stream.c               |   36 +++
>  net/ipv4/inet_connection_sock.c |   20 ++
>  net/ipv4/sysctl_net_ipv4.c      |    7 +
>  net/ipv4/tcp.c                  |  500 ++++++++++++++++++++++++++++++++++-----
>  net/ipv4/tcp_input.c            |   22 ++-
>  net/ipv4/tcp_ipv4.c             |    2 +
>  net/ipv4/tcp_minisocks.c        |    5 +
>  net/ipv4/tcp_output.c           |   18 ++-
>  net/ipv6/tcp_ipv6.c             |    1 +
>  15 files changed, 576 insertions(+), 75 deletions(-)
> 

A change in Documentation would be welcome (for the new sysctl).
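
For reference, something along these lines in
Documentation/networking/ip-sysctl.txt might do (wording is only a
suggestion, derived from the changelog above):

	tcp_friends - BOOLEAN
		If set, loopback TCP socket pairs become "friends" once the
		connection is established: payload is queued directly to the
		peer's receive queue and bypasses the TCP/IP data path, so a
		loopback capture (e.g. tcpdump) only sees the connection
		establishment and teardown segments.
		Default: 1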

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 642cb73..2fbca93 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -332,6 +332,7 @@ typedef unsigned char *sk_buff_data_t;
>   *	@cb: Control buffer. Free for use by every layer. Put private vars here
>   *	@_skb_refdst: destination entry (with norefcount bit)
>   *	@sp: the security path, used for xfrm
> + *	@friend: loopback friend socket
>   *	@len: Length of actual data
>   *	@data_len: Data length
>   *	@mac_len: Length of link layer header
> @@ -407,6 +408,7 @@ struct sk_buff {
>  #ifdef CONFIG_XFRM
>  	struct	sec_path	*sp;
>  #endif
> +	struct sock		*friend;

Is it really needed?

Since the skb won't pass through the other layers (qdisc, IP, ...), we
can probably use cb[] instead?
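
Roughly something like this, modeled on how TCP_SKB_CB() overlays
struct tcp_skb_cb on skb->cb[] (a sketch only, all names made up for
illustration):

	/* Sketch: keep the friend pointer in skb->cb[] instead of adding
	 * a new sk_buff field; these skbs never leave the loopback
	 * bypass path, so nothing else will clobber the control buffer.
	 */
	struct friend_skb_cb {
		struct sock	*friend;
	};

	#define FRIEND_SKB_CB(skb) ((struct friend_skb_cb *)&((skb)->cb[0]))

	static inline void skb_set_friend(struct sk_buff *skb, struct sock *sk)
	{
		FRIEND_SKB_CB(skb)->friend = sk;
	}

	static inline struct sock *skb_get_friend(const struct sk_buff *skb)
	{
		return FRIEND_SKB_CB(skb)->friend;
	}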

>  	unsigned int		len,
>  				data_len;
>  	__u16			mac_len,
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 4c0766e..2c74420 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -63,6 +63,7 @@ struct request_sock {
>  	unsigned long			expires;
>  	const struct request_sock_ops	*rsk_ops;
>  	struct sock			*sk;
> +	struct sock			*friend;
>  	u32				secid;
>  	u32				peer_secid;
>  };
> diff --git a/include/net/sock.h b/include/net/sock.h
> index dcb54a0..3b371f5 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -197,6 +197,7 @@ struct cg_proto;
>    *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
>    *	@sk_lock:	synchronizer
>    *	@sk_rcvbuf: size of receive buffer in bytes
> +  *	@sk_friend: loopback friend socket
>    *	@sk_wq: sock wait queue and async head
>    *	@sk_rx_dst: receive input route used by early tcp demux
>    *	@sk_dst_cache: destination cache
> @@ -286,6 +287,14 @@ struct sock {
>  	socket_lock_t		sk_lock;
>  	struct sk_buff_head	sk_receive_queue;
>  	/*
> +	 * If socket has a friend (sk_friend != NULL) then a send skb is
> +	 * enqueued directly to the friend's sk_receive_queue such that:
> +	 *
> +	 *        sk_sndbuf -> sk_sndbuf + sk_friend->sk_rcvbuf
> +	 *   sk_wmem_queued -> sk_friend->sk_rmem_alloc
> +	 */
> +	struct sock		*sk_friend;
> +	/*
>  	 * The backlog queue is special, it is always used with
>  	 * the per-socket spinlock held and requires low latency
>  	 * access. Therefore we special case it's implementation.
> @@ -673,24 +682,40 @@ static inline bool sk_acceptq_is_full(const struct sock *sk)
>  	return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
>  }
>  
> +static inline int sk_wmem_queued_get(const struct sock *sk)
> +{
> +	if (sk->sk_friend)

I'm trying to convince myself that sk->sk_friend cannot be changed to
NULL (by another cpu) after this test.
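If it can, one possible shape (a sketch only, not tested against this
patch) would be to load the pointer once so the test and the
dereference cannot disagree, e.g.:

	static inline int sk_wmem_queued_get(const struct sock *sk)
	{
		/* Sketch: read sk_friend once so a concurrent writer
		 * clearing it cannot race the test against the
		 * dereference; the friend socket's lifetime still has
		 * to be guaranteed separately (e.g. a reference held
		 * for the life of the pair).
		 */
		struct sock *friend = ACCESS_ONCE(sk->sk_friend);

		if (friend)
			return atomic_read(&friend->sk_rmem_alloc);
		return sk->sk_wmem_queued;
	}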


> +		return atomic_read(&sk->sk_friend->sk_rmem_alloc);
> +	else
> +		return sk->sk_wmem_queued;
> +}
> +
> +static inline int sk_sndbuf_get(const struct sock *sk)
> +{
> +	if (sk->sk_friend)
> +		return sk->sk_sndbuf + sk->sk_friend->sk_rcvbuf;
> +	else
> +		return sk->sk_sndbuf;
> +}
> +
>  /

The patch doesn't apply on net-next, so it's a bit hard to review it
properly ;)

Thanks


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
