lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20120509055038.584464078@decadent.org.uk>
Date:	Wed, 09 May 2012 06:51:32 +0100
From:	Ben Hutchings <ben@...adent.org.uk>
To:	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Cc:	torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
	alan@...rguk.ukuu.org.uk, Eric Dumazet <eric.dumazet@...il.com>,
	Tom Herbert <therbert@...gle.com>,
	Nandita Dukkipati <nanditad@...gle.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	Yuchung Cheng <ycheng@...gle.com>,
	"H.K. Jerry Chu" <hkchu@...gle.com>,
	Maciej Żenczykowski <maze@...gle.com>,
	Mahesh Bandewar <maheshb@...gle.com>,
	Ilpo JÀrvinen <ilpo.jarvinen@...sinki.fi>,
	"David S. Miller" <davem@...emloft.net>
Subject: [ 063/167] [PATCH 03/26] tcp: allow splice() to build full TSO packets

3.2-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Eric Dumazet <eric.dumazet@...il.com>

[ This combines upstream commit
  2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix
  commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ]

vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() try
to split these skb to MSS multiples.

4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals but for the
last page, or on all pages if user specified SPLICE_F_MORE splice()
flag.

In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert <therbert@...gle.com>
Cc: Nandita Dukkipati <nanditad@...gle.com>
Cc: Neal Cardwell <ncardwell@...gle.com>
Cc: Tom Herbert <therbert@...gle.com>
Cc: Yuchung Cheng <ycheng@...gle.com>
Cc: H.K. Jerry Chu <hkchu@...gle.com>
Cc: Maciej Żenczykowski <maze@...gle.com>
Cc: Mahesh Bandewar <maheshb@...gle.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@...il.com>
Signed-off-by: David S. Miller <davem@...emloft.net>
Signed-off-by: Ben Hutchings <ben@...adent.org.uk>
---
 fs/splice.c            |    5 ++++-
 include/linux/socket.h |    2 +-
 net/ipv4/tcp.c         |    2 +-
 net/socket.c           |    6 +++---
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index fa2defa..6d0dfb8 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -31,6 +31,7 @@
 #include <linux/uio.h>
 #include <linux/security.h>
 #include <linux/gfp.h>
+#include <linux/socket.h>
 
 /*
  * Attempt to steal a page from a pipe buffer. This should perhaps go into
@@ -691,7 +692,9 @@ static int pipe_to_sendpage(struct pipe_inode_info *pipe,
 	if (!likely(file->f_op && file->f_op->sendpage))
 		return -EINVAL;
 
-	more = (sd->flags & SPLICE_F_MORE) || sd->len < sd->total_len;
+	more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
+	if (sd->len < sd->total_len)
+		more |= MSG_SENDPAGE_NOTLAST;
 	return file->f_op->sendpage(file, buf->page, buf->offset,
 				    sd->len, &pos, more);
 }
diff --git a/include/linux/socket.h b/include/linux/socket.h
index d0e77f6..ad919e0 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -265,7 +265,7 @@ struct ucred {
 #define MSG_NOSIGNAL	0x4000	/* Do not generate SIGPIPE */
 #define MSG_MORE	0x8000	/* Sender will send more */
 #define MSG_WAITFORONE	0x10000	/* recvmmsg(): block until 1+ packets avail */
-
+#define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_EOF         MSG_FIN
 
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34f5db1..36611ab 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -860,7 +860,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;
 
diff --git a/net/socket.c b/net/socket.c
index 2dce67a..273cbce 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -791,9 +791,9 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 
 	sock = file->private_data;
 
-	flags = !(file->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
-	if (more)
-		flags |= MSG_MORE;
+	flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	/* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
+	flags |= more;
 
 	return kernel_sendpage(sock, page, offset, size, flags);
 }
-- 
1.7.10



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ