Message-ID: <1466616854.6850.69.camel@edumazet-glaptop3.roam.corp.google.com>
Date:	Wed, 22 Jun 2016 10:34:14 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Jason Baron <jbaron@...mai.com>
Cc:	davem@...emloft.net, netdev@...r.kernel.org
Subject: Re: [PATCH v2 net-next 2/2] tcp: reduce cpu usage when SO_SNDBUF is
 set

On Wed, 2016-06-22 at 11:32 -0400, Jason Baron wrote:
> From: Jason Baron <jbaron@...mai.com>
> 
> When SO_SNDBUF is set and we are under tcp memory pressure, the effective
> write buffer space can be much lower than what was set using SO_SNDBUF. For
> example, we may have set the buffer to 100kb, but we may only be able to
> write 10kb. In this scenario poll()/select()/epoll(), are going to
> continuously return POLLOUT, followed by -EAGAIN from write(), and thus
> result in a tight loop. Note that epoll in edge-triggered mode does not have
> this issue, since it only triggers once data has been ack'd. There is no
> issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
> the sk->sndbuf.
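
To illustrate the loop being described, here is a minimal userspace sketch
(not part of either patch; 'fd' is assumed to be a non-blocking, connected
TCP socket with SO_SNDBUF set):

#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Under tcp memory pressure, poll() keeps reporting POLLOUT while
 * write() keeps failing with EAGAIN, so this loop spins without
 * making progress.
 */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
	size_t off = 0;

	while (off < len) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };
		ssize_t n;

		if (poll(&pfd, 1, -1) < 0)	/* reports "writable"... */
			return -1;

		n = write(fd, buf + off, len - off);
		if (n > 0)
			off += n;
		else if (n < 0 && errno == EAGAIN)
			continue;		/* ...yet no buffer space: busy loop */
		else if (n < 0 && errno != EINTR)
			return -1;
	}
	return off;
}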


Still, generating one POLLOUT event per incoming ACK will not please
epoll() users in edge-trigger mode.

The host is under global memory pressure, so we probably want to drain
socket queues _and_ reduce CPU pressure.

A strategy that ensures all sockets converge to small queues ASAP is
simply the best answer.

Letting big SO_SNDBUF offenders hog memory while their queues are large
is not going to help sockets that cannot get ACKs
(elephants get more ACKs than mice, so they have a better chance of
succeeding with their new allocations).

Your patch adds a lot of complex logic to tcp_sendmsg() and
tcp_sendpage().


I would prefer a simpler patch like this:


 net/ipv4/tcp.c |   26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed147449c..68bf035180d5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -370,6 +370,26 @@ static int retrans_to_secs(u8 retrans, int timeout, int rto_max)
 	return period;
 }
 
+static bool tcp_throttle_writes(const struct sock *sk)
+{
+	return unlikely(tcp_memory_pressure &&
+			sk->sk_wmem_queued >= sysctl_tcp_wmem[0]);
+}
+
+static bool tcp_is_writeable(const struct sock *sk)
+{
+	if (tcp_throttle_writes(sk))
+		return false;
+	return sk_stream_is_writeable(sk);
+}
+
+static void tcp_write_space(struct sock *sk)
+{
+	if (tcp_throttle_writes(sk))
+		return;
+	sk_stream_write_space(sk);
+}
+
 /* Address-family independent initialization for a tcp_sock.
  *
  * NOTE: A lot of things set to zero explicitly by call to
@@ -412,7 +432,7 @@ void tcp_init_sock(struct sock *sk)
 
 	sk->sk_state = TCP_CLOSE;
 
-	sk->sk_write_space = sk_stream_write_space;
+	sk->sk_write_space = tcp_write_space;
 	sock_set_flag(sk, SOCK_USE_WRITE_QUEUE);
 
 	icsk->icsk_sync_mss = tcp_sync_mss;
@@ -517,7 +537,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 			mask |= POLLIN | POLLRDNORM;
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-			if (sk_stream_is_writeable(sk)) {
+			if (tcp_is_writeable(sk)) {
 				mask |= POLLOUT | POLLWRNORM;
 			} else {  /* send SIGIO later */
 				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -529,7 +549,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 				 * pairs with the input side.
 				 */
 				smp_mb__after_atomic();
-				if (sk_stream_is_writeable(sk))
+				if (tcp_is_writeable(sk))
 					mask |= POLLOUT | POLLWRNORM;
 			}
 		} else
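
To spell out the intent of the above: sysctl_tcp_wmem[0] is the tcp_wmem
minimum (typically 4 KB), so once the host enters tcp memory pressure, a
socket that already has at least that much queued stops getting POLLOUT
and write_space wakeups until the pressure clears or its queue drains back
below that floor. Level-triggered pollers then sleep in poll()/epoll_wait()
instead of spinning on POLLOUT/-EAGAIN, and every socket is pushed to
converge toward a small footprint.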

