netdev - Re: [PATCH net-next] tcp: reduce cpu usage when SO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1466461744.6850.29.camel@edumazet-glaptop3.roam.corp.google.com>
Date:	Mon, 20 Jun 2016 15:29:04 -0700
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Jason Baron <jbaron@...mai.com>
Cc:	davem@...emloft.net, netdev@...r.kernel.org
Subject: Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set

On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote:
> From: Jason Baron <jbaron@...mai.com>
> 
> When SO_SNDBUF is set and we are under tcp memory pressure, the effective
> write buffer space can be much lower than what was set using SO_SNDBUF. For
> example, we may have set the buffer to 100kb, but we may only be able to
> write 10kb. In this scenario poll()/select()/epoll(), are going to
> continuously return POLLOUT, followed by -EAGAIN from write(), and thus
> result in a tight loop. Note that epoll in edge-triggered does not have
> this issue since it only triggers once data has been ack'd. There is no
> issue here when SO_SNDBUF is not set, since the tcp layer will auto tune
> the sk->sndbuf.
> 
> Introduce a new socket flag, SOCK_SHORT_WRITE, such that we can mark the
> socket when we have a short write due to memory pressure. By then testing
> for SOCK_SHORT_WRITE in tcp_poll(), we hold off the POLLOUT until a
> non-zero amount of data has been ack'd. In a previous approach:
> http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a
> new field in 'struct sock' to solve this issue, but its undesirable to add
> bloat to 'struct sock'. We also could address this issue, by waiting for
> the buffer to become completely empty, but that may reduce throughput since
> the write buffer would be empty while waiting for subsequent writes. This
> change brings us in line with the current epoll edge-trigger behavior for
> the poll()/select() and epoll level-trigger.
> 
> We guarantee that SOCK_SHORT_WRITE will eventually clear, since when we set
> SOCK_SHORT_WRITE, we make sure that sk_wmem_queued is non-zero and
> SOCK_NOSPACE is set as well (in sk_stream_wait_memory()).
> 
> I tested this patch using 10,000 sockets, and setting SO_SNDBUF on the
> server side, to induce tcp memory pressure. A single server thread reduced
> its cpu usage from 100% to 19%, while maintaining the same level of
> throughput.
> 
> Signed-off-by: Jason Baron <jbaron@...mai.com>
> ---


When this bug was added, and by which commit ?

This looks serious, but your patch looks way too complex.

This SOCK_SHORT_WRITE bit constantly cleared in sk_stream_write_space()
is expensive, because it dirties sk_flags which is not supposed to
written often. This field is located in a read mostly cache line.

I believe my suggestion was not to add a per socket bit,
but have a global state like tcp_memory_pressure. (or reuse it)

Ie , when under global memory pressure, tcp_poll() should apply a
strategy to give POLLOUT only on sockets having less than 4K
(tcp_wmem[0]) of memory consumed in their write queue.