lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK6E8=eyMghCKYEu8oGhDHEMd04suypTdwhUbQPavHDRJpqqjA@mail.gmail.com>
Date:	Tue, 23 Jul 2013 11:24:41 -0700
From:	Yuchung Cheng <ycheng@...gle.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	David Miller <davem@...emloft.net>,
	Rick Jones <rick.jones2@...com>,
	netdev <netdev@...r.kernel.org>,
	Neal Cardwell <ncardwell@...gle.com>,
	Michael Kerrisk <mtk.manpages@...il.com>
Subject: Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option

On Mon, Jul 22, 2013 at 8:27 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> Idea of this patch is to add optional limitation of number of
> unsent bytes in TCP sockets, to reduce usage of kernel memory.
>
> TCP receiver might announce a big window, and TCP sender autotuning
> might allow a large amount of bytes in write queue, but this has little
> performance impact if a large part of this buffering is wasted :
>
> Write queue needs to be large only to deal with large BDP, not
> necessarily to cope with scheduling delays (incoming ACKS make room
> for the application to queue more bytes)
>
> For most workloads, using a value of 128 KB or less is OK to give
> applications enough time to react to POLLOUT events in time
> (or being awaken in a blocking sendmsg())
>
> This patch adds two ways to set the limit :
>
> 1) Per socket option TCP_NOTSENT_LOWAT
>
> 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
> not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
> Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
>
>
> This changes poll()/select()/epoll() to report POLLOUT
> only if number of unsent bytes is below tp->nosent_lowat
>
> Note this might increase number of sendmsg()/sendfile() calls
> when using non blocking sockets,
> and increase number of context switches for blocking sockets.
>
> Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
> defined as :
>  Specify the minimum number of bytes in the buffer until
>  the socket layer will pass the data to the protocol)
>
> Tested:
>
> netperf sessions, and watching /proc/net/protocols "memory" column for TCP
>
> With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
> used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
> TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
> TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
> TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
> TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
>
> Using 128KB has no bad effect on the throughput or cpu usage
> of a single flow, although there is an increase of context switches.
>
> A bonus is that we hold socket lock for a shorter amount
> of time and should improve latencies of ACK processing.
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
>
>  Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
>
>            412,514 context-switches
>
>      200.034645535 seconds time elapsed
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
>
>  Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
>
>          2,675,818 context-switches
>
>      200.029651391 seconds time elapsed
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Neal Cardwell <ncardwell@...gle.com>
> Cc: Yuchung Cheng <ycheng@...gle.com>
Acked-By: Yuchung Cheng <ycheng@...gle.com>

Sorry I acked the wrong (v2) patch previously.

> ---
> v3: use the sk_stream_is_writeable() helper and fix the too many wakeup issue
> v2: title/changelog fix (TCP_NOSENT_LOWAT -> TCP_NOTSENT_LOWAT)
>
>  Documentation/networking/ip-sysctl.txt |   13 +++++++++++++
>  include/linux/tcp.h                    |    1 +
>  include/net/sock.h                     |   19 +++++++++++++------
>  include/net/tcp.h                      |   14 ++++++++++++++
>  include/uapi/linux/tcp.h               |    1 +
>  net/ipv4/sysctl_net_ipv4.c             |    7 +++++++
>  net/ipv4/tcp.c                         |    7 +++++++
>  net/ipv4/tcp_ipv4.c                    |    1 +
>  net/ipv4/tcp_output.c                  |    3 +++
>  net/ipv6/tcp_ipv6.c                    |    1 +
>  10 files changed, 61 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 1074290..53cea9b 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -516,6 +516,19 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max
>         this value is ignored.
>         Default: between 64K and 4MB, depending on RAM size.
>
> +tcp_notsent_lowat - UNSIGNED INTEGER
> +       A TCP socket can control the amount of unsent bytes in its write queue,
> +       thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> +       reports POLLOUT events if the amount of unsent bytes is below a per
> +       socket value, and if the write queue is not full. sendmsg() will
> +       also not add new buffers if the limit is hit.
> +
> +       This global variable controls the amount of unsent data for
> +       sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> +       to the global variable has immediate effect.
> +
> +       Default: UINT_MAX (0xFFFFFFFF)
> +
>  tcp_workaround_signed_windows - BOOLEAN
>         If set, assume no receipt of a window scaling option means the
>         remote TCP is broken and treats the window as a signed quantity.
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 472120b..9640803 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -238,6 +238,7 @@ struct tcp_sock {
>
>         u32     rcv_wnd;        /* Current receiver window              */
>         u32     write_seq;      /* Tail(+1) of data held in tcp send buffer */
> +       u32     notsent_lowat;  /* TCP_NOTSENT_LOWAT */
>         u32     pushed_seq;     /* Last pushed seq, required to talk to windows */
>         u32     lost_out;       /* Lost packets                 */
>         u32     sacked_out;     /* SACK'd packets                       */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index d0b5fde..b9f2b09 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -746,11 +746,6 @@ static inline int sk_stream_wspace(const struct sock *sk)
>
>  extern void sk_stream_write_space(struct sock *sk);
>
> -static inline bool sk_stream_memory_free(const struct sock *sk)
> -{
> -       return sk->sk_wmem_queued < sk->sk_sndbuf;
> -}
> -
>  /* OOB backlog add */
>  static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
> @@ -950,6 +945,7 @@ struct proto {
>         unsigned int            inuse_idx;
>  #endif
>
> +       bool                    (*stream_memory_free)(const struct sock *sk);
>         /* Memory pressure */
>         void                    (*enter_memory_pressure)(struct sock *sk);
>         atomic_long_t           *memory_allocated;      /* Current allocated memory. */
> @@ -1088,11 +1084,22 @@ static inline struct cg_proto *parent_cg_proto(struct proto *proto,
>  }
>  #endif
>
> +static inline bool sk_stream_memory_free(const struct sock *sk)
> +{
> +       if (sk->sk_wmem_queued >= sk->sk_sndbuf)
> +               return false;
> +
> +       return sk->sk_prot->stream_memory_free ?
> +               sk->sk_prot->stream_memory_free(sk) : true;
> +}
> +
>  static inline bool sk_stream_is_writeable(const struct sock *sk)
>  {
> -       return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
> +       return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
> +              sk_stream_memory_free(sk);
>  }
>
> +
>  static inline bool sk_has_memory_pressure(const struct sock *sk)
>  {
>         return sk->sk_prot->memory_pressure != NULL;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c586847..18fc999 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -284,6 +284,7 @@ extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
> +extern unsigned int sysctl_tcp_notsent_lowat;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> @@ -1539,6 +1540,19 @@ extern int tcp_gro_complete(struct sk_buff *skb);
>  extern void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr,
>                                 __be32 daddr);
>
> +static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp)
> +{
> +       return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat;
> +}
> +
> +static inline bool tcp_stream_memory_free(const struct sock *sk)
> +{
> +       const struct tcp_sock *tp = tcp_sk(sk);
> +       u32 notsent_bytes = tp->write_seq - tp->snd_nxt;
> +
> +       return notsent_bytes < tcp_notsent_lowat(tp);
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  extern int tcp4_proc_init(void);
>  extern void tcp4_proc_exit(void);
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 8d776eb..377f1e5 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -111,6 +111,7 @@ enum {
>  #define TCP_REPAIR_OPTIONS     22
>  #define TCP_FASTOPEN           23      /* Enable FastOpen on listeners */
>  #define TCP_TIMESTAMP          24
> +#define TCP_NOTSENT_LOWAT      25      /* limit number of unsent bytes in write queue */
>
>  struct tcp_repair_opt {
>         __u32   opt_code;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index b2c123c..69ed203 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -555,6 +555,13 @@ static struct ctl_table ipv4_table[] = {
>                 .extra1         = &one,
>         },
>         {
> +               .procname       = "tcp_notsent_lowat",
> +               .data           = &sysctl_tcp_notsent_lowat,
> +               .maxlen         = sizeof(sysctl_tcp_notsent_lowat),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec,
> +       },
> +       {
>                 .procname       = "tcp_rmem",
>                 .data           = &sysctl_tcp_rmem,
>                 .maxlen         = sizeof(sysctl_tcp_rmem),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 5eca906..c27e813 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2631,6 +2631,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>                 else
>                         tp->tsoffset = val - tcp_time_stamp;
>                 break;
> +       case TCP_NOTSENT_LOWAT:
> +               tp->notsent_lowat = val;
> +               sk->sk_write_space(sk);
> +               break;
>         default:
>                 err = -ENOPROTOOPT;
>                 break;
> @@ -2847,6 +2851,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>         case TCP_TIMESTAMP:
>                 val = tcp_time_stamp + tp->tsoffset;
>                 break;
> +       case TCP_NOTSENT_LOWAT:
> +               val = tp->notsent_lowat;
> +               break;
>         default:
>                 return -ENOPROTOOPT;
>         }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 2e3f129..2a5d5c4 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2800,6 +2800,7 @@ struct proto tcp_prot = {
>         .unhash                 = inet_unhash,
>         .get_port               = inet_csk_get_port,
>         .enter_memory_pressure  = tcp_enter_memory_pressure,
> +       .stream_memory_free     = tcp_stream_memory_free,
>         .sockets_allocated      = &tcp_sockets_allocated,
>         .orphan_count           = &tcp_orphan_count,
>         .memory_allocated       = &tcp_memory_allocated,
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 92fde8d..884efff 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
>  /* By default, RFC2861 behavior.  */
>  int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
>
> +unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
> +EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
> +
>  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                            int push_one, gfp_t gfp);
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 80fe69e..b792e87 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1924,6 +1924,7 @@ struct proto tcpv6_prot = {
>         .unhash                 = inet_unhash,
>         .get_port               = inet_csk_get_port,
>         .enter_memory_pressure  = tcp_enter_memory_pressure,
> +       .stream_memory_free     = tcp_stream_memory_free,
>         .sockets_allocated      = &tcp_sockets_allocated,
>         .memory_allocated       = &tcp_memory_allocated,
>         .memory_pressure        = &tcp_memory_pressure,
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ