Date:   Mon, 8 Apr 2019 12:16:12 -0400
From:   Neal Cardwell <ncardwell@...gle.com>
To:     brakmo <brakmo@...com>
Cc:     netdev <netdev@...r.kernel.org>, Martin Lau <kafai@...com>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop

On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@...com> wrote:
>
> When a packet is dropped when calling queue_xmit in __tcp_transmit_skb
> and packets_out is 0, it is beneficial to set a small probe timer.
> Otherwise, the throughput for the flow can suffer because it may need to
> depend on the probe timer to start sending again. The default value for
> the probe timer is at least 200ms; this patch sets it to 20ms when a
> packet is dropped and there are no other packets in flight.
>
> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
> used to specify the duration of the probe timer for the case described
> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
> disables setting the probe timer with a small value.
>
> Signed-off-by: Lawrence Brakmo <brakmo@...com>
> ---
>  include/net/netns/ipv4.h   |  1 +
>  net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
>  net/ipv4/tcp_ipv4.c        |  1 +
>  net/ipv4/tcp_output.c      | 18 +++++++++++++++---
>  4 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 7698460a3dd1..26c52d294dea 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -166,6 +166,7 @@ struct netns_ipv4 {
>         int sysctl_tcp_wmem[3];
>         int sysctl_tcp_rmem[3];
>         int sysctl_tcp_comp_sack_nr;
> +       int sysctl_tcp_probe_on_drop_ms;
>         unsigned long sysctl_tcp_comp_sack_delay_ns;
>         struct inet_timewait_death_row tcp_death_row;
>         int sysctl_max_syn_backlog;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 2316c08e9591..b1e4420b6d6c 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
>  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
>  static int comp_sack_nr_max = 255;
>  static u32 u32_max_div_HZ = UINT_MAX / HZ;
> +static int probe_on_drop_max = TCP_RTO_MIN;
>
>  /* obsolete */
>  static int sysctl_tcp_low_latency __read_mostly;
> @@ -1228,6 +1229,15 @@ static struct ctl_table ipv4_net_table[] = {
>                 .extra1         = &zero,
>                 .extra2         = &comp_sack_nr_max,
>         },
> +       {
> +               .procname       = "tcp_probe_on_drop_ms",
> +               .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,

The tcp_reset_xmit_timer() function expects a time in jiffies, so
this would probably need to be proc_dointvec_ms_jiffies?
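
For what it's worth, a rough sketch of the entry with that handler
swapped in (note: if I'm reading proc_dointvec_ms_jiffies correctly,
it does not honor the extra1/extra2 bounds, so the 0..TCP_RTO_MIN
clamp would need to be enforced some other way):

        {
                .procname       = "tcp_probe_on_drop_ms",
                .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_ms_jiffies,
        },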

> +               .extra1         = &zero,
> +               .extra2         = &probe_on_drop_max,
> +       },
>         {
>                 .procname       = "udp_rmem_min",
>                 .data           = &init_net.ipv4.sysctl_udp_rmem_min,
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 3979939804b7..3df6735f0f07 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net)
>         spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
>         net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
>         atomic_set(&net->ipv4.tfo_active_disable_times, 0);
> +       net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;

The tcp_reset_xmit_timer() function expects a time in jiffies, so
this would probably need to be msecs_to_jiffies(20)?
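
I.e., something roughly like this (a sketch, assuming the field then
holds jiffies rather than milliseconds):

        net->ipv4.sysctl_tcp_probe_on_drop_ms = msecs_to_jiffies(20);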

>
>         /* Reno is always built in */
>         if (!net_eq(net, &init_net) &&
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e265d1aeeb66..f101e4d7c1ae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
>
>         err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
>
> -       if (unlikely(err > 0)) {
> -               tcp_enter_cwr(sk);
> -               err = net_xmit_eval(err);
> +       if (unlikely(err)) {
> +               if (unlikely(err > 0)) {
> +                       tcp_enter_cwr(sk);
> +                       err = net_xmit_eval(err);
> +               }
> +               /* Packet was dropped. If there are no packets out,
> +                * we may need to depend on probe timer to start sending
> +                * again. Hence, use a smaller value.
> +                */
> +               if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
> +                   sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
> +                       tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
> +                                            sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
> +                                            TCP_RTO_MAX, NULL);
> +               }

My reading of the code is that any timer that was set here would
typically be quickly overridden by the tcp_check_probe_timer() that
usually happens after most tcp_write_xmit() calls. Am I reading that
correctly?
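
For reference, the call path I have in mind is
__tcp_push_pending_frames() in net/ipv4/tcp_output.c, which as I read
it looks roughly like this (comment mine):

void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
                               int nonagle)
{
        if (unlikely(sk->sk_state == TCP_CLOSE))
                return;

        /* tcp_write_xmit() returns true when nothing is left in flight
         * but data is still queued, and that re-arms probe0 using
         * tcp_probe0_base(), i.e. at least 200ms out.
         */
        if (tcp_write_xmit(sk, cur_mss, nonagle, 0,
                           sk_gfp_mask(sk, GFP_ATOMIC)))
                tcp_check_probe_timer(sk);
}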

There seems to be a philosophical difference between this patch and
the existing logic in tcp_check_probe_timer()/tcp_probe0_base():

/* Something is really bad, we could not queue an additional packet,
 * because qdisc is full or receiver sent a 0 window, or we are paced.
 * We do not want to add fuel to the fire, or abort too early,
 * so make sure the timer we arm now is at least 200ms in the future,
 * regardless of current icsk_rto value (as it could be ~2ms)
 */
static inline unsigned long tcp_probe0_base(const struct sock *sk)
{
        return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
}
...
static inline void tcp_check_probe_timer(struct sock *sk)
{
        if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
                tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                     tcp_probe0_base(sk), TCP_RTO_MAX,
                                     NULL);
}

If we apply some version of this patch then we should probably at
least update this comment above tcp_probe0_base() to clarify the new
strategy.
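
Something along these lines, perhaps (just a rough sketch of the
wording; the exact semantics would need to match whatever the final
patch does):

/* Something is really bad, we could not queue an additional packet,
 * because qdisc is full or receiver sent a 0 window, or we are paced.
 * We do not want to add fuel to the fire, or abort too early, so make
 * sure the timer we arm now is at least 200ms in the future, regardless
 * of current icsk_rto value (as it could be ~2ms) -- unless the drop
 * was local and nothing is in flight, in which case tcp_probe_on_drop_ms
 * may arm a much shorter probe to resume sending quickly.
 */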

cheers,
neal
