Message-ID: <22216D9A-6758-4A72-B0DB-8D0E1CC16752@fb.com>
Date:   Sat, 4 May 2019 02:33:43 +0000
From:   Lawrence Brakmo <brakmo@...com>
To:     Neal Cardwell <ncardwell@...gle.com>
CC:     netdev <netdev@...r.kernel.org>, Martin Lau <kafai@...com>,
        "Alexei Starovoitov" <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        "Eric Dumazet" <eric.dumazet@...il.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        "Kernel Team" <Kernel-team@...com>
Subject: Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop

Sorry for the delay; it was due to a combination of changing teams and getting pneumonia. Answers are inline, but note that I will be sending a patchset for review that does not include these changes to the probe timer.

On 4/8/19, 9:17 AM, "Neal Cardwell" <ncardwell@...gle.com> wrote:

    On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@...com> wrote:
    >
    > When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
    > and packets_out is 0, it is beneficial to set a small probe timer.
    > Otherwise, the throughput for the flow can suffer because it may need to
    > depend on the probe timer to start sending again. The default value for
    > the probe timer is at least 200ms; this patch sets it to 20ms when a
    > packet is dropped and there are no other packets in flight.
    >
    > This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
    > used to specify the duration of the probe timer for the case described
    > earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
    > disables setting the probe timer with a small value.
    >
    > Signed-off-by: Lawrence Brakmo <brakmo@...com>
    > ---
    >  include/net/netns/ipv4.h   |  1 +
    >  net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
    >  net/ipv4/tcp_ipv4.c        |  1 +
    >  net/ipv4/tcp_output.c      | 18 +++++++++++++++---
    >  4 files changed, 27 insertions(+), 3 deletions(-)
    >
    > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
    > index 7698460a3dd1..26c52d294dea 100644
    > --- a/include/net/netns/ipv4.h
    > +++ b/include/net/netns/ipv4.h
    > @@ -166,6 +166,7 @@ struct netns_ipv4 {
    >         int sysctl_tcp_wmem[3];
    >         int sysctl_tcp_rmem[3];
    >         int sysctl_tcp_comp_sack_nr;
    > +       int sysctl_tcp_probe_on_drop_ms;
    >         unsigned long sysctl_tcp_comp_sack_delay_ns;
    >         struct inet_timewait_death_row tcp_death_row;
    >         int sysctl_max_syn_backlog;
    > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
    > index 2316c08e9591..b1e4420b6d6c 100644
    > --- a/net/ipv4/sysctl_net_ipv4.c
    > +++ b/net/ipv4/sysctl_net_ipv4.c
    > @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
    >  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
    >  static int comp_sack_nr_max = 255;
    >  static u32 u32_max_div_HZ = UINT_MAX / HZ;
    > +static int probe_on_drop_max = TCP_RTO_MIN;
    >
    >  /* obsolete */
    >  static int sysctl_tcp_low_latency __read_mostly;
    > @@ -1228,6 +1229,15 @@ static struct ctl_table ipv4_net_table[] = {
    >                 .extra1         = &zero,
    >                 .extra2         = &comp_sack_nr_max,
    >         },
    > +       {
    > +               .procname       = "tcp_probe_on_drop_ms",
    > +               .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
    > +               .maxlen         = sizeof(int),
    > +               .mode           = 0644,
    > +               .proc_handler   = proc_dointvec_minmax,
    
    The tcp_reset_xmit_timer() function expects a time in jiffies, so
    this would probably need to be proc_dointvec_ms_jiffies?

Yes, good catch.
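For reference, a sketch of what the corrected entry might look like (untested, and note one open question: proc_dointvec_ms_jiffies, unlike proc_dointvec_minmax, does not clamp against extra1/extra2, so the 0..TCP_RTO_MIN bound would need separate handling):

```c
/* Sketch only, not a tested patch: the same entry with the handler
 * swapped so userspace reads/writes milliseconds while the stored
 * value is in jiffies, as tcp_reset_xmit_timer() expects.
 * Caveat: proc_dointvec_ms_jiffies ignores extra1/extra2, so the
 * 0..TCP_RTO_MIN range check would have to be enforced elsewhere. */
{
        .procname       = "tcp_probe_on_drop_ms",
        .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
        .maxlen         = sizeof(int),
        .mode           = 0644,
        .proc_handler   = proc_dointvec_ms_jiffies,
},
```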
    
    > +               .extra1         = &zero,
    > +               .extra2         = &probe_on_drop_max,
    > +       },
    >         {
    >                 .procname       = "udp_rmem_min",
    >                 .data           = &init_net.ipv4.sysctl_udp_rmem_min,
    > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
    > index 3979939804b7..3df6735f0f07 100644
    > --- a/net/ipv4/tcp_ipv4.c
    > +++ b/net/ipv4/tcp_ipv4.c
    > @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net)
    >         spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
    >         net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
    >         atomic_set(&net->ipv4.tfo_active_disable_times, 0);
    > +       net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;
    
    The tcp_reset_xmit_timer() function expects a time in jiffies, so
    this would probably need to be msecs_to_jiffies(20)?

Correct, thanks.
    
    >
    >         /* Reno is always built in */
    >         if (!net_eq(net, &init_net) &&
    > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
    > index e265d1aeeb66..f101e4d7c1ae 100644
    > --- a/net/ipv4/tcp_output.c
    > +++ b/net/ipv4/tcp_output.c
    > @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
    >
    >         err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
    >
    > -       if (unlikely(err > 0)) {
    > -               tcp_enter_cwr(sk);
    > -               err = net_xmit_eval(err);
    > +       if (unlikely(err)) {
    > +               if (unlikely(err > 0)) {
    > +                       tcp_enter_cwr(sk);
    > +                       err = net_xmit_eval(err);
    > +               }
    > +               /* Packet was dropped. If there are no packets out,
    > +                * we may need to depend on probe timer to start sending
    > +                * again. Hence, use a smaller value.
    > +                */
    > +               if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
    > +                   sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
    > +                       tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
    > +                                            sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
    > +                                            TCP_RTO_MAX, NULL);
    > +               }
    
    My reading of the code is that any timer that was set here would
    typically be quickly overridden by the tcp_check_probe_timer() that
    usually happens after most tcp_write_xmit() calls. Am I reading that
    correctly?

No. tcp_check_probe_timer() first checks whether a timer is already pending
and will not overwrite it.
    
    There seems to be a philosophical difference between this patch and
    the existing logic in tcp_check_probe_timer()/tcp_probe0_base():
    
    /* Something is really bad, we could not queue an additional packet,
     * because qdisc is full or receiver sent a 0 window, or we are paced.
     * We do not want to add fuel to the fire, or abort too early,
     * so make sure the timer we arm now is at least 200ms in the future,
     * regardless of current icsk_rto value (as it could be ~2ms)
     */
    static inline unsigned long tcp_probe0_base(const struct sock *sk)
    {
      return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
    }
    ...
    static inline void tcp_check_probe_timer(struct sock *sk)
    {
            if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
                    tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                         tcp_probe0_base(sk), TCP_RTO_MAX,
                                         NULL);
    }
    
    If we apply some version of this patch then we should probably at
    least update this comment above tcp_probe0_base() to clarify the new
    strategy.
    
Thanks for pointing this out. In the use case we consider, packets can be
dropped as a result of an applied rate limit, so drops are not a sign that
things are really bad (and in my measurements they do not occur very often).
In retrospect, I should have set the default sysctl value to 0, so the
behavior would not change unless one opts in by setting the sysctl to a
non-zero value.

    cheers,
    neal
    
