netdev - Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Tue, 3 Jul 2018 18:38:53 +0000
From:   Lawrence Brakmo <brakmo@...com>
To:     Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>
CC:     Netdev <netdev@...r.kernel.org>, Kernel Team <Kernel-team@...com>,
        "Blake Matheny" <bmatheny@...com>, Alexei Starovoitov <ast@...com>,
        Steve Ibanez <sibanez@...nford.edu>,
        Eric Dumazet <eric.dumazet@...il.com>,
        "Daniel Borkmann" <daniel@...earbox.net>
Subject: Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

On 7/3/18, 6:15 AM, "Neal Cardwell" <ncardwell@...gle.com> wrote:

    On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng <ycheng@...gle.com> wrote:
    >
    > On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo <brakmo@...com> wrote:
    > >
    > > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
    > > notifications to keep track if it needs to send an ACK for packets that
    > > were received with a particular ECN state but whose ACK was delayed.
    > >
    > > Under some circumstances, for example when a delayed ACK is sent with a
    > > data packet, DCTCP state was not being updated due to a lack of
    > > notification that the previously delayed ACK was sent. As a result, it
    > > would sometimes send a duplicate ACK when a new data packet arrived.
    > >
    > > This patch insures that DCTCP's state is correctly updated so it will
    > > not send the duplicate ACK.
    > Sorry to chime-in late here (lame excuse: IETF deadline)
    >
    > IIRC this issue would exist prior to 4.11 kernel. While it'd be good
    > to fix that, it's not clear which patch introduced the regression
    > between 4.11 and 4.16? I assume you tested Eric's most recent quickack
    > fix.
    
    I don't think Larry is saying there's a regression between 4.11 and
    4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
    patch here:
    
      https://urldefense.proofpoint.com/v2/url?u=https-3A__patchwork.ozlabs.org_patch_935233_&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_g&m=7ko5C_ln2b7gZvz2A_UrZzz0AlcnhrW7-9KRahj_PEA&s=HcpZvo1TulN4-Y7Jhba5KM1MIaPwnBC95T8pLZfJESI&e=
    
    said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
    was still bad but with different netstat counters (no RTOs but still
    high tail latency):
    
    """
    On 4.11, pcap traces indicate that in some instances the 1st packet of
    the RPC is received but no ACK is sent before the packet is
    retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
    spurious.
    
    On 4.16, we don't see retransmits in netstat but the high tail latencies
    are still there.
    """
    
    I suspect the RTOs disappeared but latencies remained too high because
    between 4.11 and 4.16 we introduced:
      tcp: allow TLP in ECN CWR
      https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd
    
    So the RTOs probably disappeared because this commit turned them into
    TLPs. But the latencies remained high because the fundamental bug
    remained throughout 4.11 and 4.16 and today: the DCTCP use of
    tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
    when they should not have been.
    
    > In terms of the fix itself, it seems odd the tcp_send_ack() call in
    > DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
    > delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the
    > "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
    > cancelled, because that prior-ACK really is a supplementary old ACK.
    
    This patch is fixing an issue that's orthogonal to the one you are
    talking about. Using the taxonomy from our team's internal discussion
    yesterday, the issue you mention where the DCTCP "prior" ACK is
    cancelling delayed ACKs is issue (4); the issue that this particular
    "tcp: notify when a delayed ack is sent" patch from Larry fixes is
    issue (3). It's a bit tricky because both issues appeared in Larry's
    trace summary and packetdrill script to reproduce the issue.
    
    neal

I was able to track the patch that introduced the problem:

    commit 3759824da87b30ce7a35b4873b62b0ba38905ef5
    Author: Yuchung Cheng <ycheng@...gle.com>
    Date:   Wed Jul 1 14:11:15 2015 -0700

        tcp: PRR uses CRB mode by default and SS mode conditionally

I tested a kernel which reverted the relevant change (see diff below) and the high tail latencies of more than 40ms disappeared. However, the 10KB high percentile latencies are 4ms vs. less than 200us with my patches. It looks like the patch above ended up reducing the cwnd to 1 in the scenarios that were triggering the high tail latencies. That is, it increased the likelihood of triggering actual bugs in the network stack code that my patch set fixes.


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c5d70bc294e..50fabb07d739 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2468,13 +2468,14 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
                u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
                               tp->prior_cwnd - 1;
                sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
-       } else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
-                  !(flag & FLAG_LOST_RETRANS)) {
+       } else {
+//     } else if ((flag & FLAG_RETRANS_DATA_ACKED) &&
+//                !(flag & FLAG_LOST_RETRANS)) {
                sndcnt = min_t(int, delta,
                               max_t(int, tp->prr_delivered - tp->prr_out,
                                     newly_acked_sacked) + 1);
-       } else {
-               sndcnt = min(delta, newly_acked_sacked);
+//     } else {
+//             sndcnt = min(delta, newly_acked_sacked);
        }
        /* Force a fast retransmit upon entering fast recovery */
        sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));