netdev - Re: tcp hang when socket fills up ?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180413094255.GA17273@nautica>
Date:   Fri, 13 Apr 2018 18:42:55 +0900
From:   Dominique Martinet <asmadeus@...ewreck.org>
To:     netdev@...r.kernel.org
Subject: Re: tcp hang when socket fills up ?


Note this is mostly me talking to myself, but in case anyone else hits
the same issues I might as well post my meagre progress.

Dominique Martinet wrote on Fri, Apr 06, 2018:
> (current kernel: vanilla 4.14.29)

reproduced in a fedora VM on that host with a 4.16.1-300.fc28.x86_64
kernel, since this one has DEBUG_INFO=y and I was lazy (but haven't seen
any patch about that kind of stuff recently so probably still valid)

Other main difference is the qdisc, VM is using fq_codel, host is
directly on wireless so mq with 4 pfifo_fast queues - it is harder to
reproduce on the VM (even on another VM with the same kernel) so I'd
put the difference down to the qdisc, but I was able to reproduce with
both ultimately.
(update: it actually was still fairly easy to reproduce until it got
later (coworkers left?), going from ~5-15s to reproduce to multiple
minutes, so this likely depends on net quality a lot. I couldn't
reproduce by fiddling with netem on a local network though...)

> [please find previous email for setup/tcpdump output]

So I have a crash dump with a socket / inet_sock that are blocked, since
this is a VM I even get gdb in bonus..

With the crash dump, I can confirm that the socket is not available for
writing (sk->sk_sndbuf - sk->sk_wmem_queued < sk->sk_wmem_queued >> 1),
but that doesn't help much if I can't tell why we're not taking acks in

With gdb, I set a breakpoint to tcp_ack (net/ipv4/tcp_input.c) as I
think that's the function that should handle my ack, and that gets the
replay.

First, abusing next, the flow seems to be something like
(I folded if/else not taken)

static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
{       
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        struct tcp_sacktag_state sack_state;
        struct rate_sample rs = { .prior_delivered = 0 };
        u32 prior_snd_una = tp->snd_una;
        bool is_sack_reneg = tp->is_sack_reneg;
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        bool is_dupack = false;
        int prior_packets = tp->packets_out;
        u32 delivered = tp->delivered;
        u32 lost = tp->lost;
        int rexmit = REXMIT_NONE; /* Flag to (re)transmit to recover
		losses */
        u32 prior_fack;

        sack_state.first_sackt = 0;
        sack_state.rate = &rs;

        /* We very likely will need to access rtx queue. */
        prefetch(sk->tcp_rtx_queue.rb_node);

        /* If the ack is older than previous acks
         * then we can probably ignore it.
         */
        if (before(ack, prior_snd_una)) {
		}

        /* If the ack includes data we haven't sent yet, discard
         * this segment (RFC793 Section 3.9).
         */
        if (after(ack, tp->snd_nxt))

        if (after(ack, prior_snd_una)) {
        }

        prior_fack = tcp_is_sack(tp) ? tcp_highest_sack_seq(tp) :
		tp->snd_una;
        rs.prior_in_flight = tcp_packets_in_flight(tp);

        /* ts_recent update must be made after we are sure that the
		packet                           
         * is in window.
         */
        if (flag & FLAG_UPDATE_TS_RECENT)

        if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
        } else {
                u32 ack_ev_flags = CA_ACK_SLOWPATH;

                if (ack_seq != TCP_SKB_CB(skb)->end_seq)
                else
                        NET_INC_STATS(sock_net(sk),
						LINUX_MIB_TCPPUREACKS);

                flag |= tcp_ack_update_window(sk, skb, ack, ack_seq);

                if (TCP_SKB_CB(skb)->sacked)

                if (tcp_ecn_rcv_ecn_echo(tp, tcp_hdr(skb))) {
                }

                if (flag & FLAG_WIN_UPDATE)

                tcp_in_ack_event(sk, ack_ev_flags);
        }
        
        /* We passed data and got it acked, remove any soft error
         * log. Something worked...
         */
        sk->sk_err_soft = 0;
        icsk->icsk_probes_out = 0;
        tp->rcv_tstamp = tcp_jiffies32;
        if (!prior_packets)
                goto no_queue;

no_queue:
        /* If data was DSACKed, see if we can undo a cwnd reduction. */
        if (flag & FLAG_DSACKING_ACK)
                tcp_fastretrans_alert(sk, prior_snd_una, is_dupack,
				&flag,
                                      &rexmit);
        /* If this ack opens up a zero window, clear backoff.  It was
         * being used to time the probes, and is probably far higher
		 than
         * it needs to be for normal retransmission.
         */
        tcp_ack_probe(sk);

        if (tp->tlp_high_seq)
                tcp_process_tlp_ack(sk, ack, flag);
        return 1;


And here is 'info local' towards the end of the function:
icsk = 0xffff88001740b800
tp = 0xffff88001740b800
sack_state = {reord = 84, first_sackt = 0, last_sackt = 0, rate = 0xffff88003fc83ae0, flag = 0, mss_now = 0}
rs = {prior_mstamp = 0, prior_delivered = 0, delivered = 0, interval_us = 0, rtt_us = 0, losses = 0, acked_sacked = 0, prior_in_flight = 0, is_app_limited = false, is_retrans = false, is_ack_delayed = false}
prior_snd_una = 2896339291
is_sack_reneg = false
ack_seq = 2651638615
ack = 2896339291
is_dupack = false
prior_packets = 0
delivered = 172
lost = 0
rexmit = 0
prior_fack = 2896339291
__vpp_verify = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>
pao_ID__ = <optimized out>
pao_tmp__ = <optimized out>


So as I was seeing, we have ack = prior_fack = prior_snd_una (which is > ack_seq)

We're in the !prior_packets goto to no_queue though so we don't pass
tcp_clean_rtx_queue - so that explains why the replays don't help. I'm
ok with that. But all these traces actually don't tell me why the first
time we saw that packed didn't take the ack and decreate sk_wmem_queued
properly.

This isn't easy enough to reproduce that I feel confident grabbing the
first call to that function that fails though, so I guess I'll actually
have to read the code I'm debugging instead of trying to just plow
through it...

Hints still welcome, if anyone has an idea.

-- 
Dominique Martinet | Asmadeus