netdev - Re: [PATCH net] ipv6: gro: flush instead of assuming different flows on hop

Message-ID: <CANn89iJY=oDHY+Fe=u+GHeb07LCUC305rwLehsE2Wq1TcidP8Q@mail.gmail.com>
Date:   Fri, 21 Jan 2022 08:37:12 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     David Miller <davem@...emloft.net>,
        David Ahern <dsahern@...il.com>,
        Paolo Abeni <pabeni@...hat.com>,
        Herbert Xu <herbert@...dor.apana.org.au>,
        netdev <netdev@...r.kernel.org>,
        Yuchung Cheng <ycheng@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH net] ipv6: gro: flush instead of assuming different flows
 on hop_limit mismatch

On Fri, Jan 21, 2022 at 7:15 AM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Fri, 21 Jan 2022 00:55:08 -0800 Eric Dumazet wrote:
> > On Thu, Jan 20, 2022 at 5:19 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > > IPv6 GRO considers packets to belong to different flows when their
> > > hop_limit is different. This seems counter-intuitive, the flow is
> > > the same. hop_limit may vary because of various bugs or hacks but
> > > that doesn't mean it's okay for GRO to reorder packets.
> > >
> > > Practical impact of this problem on overall TCP performance
> > > is unclear, but TCP itself detects this reordering and bumps
> > > TCPSACKReorder resulting in user complaints.
> > >
> > > Note that the code plays an easy to miss trick by upcasting next_hdr
> > > to a u16 pointer and compares next_hdr and hop_limit in one go.
> > > Coalesce the flush setting to reduce the instruction count a touch.
> >
> > There are downsides to this change.
> >
> > We had an internal discussion at Google years ago about this
> > difference in behavior of IPv6/IPv4.
> >
> > We came to the conclusion the IPv6 behavior was better for our needs
> > (and we do not care much about IPv4 GRO, since Google DC traffic is
> > 99.99% IPv6).
> >
> > In our case, we wanted to keep this 'ipv6 feature' because we were
> > experimenting with the idea of sending TSO packets with different
> > flowlabels, to use different paths in the network, to increase nominal
> > throughput for WAN flows (one flow would use multiple fiber links).
> >
> > The issue with 'ipv4 gro style about ttl mismatch' was that because of
> > small differences in RTT for each path, a receiver could very well
> > receive mixed packets.
> >
> > Even without playing with ECMP hashes, this scenario can happen if the sender
> > uses a bonding device in balance-rr mode.
> >
> > After your change, GRO would be defeated and deliver one MSS at a time
> > to TCP stack.
>
> Indeed. Sounds like we're trading correctness for an optimization of a
> questionable practical application, but our motivation isn't 100% pure
> either [1] so whatever way we can fix this is fine by me :)
>
> [1] We have some shenanigans that bump TTL to indicate re-transmitted
> packets so we can identify them in the network.
>
> > We implemented SACK compression in the TCP stack to avoid extra SACKs
> > being sent by the receiver.
> >
> > We have an extension of this SACK compression for TCP flows terminated
> > by Google servers, since modern TCP stacks do not need the old rule of
> > TCP_FASTRETRANS_THRESH DUPACKs to start retransmits.
> >
> > Something like this pseudo code:
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index dc49a3d551eb919baf5ad812ef21698c5c7b9679..d72554ab70fd2e16ed60dc78a905f4aa1414f8c9 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5494,7 +5494,8 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
> >         }
> >         if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
> >                 tp->dup_ack_counter++;
> > -               goto send_now;
> > +               if (peer_is_using_old_rule_about_fastretrans(tp))
> > +                       goto send_now;
> >         }
> >         tp->compressed_ack++;
> >         if (hrtimer_is_queued(&tp->compressed_ack_timer))
> >
>
> Is this something we could upstream / test? peer_is_using.. does not
> exist upstream.

Sure, because we do not have a standardized way (at SYN/SYNACK time)
to advertise that the stack is not 10 years old.

This could be a per net-ns sysctl, or a per socket flag, or a per cgroup flag.

In our case, we do negotiate special TCP options, and allow these options
only from internal communications.

(So we store this private bit in the socket itself)
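
To make the idea concrete, here is a minimal user-space sketch of the
decision in the pseudo code above, with the private bit modeled as a
per-connection flag. It is not kernel code: struct ack_state, the
peer_needs_dupacks field and ack_snd_check() are hypothetical names
invented for illustration, and FASTRETRANS_THRESH only mirrors the value
of TCP_FASTRETRANS_THRESH (3).

```c
#include <assert.h>
#include <stdbool.h>

#define FASTRETRANS_THRESH 3	/* mirrors TCP_FASTRETRANS_THRESH */

enum ack_action { ACK_SEND_NOW, ACK_COMPRESS };

/* Hypothetical per-connection state; not the real struct tcp_sock. */
struct ack_state {
	unsigned int dup_ack_counter;
	unsigned int compressed_ack;
	bool peer_needs_dupacks;	/* assumed flag: peer uses the old 3-DUPACK rule */
};

/* Decide what to do with one out-of-order-triggered ACK. */
static enum ack_action ack_snd_check(struct ack_state *st)
{
	assert(st);

	if (st->dup_ack_counter < FASTRETRANS_THRESH) {
		st->dup_ack_counter++;
		/* Only old-rule peers need the first DUPACKs immediately. */
		if (st->peer_needs_dupacks)
			return ACK_SEND_NOW;
	}
	/* Otherwise the ACK is folded into the compressed-ACK timer path. */
	st->compressed_ack++;
	return ACK_COMPRESS;
}
```

With peer_needs_dupacks set, the first three calls return ACK_SEND_NOW
and later ones are compressed. With it cleared, every call is compressed
from the start. Where the flag comes from (sysctl, socket option, cgroup,
negotiated TCP option) is left open here.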

>
>
> Coincidentally, speaking of sending SACKs, my initial testing was on
> 5.12 kernels and there I saw what appeared to a lay person (me) like
> missing ACKs. Receiver would receive segments:
>
> _AB_C_D_E
>
> where _ indicates loss. It'd SACK A, then generate the next SACK after E
> (SACKing C D E), sender would rexmit A which makes receiver ACK all
> the way to the end of B. Now sender thinks B arrived after CDE because
> it was never sacked.
>
> Perhaps it was fixed by commit a29cb6914681 ("net: tcp better handling
> of reordering then loss cases").. or it's a result of some out-of-tree
> hack. I thought I'd mention it tho in case it immediately rings a bell
> for anyone.

Could all the missing SACKs have been lost?

Writing a packetdrill test for this scenario should not be too hard.
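
Before writing the packetdrill script, the expected receiver state for
the _AB_C_D_E pattern can be worked out with a toy model. The sketch
below is an assumption-laden illustration, not TCP stack code: segments
are numbered 0..8 by their position in the pattern, all names are
hypothetical, and it ignores the limit of 3 or 4 SACK blocks per ACK
that a real receiver has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NSEGS 9			/* positions in the pattern "_AB_C_D_E" */
#define MAX_BLOCKS NSEGS

struct sack_block { int start, end; };	/* [start, end) in segment units */

struct rcv_state { bool have[NSEGS]; };

/* Cumulative ACK point: index of the first segment not yet received. */
static int cum_ack(const struct rcv_state *rs)
{
	int i = 0;

	while (i < NSEGS && rs->have[i])
		i++;
	return i;
}

/* Collect contiguous received ranges above the cumulative ACK point. */
static size_t sack_blocks(const struct rcv_state *rs, struct sack_block *out)
{
	size_t n = 0;
	int i = cum_ack(rs);

	while (i < NSEGS) {
		if (!rs->have[i]) {
			i++;
			continue;
		}
		out[n].start = i;
		while (i < NSEGS && rs->have[i])
			i++;
		out[n].end = i;
		n++;
	}
	return n;
}

/* Mark one segment as received. */
static void receive_seg(struct rcv_state *rs, int seg)
{
	assert(seg >= 0 && seg < NSEGS);
	rs->have[seg] = true;
}
```

Feeding segments 1, 2, 4, 6, 8 (A, B, C, D, E) leaves cum_ack() at 0
with blocks [1,3) [4,5) [6,7) [8,9). Once the retransmission of the
first hole (segment 0) arrives, cum_ack() jumps to 3, the end of B,
which is the jump described in the quoted report.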