netdev - Re: [PATCH net] net: tcp: don't allocate fast clones for fastopen SYN


Message-ID: <20210303160715.2333d0ca@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Wed, 3 Mar 2021 16:07:15 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Eric Dumazet <edumazet@...gle.com>,
        David Miller <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        kernel-team <kernel-team@...com>, Neil Spring <ntspring@...com>,
        Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH net] net: tcp: don't allocate fast clones for fastopen
 SYN

On Wed, 3 Mar 2021 13:35:53 -0800 Alexander Duyck wrote:
> On Tue, Mar 2, 2021 at 1:37 PM Eric Dumazet <edumazet@...gle.com> wrote:
> > On Tue, Mar 2, 2021 at 7:08 AM Jakub Kicinski <kuba@...nel.org> wrote:  
> > > When receiver does not accept TCP Fast Open it will only ack
> > > the SYN, and not the data. We detect this and immediately queue
> > > the data for (re)transmission in tcp_rcv_fastopen_synack().
> > >
> > > In DC networks with very low RTT and without RFS the SYN-ACK
> > > may arrive before NIC driver reported Tx completion on
> > > the original SYN. In which case skb_still_in_host_queue()
> > > returns true and sender will need to wait for the retransmission
> > > timer to fire milliseconds later.
> > >
> > > Revert back to non-fast clone skbs, this way
> > > skb_still_in_host_queue() won't prevent the recovery flow
> > > from completing.
> > >
> > > Suggested-by: Eric Dumazet <edumazet@...gle.com>
> > > Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")  
> >
> > Hmmm, not sure if this Fixes: tag makes sense.
> >
> > Really, if we delay TX completions by say 10 ms, other parts of the
> > stack will misbehave anyway.
> >
> > Also, backporting this patch up to linux-3.19 is going to be tricky.
> >
> > The real issue here is that skb_still_in_host_queue() can give a false positive.
> >
> > I have mixed feelings here, as you can read my answer :/
> >
> > Maybe skb_still_in_host_queue() signal should not be used when a part
> > of the SKB has been received/acknowledged by the remote peer
> > (in this case the SYN part).
> >
> > Alternative is that drivers unable to TX complete their skbs in a
> > reasonable time should call skb_orphan()
> >  to avoid skb_unclone() penalties (and this skb_still_in_host_queue() issue)
> >
> > If you really want to play and delay TX completions, maybe provide a
> > way to disable skb_still_in_host_queue() globally,
> > using a static key ?  
> 
> The problem as I see it is that the original fclone isn't what we sent
> out on the wire and that is confusing things. What we sent was a SYN
> with data, but what we have now is just a data frame that hasn't been
> put out on the wire yet.

Not sure I understand why that's the key distinction here. Is it that
we're re-transmitting only part of the frame, or that the flags differ?
Is a re-transmit of half of a GSO skb also considered not the same?

To me the distinction is that the receiver has implicitly asked
us for the re-transmission. If it was requested by SACK we should 
ignore "in_queue" for the first transmission as well, even if the
skb state is identical.

> I wonder if we couldn't get away with doing something like adding a
> fourth option of SKB_FCLONE_MODIFIED that we could apply to fastopen
> skbs? That would keep skb_still_in_host_queue() from triggering as
> we would be changing the state from SKB_FCLONE_ORIG to
> SKB_FCLONE_MODIFIED for the skb we store in the retransmit queue. In
> addition if we have to clone it again and the fclone reference count
> is 1 we could reset it back to SKB_FCLONE_ORIG.

The unused value of fclone was tempting me as well :)
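
For illustration only, here is a rough user-space model of the fourth
fclone state proposed above. It is a sketch, not kernel code: every
model_* name is made up for this example, the struct is a stand-in for
the real sk_buff bookkeeping, and the "busy" test only approximates
what skb_still_in_host_queue() asks (is the companion clone of this
original still held below the stack for the same socket?).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model; NOT the kernel's struct sk_buff. */
enum model_fclone {
	MODEL_FCLONE_UNAVAILABLE,	/* skb has no companion clone */
	MODEL_FCLONE_ORIG,		/* original skb of a fast-clone pair */
	MODEL_FCLONE_CLONE,		/* companion clone handed to the driver */
	MODEL_FCLONE_MODIFIED,		/* hypothetical fourth state from this thread */
};

struct model_skb {
	enum model_fclone fclone;
	int fclone_ref;		/* 1: only the original is live, 2: clone in flight */
	const void *clone_sk;	/* socket owning the clone, if any */
};

/* Approximation of the "in queue" check: is the clone still in flight? */
static inline bool model_still_in_host_queue(const void *sk,
					     const struct model_skb *skb)
{
	return skb->fclone == MODEL_FCLONE_ORIG &&
	       skb->fclone_ref > 1 &&
	       skb->clone_sk == sk;
}

/* SYN part was acked: the queued skb no longer matches what the clone carries. */
static inline void model_mark_modified(struct model_skb *skb)
{
	if (skb->fclone == MODEL_FCLONE_ORIG)
		skb->fclone = MODEL_FCLONE_MODIFIED;
}

/* Before cloning again: go back to ORIG only once the old clone is gone. */
static inline bool model_try_reset_orig(struct model_skb *skb)
{
	assert(skb->fclone_ref >= 1);	/* the original always holds one ref */

	if (skb->fclone == MODEL_FCLONE_MODIFIED && skb->fclone_ref == 1) {
		skb->fclone = MODEL_FCLONE_ORIG;
		return true;
	}
	return false;
}
```

In this model, an skb marked MODIFIED never reports "still in host
queue", so the data could be queued for (re)transmission even while the
driver has not yet completed the original SYN.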

AFAICT we have at least these options:

1 - don't use a fclone skb [v1]

2 - mark the fclone as "special" at Tx to escape the "in queue" check

3 - indicate to retransmit that we're sure initial tx is out [v2]

4 - same as above but with a bool / flag instead of negative seg

5 - use the fclone bits but mark them at Rx when we see a rtx request

6 - check the skb state in retransmit to match the TFO case (IIUC
    Yuchung's suggestion)
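
As a hedged sketch of what option 4 could look like, the toy model
below passes an explicit flag into the retransmit path. The names
(rtx_model_skb, rtx_model_retransmit, peer_requested) are hypothetical,
and the single bool stands in for whatever the real "in queue" check
computes. The only assumption carried over from the real stack is that
a retransmit attempt is refused with -EBUSY while the skb is still
considered to be in the host queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a queued skb; NOT the kernel's struct sk_buff. */
struct rtx_model_skb {
	bool clone_in_host_queue;	/* Tx completion not yet reported */
};

/*
 * Returns 0 if the retransmit would go ahead, -EBUSY if it is deferred.
 * peer_requested is the hypothetical flag: the caller knows the peer has
 * already seen the first transmission, so the "in queue" signal can only
 * be a false positive and is skipped.
 */
static inline int rtx_model_retransmit(const struct rtx_model_skb *skb,
				       bool peer_requested)
{
	assert(skb);

	if (!peer_requested && skb->clone_in_host_queue)
		return -EBUSY;

	return 0;
}
```

With the flag set, the Fast Open recovery path would not have to wait
for the retransmission timer; every other caller keeps the existing
protection against spurious retransmits.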

#5 is my favorite but I didn't know how to extend it to fast
re-transmits so I just stuck to the suggestion from the ML :)

WDYT? Eric, Yuchung?