netdev - Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20240716202808.6376-1-kuniyu@amazon.com>
Date: Tue, 16 Jul 2024 13:28:08 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <matttbe@...nel.org>
CC: <davem@...emloft.net>, <dsahern@...nel.org>, <edumazet@...gle.com>,
	<kuba@...nel.org>, <kuni1840@...il.com>, <kuniyu@...zon.com>,
	<netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().

From: Matthieu Baerts <matttbe@...nel.org>
Date: Tue, 16 Jul 2024 22:04:25 +0200
> Hi Kuniyuki,
> 
> Thank you for your reply!
> 
> On 16/07/2024 21:23, Kuniyuki Iwashima wrote:
> > Hi Matthieu,
> > 
> > From: Matthieu Baerts <matttbe@...nel.org>
> > Date: Mon, 15 Jul 2024 17:58:49 +0200
> >> Hi Kuniyuki,
> >>
> >> On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
> >>> RFC 9293 states that in the case of simultaneous connect(), the connection
> >>> gets established when SYN+ACK is received. [0]
> >>>
> >>>       TCP Peer A                                       TCP Peer B
> >>>
> >>>   1.  CLOSED                                           CLOSED
> >>>   2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...
> >>>   3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT
> >>>   4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
> >>>   5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
> >>>   6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
> >>>   7.               ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED
> >>>
> >>> However, since commit 0c24604b68fc ("tcp: implement RFC 5961 4.2"), such a
> >>> SYN+ACK is dropped in tcp_validate_incoming() and responded with Challenge
> >>> ACK.
> >>>
> >>> For example, the write() syscall in the following packetdrill script fails
> >>> with -EAGAIN, and wrong SNMP stats get incremented.
> >>>
> >>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> >>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>>
> >>>   +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> >>>   +0 < S  0:0(0) win 1000 <mss 1000>
> >>>   +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> >>>   +0 < S. 0:0(0) ack 1 win 1000
> >>>
> >>>   +0 write(3, ..., 100) = 100
> >>>   +0 > P. 1:101(100) ack 1
> >>>
> >>>   --
> >>>
> >>>   # packetdrill cross-synack.pkt
> >>>   cross-synack.pkt:13: runtime error in write call: Expected result 100 but got -1 with errno 11 (Resource temporarily unavailable)
> >>>   # nstat
> >>>   ...
> >>>   TcpExtTCPChallengeACK           1                  0.0
> >>>   TcpExtTCPSYNChallenge           1                  0.0
> >>>
> >>> The problem is that bpf_skops_established() is triggered by the Challenge
> >>> ACK instead of SYN+ACK.  This causes the bpf prog to miss the chance to
> >>> check if the peer supports a TCP option that is expected to be exchanged
> >>> in SYN and SYN+ACK.
> >>>
> >>> Let's accept a bare SYN+ACK for active-open TCP_SYN_RECV sockets to avoid
> >>> such a situation.
> >>>
> >>> Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
> >>> send an unnecessary ACK, but this could be a bit risky for net.git, so this
> >>> targets for net-next.
> >>>
> >>> Link: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 [0]
> >>> Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> >>
> >> Thank you for having worked on this patch!
> >>
> >>> ---
> >>>  net/ipv4/tcp_input.c | 9 +++++++++
> >>>  1 file changed, 9 insertions(+)
> >>>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index 47dacb575f74..1eddb6b9fb2a 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >>>  	 * RFC 5961 4.2 : Send a challenge ack
> >>>  	 */
> >>>  	if (th->syn) {
> >>> +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> >>> +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> >>> +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> >>> +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> >>> +			goto pass;
> >>>  syn_challenge:
> >>>  		if (syn_inerr)
> >>>  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> >>> @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >>>  		goto discard;
> >>>  	}
> >>>  
> >>> +pass:
> >>>  	bpf_skops_parse_hdr(sk, skb);
> >>>  
> >>>  	return true;
> >>> @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>>  		tcp_fast_path_on(tp);
> >>>  		if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>  			tcp_shutdown(sk, SEND_SHUTDOWN);
> >>> +
> >>> +		if (sk->sk_socket)
> >>> +			goto consume;
> >>
> >> It looks like this modification changes the behaviour for MPTCP Join
> >> requests for listening sockets: when receiving the 3rd ACK of a request
> >> adding a new path (MP_JOIN), sk->sk_socket will be set, and point to the
> >> MPTCP sock that has been created when the MPTCP connection got created
> >> before with the first path.
> > 
> > Thanks for catching this!
> > 
> > I completely missed how MPTCP sets sk->sk_socket before the 3rd ACK is
> > processed.
> 
> No problem. That's a shame there was not a clear error in the selftests :)
> 
> > I debugged a bit and confirmed mptcp_stream_accept() sets
> > the inflight subflow's sk->sk_socket with newsk->sk_socket.
> 
> Yes, that's correct.
> 
> >> This new 'goto' here will then skip the
> >> process of the segment text (step 7) and not go through tcp_data_queue()
> >> where the MPTCP options are validated, and some actions are triggered,
> >> e.g. sending the MPJ 4th ACK [1].
> >>
> >> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> >> delayed,
> > 
> > Yes, the test failure depends on timing.  I only reproduced it by running
> > the test many times on non-kvm qemu.
> 
> Thank you for having checked!
> 
> >> but it looks like it affects the MPTFO feature as well --
> >> probably in case of retransmissions I suppose -- and being the reason
> >> why the selftests started to be unstable the last few days [2].
> >>
> >> [1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
> >> [2]
> >> https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh
> >>
> >>
> >> Looking at what this patch here is trying to fix, I wonder if it would
> >> not be enough to apply this patch:
> >>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index ff9ab3d01ced..ff981d7776c3 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >>>  
> >>> -               if (sk->sk_socket)
> >>> +               if (sk->sk_socket && !sk_is_mptcp(sk))
> >>>                         goto consume;
> >>>                 break;
> >>>  
> >>
> >> But I still need to investigate how the issue that is being addressed by
> >> your patch can be translated to the MPTCP case. I guess we could add
> >> additional checks for MPTCP: new connection or additional path? etc. Or
> >> maybe that's not needed.
> > 
> > My first intention was not to drop SYN+ACK in tcp_validate_incoming(),
> > and the goto is added in v2, which is rather to be more compliant with
> > RFC not to send an unnecessary ACK for simultaneous connect().
> > 
> > So, we could rewrite the condition as this,
> > 
> >   if (sk->sk_socket && !th->syn)
> 
> (Just to be sure, do you mean the opposite with th->syn?)

Ah, yes :)


> 
>   if (sk->sk_socket && th->syn)
>       goto consume;
> 
> That's a good idea!
> 
> I sent my patch a couple of minutes ago [1], then I saw your suggestion
> here. It looks like it should work for the TFO case as well. Maybe your
> suggestion is more generic and will cover more cases?

Likely, I was just trying to avoid unnecessary change and make the effect
minimal.


> 
> [1]
> https://lore.kernel.org/all/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org/
> 
> > but I think your patch is better to give a hint that MPTCP has a
> > different logic.
> 
> Because TFO has also a different logic, it might be good to have a clear
> comment about that.

Exactly, TFO could be accept()ed before receiving ACK, and then
we must not drop ACK w/ data.


> 
> > Also, a similar check done before the goto, and this could be
> > improved ?
> > 
> >   if (sk->sk_socket)
> >     sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
> 
> It is a bit late for me, but I think it can be kept as it is: for MPTCP,
> it will not wake up the userspace as the subflow is managed by the
> kernel. I would need to check if we could avoid that. Also, will this
> wakeup not be useful for TFO?

Yes, I think it's not necessary for TFO.