netdev - Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().

Message-ID: <20240716192320.54815-1-kuniyu@amazon.com>
Date: Tue, 16 Jul 2024 12:23:20 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <matttbe@...nel.org>
CC: <davem@...emloft.net>, <dsahern@...nel.org>, <edumazet@...gle.com>,
	<kuba@...nel.org>, <kuni1840@...il.com>, <kuniyu@...zon.com>,
	<netdev@...r.kernel.org>, <pabeni@...hat.com>
Subject: Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().

Hi Matthieu,

From: Matthieu Baerts <matttbe@...nel.org>
Date: Mon, 15 Jul 2024 17:58:49 +0200
> Hi Kuniyuki,
> 
> On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
> > RFC 9293 states that in the case of simultaneous connect(), the connection
> > gets established when SYN+ACK is received. [0]
> > 
> >       TCP Peer A                                       TCP Peer B
> > 
> >   1.  CLOSED                                           CLOSED
> >   2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...
> >   3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT
> >   4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
> >   5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
> >   6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
> >   7.               ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED
> > 
> > However, since commit 0c24604b68fc ("tcp: implement RFC 5961 4.2"), such a
> > SYN+ACK is dropped in tcp_validate_incoming() and responded with Challenge
> > ACK.
> > 
> > For example, the write() syscall in the following packetdrill script fails
> > with -EAGAIN, and wrong SNMP stats get incremented.
> > 
> >    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> >   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> > 
> >   +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> >   +0 < S  0:0(0) win 1000 <mss 1000>
> >   +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> >   +0 < S. 0:0(0) ack 1 win 1000
> > 
> >   +0 write(3, ..., 100) = 100
> >   +0 > P. 1:101(100) ack 1
> > 
> >   --
> > 
> >   # packetdrill cross-synack.pkt
> >   cross-synack.pkt:13: runtime error in write call: Expected result 100 but got -1 with errno 11 (Resource temporarily unavailable)
> >   # nstat
> >   ...
> >   TcpExtTCPChallengeACK           1                  0.0
> >   TcpExtTCPSYNChallenge           1                  0.0
> > 
> > The problem is that bpf_skops_established() is triggered by the Challenge
> > ACK instead of SYN+ACK.  This causes the bpf prog to miss the chance to
> > check if the peer supports a TCP option that is expected to be exchanged
> > in SYN and SYN+ACK.
> > 
> > Let's accept a bare SYN+ACK for active-open TCP_SYN_RECV sockets to avoid
> > such a situation.
> > 
> > Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
> > send an unnecessary ACK, but this could be a bit risky for net.git, so this
> > targets for net-next.
> > 
> > Link: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 [0]
> > Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
> 
> Thank you for having worked on this patch!
> 
> > ---
> >  net/ipv4/tcp_input.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 47dacb575f74..1eddb6b9fb2a 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >  	 * RFC 5961 4.2 : Send a challenge ack
> >  	 */
> >  	if (th->syn) {
> > +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> > +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> > +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> > +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> > +			goto pass;
> >  syn_challenge:
> >  		if (syn_inerr)
> >  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> > @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >  		goto discard;
> >  	}
> >  
> > +pass:
> >  	bpf_skops_parse_hdr(sk, skb);
> >  
> >  	return true;
> > @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >  		tcp_fast_path_on(tp);
> >  		if (sk->sk_shutdown & SEND_SHUTDOWN)
> >  			tcp_shutdown(sk, SEND_SHUTDOWN);
> > +
> > +		if (sk->sk_socket)
> > +			goto consume;
>
> It looks like this modification changes the behaviour for MPTCP Join
> requests for listening sockets: when receiving the 3rd ACK of a request
> adding a new path (MP_JOIN), sk->sk_socket will be set, and point to the
> MPTCP sock that has been created when the MPTCP connection got created
> before with the first path.

Thanks for catching this!

I completely missed that MPTCP sets sk->sk_socket before the 3rd ACK is
processed.  I debugged a bit and confirmed that mptcp_stream_accept() sets
the inflight subflow's sk->sk_socket to newsk->sk_socket.


> This new 'goto' here will then skip the
> process of the segment text (step 7) and not go through tcp_data_queue()
> where the MPTCP options are validated, and some actions are triggered,
> e.g. sending the MPJ 4th ACK [1].
> 
> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> delayed,

Yes, the test failure depends on timing.  I could only reproduce it by
running the test many times on QEMU without KVM.


> but it looks like it affects the MPTFO feature as well --
> probably in case of retransmissions I suppose -- and being the reason
> why the selftests started to be unstable the last few days [2].
> 
> [1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
> [2]
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh
> 
> 
> Looking at what this patch here is trying to fix, I wonder if it would
> not be enough to apply this patch:
> 
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ff9ab3d01ced..ff981d7776c3 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >  
> > -               if (sk->sk_socket)
> > +               if (sk->sk_socket && !sk_is_mptcp(sk))
> >                         goto consume;
> >                 break;
> >  
> 
> But I still need to investigate how the issue that is being addressed by
> your patch can be translated to the MPTCP case. I guess we could add
> additional checks for MPTCP: new connection or additional path? etc. Or
> maybe that's not needed.

My original intention was only to avoid dropping SYN+ACK in
tcp_validate_incoming().  The goto was added in v2 to be more compliant
with the RFC by not sending an unnecessary ACK for simultaneous connect().

So, we could rewrite the condition like this,

  if (sk->sk_socket && th->syn)

but I think your patch is better because it gives a hint that MPTCP has
different logic.

Also, a similar check is done before the goto.  Could this be improved
as well?

  if (sk->sk_socket)
    sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);


Thanks!