Date:   Wed, 18 Apr 2018 10:13:02 +0200 (CEST)
From:   Jozsef Kadlecsik <kadlec@...ckhole.kfki.hu>
To:     Florian Westphal <fw@...len.de>
cc:     Michal Kubecek <mkubecek@...e.cz>, netdev@...r.kernel.org,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: tcp hang when socket fills up ?

Hi,

On Tue, 17 Apr 2018, Florian Westphal wrote:

> Dominique Martinet <asmadeus@...ewreck.org> wrote:
> 
> [ CC Jozsef ]
> 
> > Could it have something to do with the way I set up the connection?
> > I don't think the "both remotes call connect() with carefully selected
> > source/dest port" is a very common case..
> > 
> > If you look at the tcpdump outputs I attached the sequence usually is
> > something like
> >  server > client SYN
> >  client > server SYN
> >  server > client SYNACK
> >  client > server ACK
> > 
> > ultimately it IS a connection, but with an extra SYN packet in front of
> > it (that first SYN opens up the conntrack of the nat so that the
> > client's syn can come in, the client's conntrack will be that of a
> > normal connection since its first SYN goes in directly after the
> > server's (it didn't see the server's SYN))
> > 
> > Looking at my logs again, I'm seeing the same as you:
> > 
> > This looks like the actual SYN/SYN/SYNACK/ACK:
> >  - 14.364090 seq=505004283 likely SYN coming out of server
> >  - 14.661731 seq=1913287797 on next line it says receiver
> > end=505004284 so likely the matching SYN from client
> > Which this time gets a proper SYNACK from server:
> > 14.662020 seq=505004283 ack=1913287798
> > And following final dataless ACK:
> > 14.687570 seq=1913287798 ack=505004284
> > 
> > Then as you point out some data ACK, where the scale poofs:
> > 14.688762 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> > 14.688793 tcp_in_window: sender end=1913287798 maxend=1913316998 maxwin=29312 scale=7 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> > 14.688824 tcp_in_window: 
> > 14.688852 seq=1913287798 ack=505004284+(0) sack=505004284+(0) win=229 end=1913287819
> > 14.688882 tcp_in_window: sender end=1913287819 maxend=1913287819 maxwin=229 scale=0 receiver end=505004284 maxend=505033596 maxwin=29200 scale=7
> >
> > As you say, only tcp_options() will clear just one side of the scales.
> > We don't have sender->td_maxwin == 0 (printed) so I see no other way
> > than we are in the last else if:
> >  - we have after(end, sender->td_end) (end=1913287819 > sender
> > end=1913287798)
> >  - I assume the tcp state machine must be confused because of the
> > SYN/SYN/SYNACK/ACK pattern and we probably enter the next check, 
> > but since this is a data packet it doesn't have the tcp option for scale
> > thus scale resets.
> 
> Yes, this looks correct. Jozsef, can you please have a look?
> 
> Problem seems to be that conntrack believes that ACK packet
> re-initializes the connection:
> 
>                 /*
>                  * RFC 793: "if a TCP is reinitialized ... then it need
>                  * not wait at all; it must only be sure to use sequence
>                  * numbers larger than those recently used."
>                  */
>                 sender->td_end =
>                 sender->td_maxend = end;
>                 sender->td_maxwin = (win == 0 ? 1 : win);
> 
>                 tcp_options(skb, dataoff, tcph, sender);
> 
> and last line clears the scale value (no wscale option in data packet).
> 
> 
> Transitions are:
>  server > client SYN          sNO -> sSS
>  client > server SYN          sSS -> sS2
>  server > client SYNACK       sS2 -> sSR /* here */
>  client > server ACK          sSR -> sES
> 
> SYN/ACK was observed in original direction so we hit
> state->state == TCP_CONNTRACK_SYN_RECV && dir == IP_CT_DIR_REPLY test
> when we see the ack packet and end up in the 'TCP is reinitialized' branch.
> 
> AFAICS, without this, connection would move to sES just fine,
> as the data ack is in window.

Yes, the state transition is wrong for simultaneous open, because the 
tcp_conntracks table is not (cannot be) smart enough. Could you test the 
following patch? It is untested.

diff --git a/include/uapi/linux/netfilter/nf_conntrack_tcp.h b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
index 74b9115..bcba72d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_tcp.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_tcp.h
@@ -46,6 +46,9 @@ enum tcp_conntrack {
 /* Marks possibility for expected RFC5961 challenge ACK */
 #define IP_CT_EXP_CHALLENGE_ACK 		0x40
 
+/* Simultaneous open initialized */
+#define IP_CT_TCP_SIMULTANEOUS_OPEN		0x80
+
 struct nf_ct_tcp_flags {
 	__u8 flags;
 	__u8 mask;
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index e97cdc1..8e67910 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -981,6 +981,17 @@ static int tcp_packet(struct nf_conn *ct,
 			return NF_ACCEPT; /* Don't change state */
 		}
 		break;
+	case TCP_CONNTRACK_SYN_SENT2:
+		/* tcp_conntracks table is not smart enough to handle
+		 * simultaneous open.
+		 */
+		ct->proto.tcp.last_flags |= IP_CT_TCP_SIMULTANEOUS_OPEN;
+		break;
+	case TCP_CONNTRACK_SYN_RECV:
+		if (dir == IP_CT_DIR_REPLY && index == TCP_ACK_SET &&
+		    ct->proto.tcp.last_flags & IP_CT_TCP_SIMULTANEOUS_OPEN)
+			new_state = TCP_CONNTRACK_ESTABLISHED;
+		break;
 	case TCP_CONNTRACK_CLOSE:
 		if (index == TCP_RST_SET
 		    && (ct->proto.tcp.seen[!dir].flags & IP_CT_TCP_FLAG_MAXACK_SET)

Best regards,
Jozsef
-
E-mail  : kadlec@...ckhole.kfki.hu, kadlecsik.jozsef@...ner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary
