[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89i+ZK6+bPQds2fbff=-ojJ=W=czUvrWPyOCTno=qO6yzDQ@mail.gmail.com>
Date: Tue, 9 May 2023 15:42:45 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, David Ahern <dsahern@...nel.org>,
Mubashir Adnan Qureshi <mubashirq@...gle.com>, Neal Cardwell <ncardwell@...gle.com>,
Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org, Jon Zobrist <zob@...zon.com>
Subject: Re: [PATCH v1 net-next] tcp: Add net.ipv4.tcp_reset_challenge.
On Tue, May 9, 2023 at 3:34 PM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On Mon, 2023-05-08 at 15:27 -0700, Kuniyuki Iwashima wrote:
> > Our Network Load Balancer (NLB) [0] consists of multiple nodes with unique
> > IP addresses. These nodes forward TCP flows from clients to backend
> > targets by modifying the destination IP address. NLB offers an option [1]
> > to preserve the client's source IP address and port when routing packets
> > to backend targets.
> >
> > When a client connects to two different NLB nodes, they may select the same
> > backend target. If the client uses the same source IP and port, the two
> > flows at the backend side will have the same 4-tuple.
> >
> > +---------------+
> > 1st flow | NLB Node #1 | src: 10.0.0.215:60000
> > +------------> | 10.0.3.4 | +------------+
> > | | :10000 | |
> > + +---------------+ v
> > +------------+ +------------+
> > | Client | | Target |
> > | 10.0.0.215 | | 10.0.3.249 |
> > | :60000 | | :10000 |
> > +------------+ +------------+
> > + +---------------+ ^
> > | | NLB Node #2 | |
> > +------------> | 10.0.4.62 | +------------+
> > 2nd flow | :10000 | src: 10.0.0.215:60000
> > +---------------+
> >
> > The kernel responds to the SYN of the 2nd flow with Challenge ACK. In this
> > situation, there are multiple valid reply paths, but the flows behind NLB
> > are tracked to ensure symmetric routing [2]. So, the Challenge ACK is
> > routed back to the 2nd NLB node.
> >
> > The 2nd NLB node forwards the Challenge ACK to the client, but the client
> > sees it as an invalid response to SYN in tcp_rcv_synsent_state_process()
> > and finally sends RST in tcp_v[46]_do_rcv() based on the sequence number
> > by tcp_v[46]_send_reset(). The RST effectively closes the first connection
> > on the target, and a retransmitted SYN successfully establishes the 2nd
> > connection.
> >
> > On client:
> > 10.0.0.215.60000 > 10.0.3.4.10000: Flags [S], seq 772948343 ... via NLB Node #1
> > 10.0.3.4.10000 > 10.0.0.215.60000: Flags [S.], seq 3739044674, ack 772948344
> > 10.0.0.215.60000 > 10.0.3.4.10000: Flags [.], ack 3739044675
> >
> > 10.0.0.215.60000 > 10.0.4.62.10000: Flags [S], seq 248180743 ... via NLB Node #2
> > 10.0.4.62.10000 > 10.0.0.215.60000: Flags [.], ack 772948344 ... Invalid Challenge ACK
> > 10.0.0.215.60000 > 10.0.4.62.10000: Flags [R], seq 772948344 ... RST w/ correct seq #
> > 10.0.0.215.60000 > 10.0.4.62.10000: Flags [S], seq 248180743
> > 10.0.4.62.10000 > 10.0.0.215.60000: Flags [S.], seq 4160908213, ack 248180744
> > 10.0.0.215.60000 > 10.0.4.62.10000: Flags [.], ack 4160908214
> >
> > On target:
> > 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 772948343 ... via NLB Node #1
> > 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3739044674, ack 772948344
> > 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 3739044675
> >
> > 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 248180743 ... via NLB Node #2
> > 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 772948344 ... Forwarded to 2nd flow
> > 10.0.0.215.60000 > 10.0.3.249.10000: Flags [R], seq 772948344 ... Close the 1st connection
> > 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 248180743
> > 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 4160908213, ack 248180744
> > 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 4160908214
> >
> > The first connection is still alive from the client's point of view. When
> > the client sends data over the first connection, the target responds with
> > Challenge ACK. The Challenge ACK is routed back to the 1st connection, and
> > the client responds with Dup ACK, and the target responds to the Dup ACK
> > with Challenge ACK, and this continues.
> >
> > On client:
> > 10.0.0.215.60000 > 10.0.3.4.10000: Flags [P.], seq 772948344:772948349, ack 3739044675, length 5
> > 10.0.3.4.10000 > 10.0.0.215.60000: Flags [.], ack 248180744, length 0 ... Challenge ACK
> > 10.0.0.215.60000 > 10.0.3.4.10000: Flags [.], ack 3739044675, length 0 ... Dup ACK
> > 10.0.3.4.10000 > 10.0.0.215.60000: Flags [.], ack 248180744, length 0 ... Challenge ACK
> > ...
> >
> > In RFC 5961, Challenge ACK assumes that it will be routed back via an
> > asymmetric path to the peer of the established connection. However, in
> > a situation where multiple valid reply paths are tracked, Challenge ACK
> > gives a hint to snipe another connection and also triggers the Challenge
> > ACK Dup ACK war on the connection.
> >
> > A new sysctl knob, net.ipv4.tcp_reset_challenge, allows us to respond to
> > invalid packets described in RFC 5961 with RST and keep the established
> > socket open.
>
> I did not double check with the RFC, but the above looks like a knob to
> enable a protocol violation.
>
> I'm wondering if the same results could be obtained with a BPF program
> instead?
>
> IMHO we should avoid adding system wide knobs for such specific use-
> case, especially when the controlled behaviour is against the spec.
>
Agreed, this patch looks quite suspect to me.
We will then add many more knobs for other similar situations.
Network Load Balancers can be tricky to implement right.
We should not have to tweak many TCP stacks just because of one implementation.
Maglev (one of the load balancers used at Google) never asked for a
modification of a TCP stack.
I think such a change would need IETF discussion and approval first.
Powered by blists - more mailing lists