Message-ID: <20230508222736.13249-1-kuniyu@amazon.com>
Date: Mon, 8 May 2023 15:27:36 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: "David S. Miller" <davem@...emloft.net>, Eric Dumazet
<edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
<pabeni@...hat.com>, David Ahern <dsahern@...nel.org>
CC: Mubashir Adnan Qureshi <mubashirq@...gle.com>, Neal Cardwell
<ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>, "Kuniyuki
Iwashima" <kuni1840@...il.com>, <netdev@...r.kernel.org>, Jon Zobrist
<zob@...zon.com>
Subject: [PATCH v1 net-next] tcp: Add net.ipv4.tcp_reset_challenge.
Our Network Load Balancer (NLB) [0] consists of multiple nodes with unique
IP addresses. These nodes forward TCP flows from clients to backend
targets by modifying the destination IP address. NLB offers an option [1]
to preserve the client's source IP address and port when routing packets
to backend targets.
When a client connects to two different NLB nodes, they may select the same
backend target. If the client uses the same source IP and port, the two
flows at the backend side will have the same 4-tuple.
                        +---------------+
            1st flow    |  NLB Node #1  |   src: 10.0.0.215:60000
         +------------> |   10.0.3.4    | +------------+
         |              |    :10000     |              |
         +              +---------------+              v
  +------------+                                 +------------+
  |   Client   |                                 |   Target   |
  | 10.0.0.215 |                                 | 10.0.3.249 |
  |   :60000   |                                 |   :10000   |
  +------------+                                 +------------+
         +              +---------------+              ^
         |              |  NLB Node #2  |              |
         +------------> |   10.0.4.62   | +------------+
            2nd flow    |    :10000     |   src: 10.0.0.215:60000
                        +---------------+
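As a rough illustration of the collision, the sketch below models an NLB
node that rewrites only the destination IP address, then compares the two
flows as the target sees them. struct flow_key, nlb_forward() and
same_4tuple() are hypothetical names made up for this example; they are
not kernel or NLB interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flow key for illustration only; not a kernel structure. */
struct flow_key {
	char src_ip[16];
	uint16_t src_port;
	char dst_ip[16];
	uint16_t dst_port;
};

/* Model of an NLB node with client IP preservation: only the destination
 * address is rewritten to the target, while the source IP/port and the
 * destination port are left untouched.
 */
static struct flow_key nlb_forward(struct flow_key in, const char *target_ip)
{
	assert(strlen(target_ip) < sizeof(in.dst_ip));
	strcpy(in.dst_ip, target_ip);
	return in;
}

/* True if both flows would map to the same socket on the receiver. */
static bool same_4tuple(const struct flow_key *a, const struct flow_key *b)
{
	return a->src_port == b->src_port && a->dst_port == b->dst_port &&
	       !strcmp(a->src_ip, b->src_ip) && !strcmp(a->dst_ip, b->dst_ip);
}
```

With the addresses from the diagram, the two flows differ on the client
side (10.0.3.4 vs 10.0.4.62) but compare equal once both have been passed
through nlb_forward(..., "10.0.3.249").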
Because the 1st flow is already established on the target, the kernel
responds to the SYN of the 2nd flow with a Challenge ACK. There are
multiple valid reply paths in this situation, but flows behind NLB are
tracked to ensure symmetric routing [2], so the Challenge ACK is routed
back to the 2nd NLB node.
The 2nd NLB node forwards the Challenge ACK to the client. The client
sees it as an invalid response to its SYN in
tcp_rcv_synsent_state_process() and sends an RST from tcp_v[46]_do_rcv()
via tcp_v[46]_send_reset(), which takes the RST's sequence number from
the ACK number of the Challenge ACK. That sequence number is valid for
the first connection, so the RST closes the first connection on the
target, and a retransmitted SYN then establishes the 2nd connection.
On client:
10.0.0.215.60000 > 10.0.3.4.10000: Flags [S], seq 772948343 ... via NLB Node #1
10.0.3.4.10000 > 10.0.0.215.60000: Flags [S.], seq 3739044674, ack 772948344
10.0.0.215.60000 > 10.0.3.4.10000: Flags [.], ack 3739044675
10.0.0.215.60000 > 10.0.4.62.10000: Flags [S], seq 248180743 ... via NLB Node #2
10.0.4.62.10000 > 10.0.0.215.60000: Flags [.], ack 772948344 ... Invalid Challenge ACK
10.0.0.215.60000 > 10.0.4.62.10000: Flags [R], seq 772948344 ... RST w/ correct seq #
10.0.0.215.60000 > 10.0.4.62.10000: Flags [S], seq 248180743
10.0.4.62.10000 > 10.0.0.215.60000: Flags [S.], seq 4160908213, ack 248180744
10.0.0.215.60000 > 10.0.4.62.10000: Flags [.], ack 4160908214
On target:
10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 772948343 ... via NLB Node #1
10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3739044674, ack 772948344
10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 3739044675
10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 248180743 ... via NLB Node #2
10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 772948344 ... Forwarded to 2nd flow
10.0.0.215.60000 > 10.0.3.249.10000: Flags [R], seq 772948344 ... Close the 1st connection
10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 248180743
10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 4160908213, ack 248180744
10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 4160908214
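The sequence number of the RST above follows from the reset generation
rules of RFC 793. The sketch below is a simplified user-space model of
those rules, not the code of tcp_v[46]_send_reset(); struct seg and
make_reset() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a TCP segment; not struct tcphdr. */
struct seg {
	bool syn, ack, rst;
	uint32_t seq;
	uint32_t ack_seq;
	uint32_t len;	/* payload length in bytes */
};

/* Build the RST sent in reply to an unacceptable segment, following the
 * reset generation rules of RFC 793.
 */
static struct seg make_reset(const struct seg *in)
{
	struct seg rst = { .rst = true };

	assert(!in->rst);	/* a RST is never answered with a RST */

	if (in->ack) {
		/* Incoming segment carries ACK: the RST uses SEG.ACK as seq. */
		rst.seq = in->ack_seq;
	} else {
		/* No ACK: seq is 0 and the RST acknowledges the segment.
		 * SYN (and FIN, omitted here) counts as one sequence number.
		 */
		rst.ack = true;
		rst.ack_seq = in->seq + in->len + (in->syn ? 1 : 0);
	}
	return rst;
}
```

Fed with the Challenge ACK from the trace (ack 772948344), make_reset()
yields a RST with seq 772948344, which is exactly the next sequence number
the target expects on the first connection.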
The first connection is still alive from the client's point of view. When
the client sends data over it, the target, which now has only the 2nd
connection for this 4-tuple, responds with a Challenge ACK. That Challenge
ACK is routed back via the 1st NLB node, the client answers it with a
Dup ACK, the target answers the Dup ACK with another Challenge ACK, and
the exchange keeps repeating.
On client:
10.0.0.215.60000 > 10.0.3.4.10000: Flags [P.], seq 772948344:772948349, ack 3739044675, length 5
10.0.3.4.10000 > 10.0.0.215.60000: Flags [.], ack 248180744, length 0 ... Challenge ACK
10.0.0.215.60000 > 10.0.3.4.10000: Flags [.], ack 3739044675, length 0 ... Dup ACK
10.0.3.4.10000 > 10.0.0.215.60000: Flags [.], ack 248180744, length 0 ... Challenge ACK
...
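Neither side can accept the other's segments because the two connections
use unrelated sequence spaces. The sketch below is a simplified receive
window check, loosely modelled on the kernel's tcp_sequence() but using
rcv_nxt as the left edge; seq_before() and seq_in_window() are
hypothetical helpers, and the 65535-byte window used in the note after it
is an assumption, not a value taken from the traces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe comparison: true if seq1 precedes seq2 modulo 2^32. */
static bool seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* Simplified acceptability test: the segment [seq, end_seq] must not end
 * before rcv_nxt and must not start beyond rcv_nxt + rcv_wnd.
 */
static bool seq_in_window(uint32_t seq, uint32_t end_seq,
			  uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	assert(!seq_before(end_seq, seq));	/* end must not precede start */
	return !seq_before(end_seq, rcv_nxt) &&
	       !seq_before(rcv_nxt + rcv_wnd, seq);
}
```

For example, the client's data segment 772948344:772948349 checked against
the target's 2nd connection (rcv_nxt 248180744) gives
seq_in_window(772948344, 772948349, 248180744, 65535) == false, so the
target can only answer with an ACK, and the same holds in the other
direction.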
RFC 5961 assumes that a Challenge ACK is routed back to the peer of the
established connection, possibly via a path asymmetric to the one the
invalid packet arrived on. However, where multiple valid reply paths are
tracked, the Challenge ACK is returned to the sender of the invalid
packet. It then gives that sender a hint to snipe another connection and
also triggers the Challenge ACK / Dup ACK war on that connection.
A new sysctl knob, net.ipv4.tcp_reset_challenge, allows us to respond to
invalid packets described in RFC 5961 with RST and keep the established
socket open.
After sysctl -w net.ipv4.tcp_reset_challenge=1 :
On client:
IP 10.0.0.215.60000 > 10.0.3.4.10000: Flags [S], seq 2603746121 ... via NLB Node #1
IP 10.0.3.4.10000 > 10.0.0.215.60000: Flags [S.], seq 1274876087, ack 2603746122
IP 10.0.0.215.60000 > 10.0.3.4.10000: Flags [.], ack 1274876088
IP 10.0.0.215.60000 > 10.0.4.62.10000: Flags [S], seq 1657678296 ... via NLB Node #2
IP 10.0.4.62.10000 > 10.0.0.215.60000: Flags [R.], seq 0, ack 1657678297
On target:
10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2603746121 ... via NLB Node #1
10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 1274876087, ack 2603746122
10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1274876088
10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 1657678296 ... via NLB Node #2
10.0.3.249.10000 > 10.0.0.215.60000: Flags [R.], seq 0, ack 1657678297
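For completeness, a program can check the knob through procfs, since
sysctl names map to files under /proc/sys (here
/proc/sys/net/ipv4/tcp_reset_challenge, which only exists on a kernel
with this patch applied). This is a minimal sketch; read_bool_sysctl() is
a hypothetical helper, not an existing API.

```c
#include <assert.h>
#include <stdio.h>

/* Read a 0/1 sysctl value from a procfs-style file.
 * Returns the value, or -1 if the file is missing or malformed
 * (for example on a kernel without this patch).
 */
static int read_bool_sysctl(const char *path)
{
	FILE *f;
	int val, ok;

	assert(path);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%d", &val) == 1;
	fclose(f);
	if (!ok || (val != 0 && val != 1))
		return -1;
	return val;
}
```

A caller would use read_bool_sysctl("/proc/sys/net/ipv4/tcp_reset_challenge")
and treat -1 as "knob not available". The value is per network namespace,
so the result reflects the namespace of the calling process.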
Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html [0]
Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation [1]
Link: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html#automatic-tracking [2]
Suggested-by: Jon Zobrist <zob@...zon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@...zon.com>
---
 Documentation/networking/ip-sysctl.rst | 10 ++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 +++++++++
 net/ipv4/tcp_input.c                   | 20 ++++++++++++++------
 net/ipv4/tcp_ipv4.c                    |  1 +
 5 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6ec06a33688a..fe90fdd60cc8 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1045,6 +1045,16 @@ tcp_challenge_ack_limit - INTEGER
 	TCP stack implements per TCP socket limits anyway.
 	Default: INT_MAX (unlimited)
 
+tcp_reset_challenge - BOOLEAN
+	If set, the TCP stack responds to invalid packets described in
+	RFC 5961 with RST without close()ing the TCP socket.
+
+	This feature is useful where multiple valid reply paths are
+	tracked and the response is sent back via a symmetric path to
+	the original sender of the invalid packets.
+
+	Default: 0
+
 tcp_ehash_entries - INTEGER
 	Show the number of hash buckets for TCP sockets in the current
 	networking namespace.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index db762e35aca9..c70f26f49e9e 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -131,6 +131,7 @@ struct netns_ipv4 {
 	u8 sysctl_tcp_syncookies;
 	u8 sysctl_tcp_migrate_req;
 	u8 sysctl_tcp_comp_sack_nr;
+	u8 sysctl_tcp_reset_challenge;
 	int sysctl_tcp_reordering;
 	u8 sysctl_tcp_retries1;
 	u8 sysctl_tcp_retries2;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 40fe70fc2015..e6cf1cd310bf 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1269,6 +1269,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "tcp_reset_challenge",
+		.data		= &init_net.ipv4.sysctl_tcp_reset_challenge,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "tcp_min_tso_segs",
 		.data		= &init_net.ipv4.sysctl_tcp_min_tso_segs,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a057330d6f59..18370dd7c68b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,7 +98,7 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_SET_XMIT_TIMER	0x1000 /* Set TLP or RTO timer */
 #define FLAG_SACK_RENEGING	0x2000 /* snd_una advanced to a sacked seq */
 #define FLAG_UPDATE_TS_RECENT	0x4000 /* tcp_replace_ts_recent() */
-#define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_send_challenge_ack() */
+#define FLAG_NO_CHALLENGE_ACK	0x8000 /* do not call tcp_respond_challenge() */
 #define FLAG_ACK_MAYBE_DELAYED	0x10000 /* Likely a delayed ACK */
 #define FLAG_DSACK_TLP		0x20000 /* DSACK for tail loss probe */
 
@@ -3658,6 +3658,14 @@ static void tcp_send_challenge_ack(struct sock *sk)
 	}
 }
 
+static void tcp_respond_challenge(struct sock *sk, struct sk_buff *skb)
+{
+	if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_reset_challenge))
+		sk->sk_prot->rsk_prot->send_reset(sk, skb);
+	else
+		tcp_send_challenge_ack(sk);
+}
+
 static void tcp_store_ts_recent(struct tcp_sock *tp)
 {
 	tp->rx_opt.ts_recent = tp->rx_opt.rcv_tsval;
@@ -3757,7 +3765,7 @@ static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered, int flag)
 }
 
 /* This routine deals with incoming acks, but not outgoing ones. */
-static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
+static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3788,7 +3796,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		/* RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation] */
 		if (before(ack, prior_snd_una - tp->max_window)) {
 			if (!(flag & FLAG_NO_CHALLENGE_ACK))
-				tcp_send_challenge_ack(sk);
+				tcp_respond_challenge(sk, skb);
 			return -SKB_DROP_REASON_TCP_TOO_OLD_ACK;
 		}
 		goto old_ack;
@@ -5787,7 +5795,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		if (tp->syn_fastopen && !tp->data_segs_in &&
 		    sk->sk_state == TCP_ESTABLISHED)
 			tcp_fastopen_active_disable(sk);
-		tcp_send_challenge_ack(sk);
+		tcp_respond_challenge(sk, skb);
 		SKB_DR_SET(reason, TCP_RESET);
 		goto discard;
 	}
@@ -5802,7 +5810,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		if (syn_inerr)
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNCHALLENGE);
-		tcp_send_challenge_ack(sk);
+		tcp_respond_challenge(sk, skb);
 		SKB_DR_SET(reason, TCP_INVALID_SYN);
 		goto discard;
 	}
@@ -6542,7 +6550,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 	if (!acceptable) {
 		if (sk->sk_state == TCP_SYN_RECV)
 			return 1;	/* send one RST */
-		tcp_send_challenge_ack(sk);
+		tcp_respond_challenge(sk, skb);
 		SKB_DR_SET(reason, TCP_OLD_ACK);
 		goto discard;
 	}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 39bda2b1066e..fcd809d11e46 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3236,6 +3236,7 @@ static int __net_init tcp_sk_init(struct net *net)
 
 	/* rfc5961 challenge ack rate limiting, per net-ns, disabled by default. */
 	net->ipv4.sysctl_tcp_challenge_ack_limit = INT_MAX;
+	net->ipv4.sysctl_tcp_reset_challenge = 0;
 
 	net->ipv4.sysctl_tcp_min_tso_segs = 2;
 	net->ipv4.sysctl_tcp_tso_rtt_log = 9;  /* 2^9 = 512 usec */
--
2.30.2