Message-ID: <a413b206-df50-4445-a4de-494339ea1ce6@linux.dev>
Date: Thu, 11 Jan 2024 22:20:06 -0800
From: Martin KaFai Lau <martin.lau@...ux.dev>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: Kuniyuki Iwashima <kuni1840@...il.com>, bpf@...r.kernel.org,
netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Yonghong Song <yonghong.song@...ux.dev>
Subject: Re: [PATCH v7 bpf-next 0/6] bpf: tcp: Support arbitrary SYN Cookie at
TC.
On 12/20/23 5:28 PM, Kuniyuki Iwashima wrote:
> Under SYN Flood, the TCP stack generates a SYN Cookie to remain stateless
> for the connection request until a valid ACK arrives in response to the SYN+ACK.
>
> The cookie contains two kinds of host-specific bits, a timestamp and
> secrets, so it can be validated only by its generator. This means SYN
> Cookie consumes network resources between the client and the server;
> intermediate nodes must remember which node to route the ACK for the cookie to.
>
> SYN Proxy reduces such unwanted resource allocation by handling 3WHS at
> the edge network. After SYN Proxy completes 3WHS, it forwards SYN to the
> backend server and completes another 3WHS. However, since the server's
> ISN differs from the cookie, the proxy must manage the ISN mappings and
> fix up SEQ/ACK numbers in every packet for each connection. If a proxy
> node goes down, all the connections through it are terminated. Keeping
> a state at proxy is painful from that perspective.
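
To make the per-connection cost above concrete, here is a minimal sketch of
the SEQ/ACK fix-up a stateful proxy has to do. The struct and function names
are hypothetical and not taken from the series or from any existing proxy;
the only point is that both ISNs must be remembered for the lifetime of the
connection.

```c
#include <stdint.h>

/* Hypothetical per-connection state kept by a stateful SYN proxy. */
struct isn_map {
	uint32_t cookie_isn;	/* ISN the proxy sent to the client (the cookie) */
	uint32_t server_isn;	/* ISN the backend chose in its own SYN+ACK */
};

/* Server -> client: move SEQ from the server's space into the cookie's.
 * Unsigned 32-bit arithmetic wraps modulo 2^32, like TCP sequence numbers.
 */
static uint32_t fixup_seq_to_client(const struct isn_map *m, uint32_t seq)
{
	return (uint32_t)(seq - m->server_isn + m->cookie_isn);
}

/* Client -> server: move ACK from the cookie's space into the server's. */
static uint32_t fixup_ack_to_server(const struct isn_map *m, uint32_t ack)
{
	return (uint32_t)(ack - m->cookie_isn + m->server_isn);
}
```

Every data packet in either direction goes through one of these, and losing
the struct (e.g. when the proxy node dies) breaks the connection.
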
>
> At AWS, we use a dirty hack to build truly stateless SYN Proxy at scale.
> Our SYN Proxy consists of the front proxy layer and the backend kernel
> module. (See slides of LPC2023 [0], p37 - p48)
>
> The cookie that SYN Proxy generates differs from the kernel's cookie in
> that it contains a secret (called rolling salt) (i) shared by all the proxy
> nodes so that any node can validate ACK and (ii) updated periodically so
> that old cookies cannot be validated and we need not encode a timestamp for
> the cookie. Also, the ISN itself encodes WScale, SACK, and ECN, rather than
> TS val. This avoids sacrificing connection quality for customers who turn
> off the TCP timestamps option due to retro CVE.
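
A rough sketch of what such a cookie could look like, to make the idea
concrete. The bit layout, the names, and the mixing function are all
assumptions made for illustration; they are not the layout used by the AWS
proxy or by the selftest in this series. toy_hash() is not cryptographic, and
a real implementation would use a keyed hash such as SipHash.

```c
#include <stdbool.h>
#include <stdint.h>

#define OPT_BITS 9			/* hypothetical: low 9 bits carry options */
#define OPT_MASK ((1u << OPT_BITS) - 1)

struct cookie_opts {
	uint8_t snd_wscale;	/* client's WScale, 0..14 */
	bool wscale_ok;
	bool sack_ok;
	bool ecn_ok;
	uint8_t mss_idx;	/* index into a 4-entry MSS table */
};

/* Placeholder 64-bit mixer standing in for a real keyed hash. */
static uint32_t toy_hash(uint64_t flow, uint64_t salt)
{
	uint64_t x = flow ^ salt;

	x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return (uint32_t)x;
}

static uint32_t pack_opts(const struct cookie_opts *o)
{
	return (o->snd_wscale & 0xfu) |
	       (uint32_t)o->wscale_ok << 4 |
	       (uint32_t)o->sack_ok << 5 |
	       (uint32_t)o->ecn_ok << 6 |
	       (uint32_t)(o->mss_idx & 0x3u) << 7;
}

/* ISN = [31:9] hash over flow, rolling salt and options | [8:0] options. */
static uint32_t cookie_make(uint64_t flow, uint64_t salt,
			    const struct cookie_opts *o)
{
	uint32_t opts = pack_opts(o);

	return (toy_hash(flow, salt ^ opts) & ~OPT_MASK) | opts;
}

/* Any node that knows the current salt can validate and recover options. */
static bool cookie_check(uint32_t cookie, uint64_t flow, uint64_t salt,
			 struct cookie_opts *o)
{
	uint32_t opts = cookie & OPT_MASK;

	if ((cookie & ~OPT_MASK) != (toy_hash(flow, salt ^ opts) & ~OPT_MASK))
		return false;

	o->snd_wscale = opts & 0xf;
	o->wscale_ok = (opts >> 4) & 1;
	o->sack_ok = (opts >> 5) & 1;
	o->ecn_ok = (opts >> 6) & 1;
	o->mss_idx = (opts >> 7) & 0x3;
	return true;
}
```

Since no timestamp is encoded, expiry in this sketch comes only from rotating
the salt: a cookie made with an old salt no longer matches the recomputed hash
(up to truncated-hash collisions).
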
>
> After 3WHS, the proxy restores SYN, encapsulates ACK into SYN, and forwards
> the TCP-in-TCP packet to the backend server. Our kernel module works at
> Netfilter input/output hooks and first feeds SYN to the TCP stack to
> initiate 3WHS. When the module is triggered for SYN+ACK, it looks up the
> corresponding request socket and overwrites tcp_rsk(req)->snt_isn with the
> proxy's cookie. Then, the module can complete 3WHS with the original ACK
> as is.
>
> This way, our SYN Proxy neither manages the ISN mappings nor waits for
> SYN+ACK from the backend, and thus can remain stateless. It's working very
> well for high-bandwidth services like multiple Tbps, but we are looking
> for a way to drop the dirty hack and further optimise the sequences.
>
> If we could validate an arbitrary SYN Cookie on the backend server with
> BPF, the proxy would not need to restore SYN or pass it. After validating
> ACK, the proxy node just needs to forward it, and then the server can do
> the lightweight validation (e.g. check if ACK came from proxy nodes, etc)
> and create a connection from the ACK.
>
> This series allows us to create a full sk from an arbitrary SYN Cookie,
> which is done in 3 steps.
>
> 1) At tc, BPF prog calls a new kfunc to create a reqsk and configure
> it based on the argument populated from SYN Cookie. The reqsk has
> its listener as req->rsk_listener and is passed to the TCP stack as
> skb->sk.
>
> 2) During TCP socket lookup for the skb, skb_steal_sock() returns a
> listener in the reuseport group that inet_reqsk(skb->sk)->rsk_listener
> belongs to.
>
> 3) In cookie_v[46]_check(), the reqsk (skb->sk) is fully initialised and
> a full sk is created.
>
> The kfunc usage is as follows:
>
> struct bpf_tcp_req_attrs attrs = {
>     .mss = mss,
>     .wscale_ok = wscale_ok,
>     .rcv_wscale = rcv_wscale, /* Server's WScale < 15 */
>     .snd_wscale = snd_wscale, /* Client's WScale < 15 */
>     .tstamp_ok = tstamp_ok,
>     .rcv_tsval = tsval,
>     .rcv_tsecr = tsecr, /* Server's Initial TSval */
>     .usec_ts_ok = usec_ts_ok,
>     .sack_ok = sack_ok,
>     .ecn_ok = ecn_ok,
> };
>
> skc = bpf_skc_lookup_tcp(...);
> sk = (struct sock *)bpf_skc_to_tcp_sock(skc);
> bpf_sk_assign_tcp_reqsk(skb, sk, &attrs, sizeof(attrs));
> bpf_sk_release(skc);
>
> [0]: https://lpc.events/event/17/contributions/1645/attachments/1350/2701/SYN_Proxy_at_Scale_with_BPF.pdf
>
>
> Changes:
> v7:
> * Patch 5 & 6
> * Drop MPTCP support

I think Yonghong's (thanks!) cpuv4 patch
(https://lore.kernel.org/bpf/20240110051348.2737007-1-yonghong.song@linux.dev/)
has addressed the issue that the selftest in patch 6 ran into.

There are some minor comments in v7. Please respin v8 once the cpuv4 patch has
concluded so that it also kicks off the CI.