Message-ID: <87sfwxfk7m.fsf@cloudflare.com>
Date: Tue, 19 Oct 2021 11:16:29 +0200
From: Jakub Sitnicki <jakub@...udflare.com>
To: John Fastabend <john.fastabend@...il.com>
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org, daniel@...earbox.net,
joamaki@...il.com, xiyou.wangcong@...il.com
Subject: Re: [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive
verdict with redirect to self
On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> A socket in a sockmap may have different combinations of programs
> attached depending on configuration. There can be no programs, in
> which case the socket acts as a sink only. There can be a TX program
> only, where a BPF program is attached on the sending side but no RX
> program is attached. There can be an RX program only, where sends
> have no BPF program attached but receives are hooked with BPF. And
> finally, both TX and RX programs may be attached. Giving us the
> permutations,
>
> None, Tx, Rx, and TxRx
>
> To date most of our use cases have been the TX case, used as a fast
> datapath to copy directly between a local application and a userspace
> proxy, or the Rx and TxRx cases operating an in-kernel proxy. The
> traffic in the first case, where we hook one application into another
> userspace application, looks like this,
>
> AppA redirect AppB
> Tx <-----------> Rx
> | |
> + +
> TCP <--> lo <--> TCP
>
> In this case all traffic from AppA (after the 3-way handshake) is
> copied into the AppB ingress queue and no traffic is ever on the TCP
> receive_queue.
>
> In the second case the application never receives traffic on the
> actual user space socket, except in some rare error cases. Instead
> the send happens in the kernel.
>
> AppProxy socket pool
> sk0 ------------->{sk1,sk2, skn}
> ^ |
> | |
> | v
> ingress lb egress
> TCP TCP
>
> Here, because traffic is never read off the socket with userspace
> recv() APIs, there is only ever one reader on the sk receive_queue,
> namely the BPF programs.
>
> However, we've started to introduce a third configuration where the
> BPF program on receive should process the data, but then the normal
> case is to push the data into the receive queue of AppB.
>
> AppB
> recv() (userspace)
> -----------------------
> tcp_bpf_recvmsg() (kernel)
> | |
> | |
> | |
> ingress_msgQ |
> | |
> RX_BPF |
> | |
> v v
> sk->receive_queue
>
>
> This is different from the App{A,B} redirect because traffic is
> first received on the sk->receive_queue.
>
> Now for the issue. The tcp_bpf_recvmsg() handler first checks the
> ingress_msg queue for any data handled by the BPF rx program and
> returned with PASS code so that it was enqueued on the ingress msg
> queue. Then if no data exists on that queue it checks the socket
> receive queue. Unfortunately, this is the same receive_queue the
> BPF program is reading data off of. So we get a race. It's possible
> for the recvmsg() hook to pull data off the receive_queue before
> the BPF hook has a chance to read it. It typically happens when
> an application is banging on recv() and getting EAGAINs, until
> it manages to win the race with the RX BPF program.
>
> To fix this, note that before this patch, at attach time when the
> socket is loaded into the map, we only check whether it needs a TX
> program or just the base set of proto BPF hooks. It then uses the
> general RX hook above regardless of whether a BPF program is
> attached at RX or not. This patch extends that check to handle all
> the cases enumerated above: TX, RX, TXRX, and none. And to fix the
> above race when an RX program is attached, we use a new hook that is
> nearly identical to the old one, except now we do not let the recv()
> call skip the RX BPF program. Now only the BPF program pulls data
> from sk->receive_queue, and recv() only pulls data from the ingress
> msgQ after BPF program handling.
>
> With this resolved, our AppB from above has been up and running for
> many hours without detecting any errors. We verify this by
> correlating counters in the RX BPF events and in AppB to ensure data
> never skips the BPF program. Selftests were not able to detect this
> because we only run them for a short period of time on well-ordered
> send/recvs, so we don't get any of the noise we see in real
> application environments.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@...il.com>
> ---
Acked-by: Jakub Sitnicki <jakub@...udflare.com>