Message-ID: <87sfwxfk7m.fsf@cloudflare.com>
Date:   Tue, 19 Oct 2021 11:16:29 +0200
From:   Jakub Sitnicki <jakub@...udflare.com>
To:     John Fastabend <john.fastabend@...il.com>
Cc:     bpf@...r.kernel.org, netdev@...r.kernel.org, daniel@...earbox.net,
        joamaki@...il.com, xiyou.wangcong@...il.com
Subject: Re: [PATCH bpf 2/4] bpf, sockmap: Fix race in ingress receive
 verdict with redirect to self


On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> A socket in a sockmap may have different combinations of programs
> attached depending on configuration. There can be no programs, in
> which case the socket acts as a sink only. There can be a TX program
> only, where a BPF program is attached to the sending side but no RX
> program is attached. There can be an RX program only, where sends have
> no BPF program attached but receives are hooked with BPF. And finally,
> both TX and RX programs may be attached. This gives us the permutations,
>
>  None, Tx, Rx, and TxRx
>
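A rough sketch of how those four permutations might be enumerated and
picked at attach time is below; the names are illustrative only, not the
kernel's actual definitions:

  /* Illustrative only -- not the kernel's actual enum or helpers. */
  #include <stdbool.h>

  enum psock_cfg {
          CFG_NONE,       /* no programs: socket is a plain sink      */
          CFG_TX,         /* sendmsg hooked by BPF, recvmsg untouched */
          CFG_RX,         /* recvmsg hooked by BPF, sendmsg untouched */
          CFG_TXRX,       /* both directions hooked                   */
  };

  static enum psock_cfg psock_pick_cfg(bool has_tx_prog, bool has_rx_prog)
  {
          if (has_tx_prog && has_rx_prog)
                  return CFG_TXRX;
          if (has_tx_prog)
                  return CFG_TX;
          if (has_rx_prog)
                  return CFG_RX;
          return CFG_NONE;
  }
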
> To date most of our use cases have been either the TX case, used as a
> fast datapath to copy directly between a local application and a
> userspace proxy, or the Rx and TxRx cases, used by applications
> operating an in-kernel proxy. The traffic in the first case, where we
> hook applications into a userspace proxy, looks like this,
>
>   AppA  redirect   AppB
>    Tx <-----------> Rx
>    |                |
>    +                +
>    TCP <--> lo <--> TCP
>
> In this case all traffic from AppA (after the 3-way handshake) is
> copied into the AppB ingress queue and no traffic is ever on the TCP
> receive_queue.
>
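A minimal sk_msg program for this redirect fast path might look roughly
like the sketch below; the map, key, and program names are assumptions
for illustration, not taken from the patch:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_SOCKMAP);
          __uint(max_entries, 2);
          __type(key, __u32);
          __type(value, __u64);
  } sock_map SEC(".maps");

  SEC("sk_msg")
  int appa_to_appb(struct sk_msg_md *msg)
  {
          __u32 key = 0;  /* hypothetical slot holding AppB's socket */

          /* BPF_F_INGRESS places AppA's data directly on the peer's
           * ingress queue, so it never lands on a TCP receive_queue. */
          return bpf_msg_redirect_map(msg, &sock_map, key, BPF_F_INGRESS);
  }

  char _license[] SEC("license") = "GPL";
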
> In the second case the application never receives traffic on the
> actual user space socket, except in some rare error cases. Instead
> the send happens in the kernel.
>
>            AppProxy       socket pool
>        sk0 ------------->{sk1,sk2, skn}
>         ^                      |
>         |                      |
>         |                      v
>        ingress              lb egress
>        TCP                  TCP
>
> Here, because traffic is never read off the socket with the userspace
> recv() APIs, there is only ever one reader on the sk receive_queue,
> namely the BPF programs.
>
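For this in-kernel proxy shape, the lb egress step would typically be a
stream verdict program redirecting on the egress path. A hedged sketch,
reusing the hypothetical sock_map from the earlier fragment and with a
made-up program name:

  SEC("sk_skb/stream_verdict")
  int lb_egress(struct __sk_buff *skb)
  {
          __u32 key = 0;  /* real code would load-balance over sk1..skn */

          /* flags == 0 (no BPF_F_INGRESS) redirects to the egress path
           * of the chosen backend socket. */
          return bpf_sk_redirect_map(skb, &sock_map, key, 0);
  }
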
> However, we've started to introduce a third configuration where the
> BPF program on receive should process the data, but then the normal
> case is to push the data into the receive queue of AppB.
>
>        AppB
>        recv()                (userspace)
>      -----------------------
>        tcp_bpf_recvmsg()     (kernel)
>          |             |
>          |             |
>          |             |
>        ingress_msgQ    |
>          |             |
>        RX_BPF          |
>          |             |
>          v             v
>        sk->receive_queue
>
>
> This is different from the App{A,B} redirect because traffic is
> first received on the sk->receive_queue.
>
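In other words, the RX program in this third configuration is just a
verdict program that inspects the data and returns SK_PASS, along these
lines (a fragment in the same vein as the sketches above):

  SEC("sk_skb/stream_verdict")
  int rx_inspect(struct __sk_buff *skb)
  {
          /* ... inspect/account the payload here ... */

          /* SK_PASS queues the data for the socket's own recv() via
           * the psock ingress path instead of dropping/redirecting. */
          return SK_PASS;
  }
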
> Now for the issue. The tcp_bpf_recvmsg() handler first checks the
> ingress_msg queue for any data that the BPF rx program handled and
> returned with a PASS code, so that it was enqueued on the ingress msg
> queue. Then, if no data exists on that queue, it checks the socket
> receive queue. Unfortunately, this is the same receive_queue the BPF
> program is reading data off of, so we get a race. It's possible for
> the recvmsg() hook to pull data off the receive_queue before the BPF
> hook has a chance to read it. This typically happens when an
> application is banging on recv() and getting EAGAINs, until it
> manages to race with the RX BPF program.
>
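Put as pseudocode, the racy flow looks roughly like this; the helper
names are placeholders, not the actual kernel functions:

  int tcp_bpf_recvmsg_sketch(struct sock *sk, struct msghdr *msg, size_t len)
  {
          int copied;

          /* 1. Drain data the RX BPF program already PASSed. */
          copied = copy_from_ingress_msg_queue(sk, msg, len);
          if (copied > 0)
                  return copied;

          /* 2. RACE: fall back to sk->sk_receive_queue -- the very
           *    queue the RX BPF program reads from. If recv() wins the
           *    race here, the data skips the BPF program entirely. */
          return copy_from_sk_receive_queue(sk, msg, len);
  }
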
> To fix this, note that before this patch, at attach time when the
> socket is loaded into the map, we check whether it needs a TX program
> or just the base set of proto bpf hooks, and then use the above
> general RX hook regardless of whether a BPF program is attached at rx
> or not. This patch extends that check to handle all the cases
> enumerated above: TX, RX, TXRX, and none. And to fix the above race,
> when an RX program is attached we use a new hook that is nearly
> identical to the old one, except now we do not let the recv() call
> skip the RX BPF program. Now only the BPF program pulls data from
> sk->receive_queue, and recv() only pulls data from the ingress msgQ
> after BPF program handling.
>
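So, roughly speaking, the RX-attached recvmsg hook becomes something
like the sketch below, again with placeholder helper names rather than
the patch's actual functions:

  int tcp_bpf_recvmsg_rx_sketch(struct sock *sk, struct msghdr *msg, size_t len)
  {
          /* Never touch sk->sk_receive_queue here: only the RX BPF
           * program reads that queue. recv() sees data only after the
           * program has run and queued it on the ingress msg queue. */
          return wait_and_copy_from_ingress_msg_queue(sk, msg, len);
  }
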
> With this resolved, our AppB from above has been up and running for
> many hours without detecting any errors. We verify this by correlating
> counters in RX BPF events and in AppB to ensure data never skips the
> BPF program. Selftests were not able to detect this because we only
> run them for a short period of time on well-ordered send/recvs, so we
> don't get any of the noise we see in real application environments.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@...il.com>
> ---

Acked-by: Jakub Sitnicki <jakub@...udflare.com>
