[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM_iQpVShCqWx1CYUOO9SrgWw7ztVP6j=v=W9dAd9FbChGZauQ@mail.gmail.com>
Date: Wed, 24 Mar 2021 17:44:24 -0700
From: Cong Wang <xiyou.wangcong@...il.com>
To: John Fastabend <john.fastabend@...il.com>
Cc: Andrii Nakryiko <andrii@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Alexei Starovoitov <ast@...com>, bpf <bpf@...r.kernel.org>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Lorenz Bauer <lmb@...udflare.com>
Subject: Re: [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting
On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@...il.com> wrote:
>
> Incorrect accounting fwd_alloc can result in a warning when the socket
> is torn down,
>
> [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
> [...]
> [18455.319543] Call Trace:
> [18455.319556] inet_csk_destroy_sock+0xba/0x1f0
> [18455.319577] tcp_rcv_state_process+0x1b4e/0x2380
> [18455.319593] ? lock_downgrade+0x3a0/0x3a0
> [18455.319617] ? tcp_finish_connect+0x1e0/0x1e0
> [18455.319631] ? sk_reset_timer+0x15/0x70
> [18455.319646] ? tcp_schedule_loss_probe+0x1b2/0x240
> [18455.319663] ? lock_release+0xb2/0x3f0
> [18455.319676] ? __release_sock+0x8a/0x1b0
> [18455.319690] ? lock_downgrade+0x3a0/0x3a0
> [18455.319704] ? lock_release+0x3f0/0x3f0
> [18455.319717] ? __tcp_close+0x2c6/0x790
> [18455.319736] ? tcp_v4_do_rcv+0x168/0x370
> [18455.319750] tcp_v4_do_rcv+0x168/0x370
> [18455.319767] __release_sock+0xbc/0x1b0
> [18455.319785] __tcp_close+0x2ee/0x790
> [18455.319805] tcp_close+0x20/0x80
>
> This currently happens because on redirect case we do skb_set_owner_r()
> with the original sock. This increments the fwd_alloc memory accounting
> on the original sock. Then on redirect we may push this into the queue
> of the psock we are redirecting to. When the skb is flushed from the
> queue we give the memory back to the original sock. The problem is if
> the original sock is destroyed/closed with skbs on another psocks queue
> then the original sock will not have a way to reclaim the memory before
> being destroyed. Then above warning will be thrown
>
> sockA sockB
>
> sk_psock_strp_read()
> sk_psock_verdict_apply()
> -- SK_REDIRECT --
> sk_psock_skb_redirect()
> skb_queue_tail(psock_other->ingress_skb..)
>
> sk_close()
> sock_map_unref()
> sk_psock_put()
> sk_psock_drop()
> sk_psock_zap_ingress()
>
> At this point we have torn down our own psock, but have the outstanding
> skb in psock_other. Note that SK_PASS doesn't have this problem because
> the sk_psock_drop() logic releases the skb, its still associated with
> our psock.
>
> To resolve lets only account for sockets on the ingress queue that are
> still associated with the current socket. On the redirect case we will
> check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
> until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
> or received with sk_psock_skb_ingress memory will be claimed on psock_other.
You mean sk_psock_skb_ingress(), right?
>
> Reported-by: Andrii Nakryiko <andrii@...nel.org>
> Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
> Signed-off-by: John Fastabend <john.fastabend@...il.com>
> ---
> net/core/skmsg.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 1261512d6807..f150b5b63561 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
> if (unlikely(!msg))
> return -EAGAIN;
> sk_msg_init(msg);
> + skb_set_owner_r(skb, sk);
> return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
> }
>
> @@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int
> {
> switch (verdict) {
> case __SK_REDIRECT:
> - skb_set_owner_r(skb, sk);
> sk_psock_skb_redirect(skb);
> break;
> case __SK_PASS:
> @@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
> rcu_read_lock();
> prog = READ_ONCE(psock->progs.skb_verdict);
> if (likely(prog)) {
> - /* We skip full set_owner_r here because if we do a SK_PASS
> - * or SK_DROP we can skip skb memory accounting and use the
> - * TLS context.
> - */
> skb->sk = psock->sk;
> tcp_skb_bpf_redirect_clear(skb);
> ret = sk_psock_bpf_run(psock, prog, skb);
> @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
> kfree_skb(skb);
> goto out;
> }
> - skb_set_owner_r(skb, sk);
> prog = READ_ONCE(psock->progs.skb_verdict);
> if (likely(prog)) {
> + skb->sk = psock->sk;
Why is skb_orphan() not needed here?
Nit: You can just use 'sk' here, so "skb->sk = sk".
> tcp_skb_bpf_redirect_clear(skb);
> ret = sk_psock_bpf_run(psock, prog, skb);
> ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
> + skb->sk = NULL;
Why do you want to set it to NULL here?
Thanks.
Powered by blists - more mailing lists