netdev - Re: [BUG bpf-next] bpf/net: Hitting gpf when running selftests

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <64f0f60b1417c_d45ab2086b@john.notmuch>
Date: Thu, 31 Aug 2023 13:20:27 -0700
From: John Fastabend <john.fastabend@...il.com>
To: John Fastabend <john.fastabend@...il.com>, 
 Xu Kuohai <xukuohai@...wei.com>, 
 Jiri Olsa <olsajiri@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>, 
 Daniel Borkmann <daniel@...earbox.net>, 
 Andrii Nakryiko <andrii@...nel.org>, 
 netdev@...r.kernel.org, 
 bpf@...r.kernel.org, 
 Martin KaFai Lau <kafai@...com>, 
 Song Liu <songliubraving@...com>, 
 Yonghong Song <yhs@...com>, 
 John Fastabend <john.fastabend@...il.com>, 
 KP Singh <kpsingh@...omium.org>, 
 Stanislav Fomichev <sdf@...gle.com>, 
 Hao Luo <haoluo@...gle.com>, 
 Hou Tao <houtao1@...wei.com>, 
 Bobby Eshleman <bobby.eshleman@...edance.com>
Subject: Re: [BUG bpf-next] bpf/net: Hitting gpf when running selftests

John Fastabend wrote:
> Xu Kuohai wrote:
> > On 8/31/2023 5:46 PM, Xu Kuohai wrote:
> > > On 8/31/2023 5:07 AM, Jiri Olsa wrote:
> > >> On Wed, Aug 30, 2023 at 08:58:11PM +0200, Jiri Olsa wrote:
> > >>> hi,
> > >>> I'm hitting crash below on bpf-next/master when running selftests,
> > >>> full log and config attached
> > >>
> > >> it seems to be 'test_progs -t sockmap_listen' triggering that
> > >>
> > >> jirka
> > >>
> > >>>
> > >>> jirka
> > >>>
> > >>>
> > >>> ---
> > >>> [ 1022.710250][ T2556] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b73: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI^M
> > >>> [ 1022.711206][ T2556] CPU: 2 PID: 2556 Comm: kworker/2:4 Tainted: G           OE      6.5.0+ #693 1723c8b9805ff5a1672ab7e6f25977078a7bcceb^M
> > >>> [ 1022.712120][ T2556] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014^M
> > >>> [ 1022.712830][ T2556] Workqueue: events sk_psock_backlog^M
> > >>> [ 1022.713262][ T2556] RIP: 0010:skb_dequeue+0x4c/0x80^M
> > >>> [ 1022.713653][ T2556] Code: 41 48 85 ed 74 3c 8b 43 10 4c 89 e7 83 e8 01 89 43 10 48 8b 45 08 48 8b 55 00 48 c7 45 08 00 00 00 00 48 c7 45 00 00 00 00 00 <48> 89 42 08 48 89 10 e8 e8 6a 41 00 48 89 e8 5b 5d 41 5c c3 cc cc^M
> > >>> [ 1022.714963][ T2556] RSP: 0018:ffffc90003ca7dd0 EFLAGS: 00010046^M
> > >>> [ 1022.715431][ T2556] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88811de269d0 RCX: 0000000000000000^M
> > >>> [ 1022.716068][ T2556] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000282 RDI: ffff88811de269e8^M
> > >>> [ 1022.716676][ T2556] RBP: ffff888141ae39c0 R08: 0000000000000001 R09: 0000000000000000^M
> > >>> [ 1022.717283][ T2556] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88811de269e8^M
> > >>> [ 1022.717930][ T2556] R13: 0000000000000001 R14: ffff888141ae39c0 R15: ffff88810a20e640^M
> > >>> [ 1022.718549][ T2556] FS:  0000000000000000(0000) GS:ffff88846d600000(0000) knlGS:0000000000000000^M
> > >>> [ 1022.719241][ T2556] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> > >>> [ 1022.719761][ T2556] CR2: 00007fb5c25ca000 CR3: 000000012b902004 CR4: 0000000000770ee0^M
> > >>> [ 1022.720394][ T2556] PKRU: 55555554^M
> > >>> [ 1022.720699][ T2556] Call Trace:^M
> > >>> [ 1022.720984][ T2556]  <TASK>^M
> > >>> [ 1022.721254][ T2556]  ? die_addr+0x32/0x80^M
> > >>> [ 1022.721589][ T2556]  ? exc_general_protection+0x25a/0x4b0^M
> > >>> [ 1022.722026][ T2556]  ? asm_exc_general_protection+0x22/0x30^M
> > >>> [ 1022.722489][ T2556]  ? skb_dequeue+0x4c/0x80^M
> > >>> [ 1022.722854][ T2556]  sk_psock_backlog+0x27a/0x300^M
> > >>> [ 1022.723243][ T2556]  process_one_work+0x2a7/0x5b0^M
> > >>> [ 1022.723633][ T2556]  worker_thread+0x4f/0x3a0^M
> > >>> [ 1022.723998][ T2556]  ? __pfx_worker_thread+0x10/0x10^M
> > >>> [ 1022.724386][ T2556]  kthread+0xfd/0x130^M
> > >>> [ 1022.724709][ T2556]  ? __pfx_kthread+0x10/0x10^M
> > >>> [ 1022.725066][ T2556]  ret_from_fork+0x2d/0x50^M
> > >>> [ 1022.725409][ T2556]  ? __pfx_kthread+0x10/0x10^M
> > >>> [ 1022.725799][ T2556]  ret_from_fork_asm+0x1b/0x30^M
> > >>> [ 1022.726201][ T2556]  </TASK>^M
> > >>
> > >>
> > >> .
> > > 
> > > My patch failed on the BPF CI, and the log shows the test also died in skb_dequeue:
> > > 
> > > https://github.com/kernel-patches/bpf/actions/runs/6031993528/job/16366782122
> > > 
> > > [...]
> > > 
> > >    [   74.396478]  ? __die_body+0x1f/0x70
> > >    [   74.396700]  ? page_fault_oops+0x15b/0x450
> > >    [   74.396957]  ? fixup_exception+0x26/0x330
> > >    [   74.397211]  ? exc_page_fault+0x68/0x1a0
> > >    [   74.397457]  ? asm_exc_page_fault+0x26/0x30
> > >    [   74.397724]  ? skb_dequeue+0x52/0x90
> > >    [   74.397954]  sk_psock_destroy+0x8c/0x2b0
> > >    [   74.398204]  process_one_work+0x28a/0x550
> > >    [   74.398458]  ? __pfx_worker_thread+0x10/0x10
> > >    [   74.398730]  worker_thread+0x51/0x3c0
> > >    [   74.398966]  ? __pfx_worker_thread+0x10/0x10
> > >    [   74.399235]  kthread+0xf7/0x130
> > >    [   74.399437]  ? __pfx_kthread+0x10/0x10
> > >    [   74.399707]  ret_from_fork+0x34/0x50
> > >    [   74.399967]  ? __pfx_kthread+0x10/0x10
> > >    [   74.400234]  ret_from_fork_asm+0x1b/0x30
> > > 
> > > 
> > > After a few tries, I found a way to reproduce the problem.
> > > 
> > > Here is the reproduce steps:
> > > 
> > > 1. create a kprobe to delay sk_psock_backlog:
> > > 
> > > static struct kprobe kp = {
> > >          .symbol_name = "sk_psock_backlog",
> > >          .offset = 0x00,
> > > };
> > > 
> > > static int handler_pre(struct kprobe *p, struct pt_regs *regs)
> > > {
> > >          mdelay(1000);
> > >          return 0;
> > > }
> > > 
> > > static int __init kprobe_init(void)
> > > {
> > >          int ret;
> > > 
> > >          kp.pre_handler = handler_pre;
> > > 
> > >          ret = register_kprobe(&kp);
> > >          if (ret < 0) {
> > >                  return -1;
> > >          }
> > > 
> > >          return 0;
> > > }
> > > 
> > > 2. insert the kprobe and run the vsock sockmap test:
> > > 
> > > ./test_progs -t "sockmap_listen/sockmap VSOCK test_vsock_redir"
> > > 
> > > 
> > > 
> > > I guess the problem is in sk_psock_backlog, where skb is inserted to another
> > > list before skb_dequeue is called.
> > > 
> > > So I tested it with the following changes, and found the problem did go away.
> > > 
> > > --- a/net/core/skmsg.c
> > > +++ b/net/core/skmsg.c
> > > @@ -648,7 +648,7 @@ static void sk_psock_backlog(struct work_struct *work)
> > >                  off = state->off;
> > >          }
> > > 
> > > -       while ((skb = skb_peek(&psock->ingress_skb))) {
> > > +       while ((skb = skb_dequeue(&psock->ingress_skb))) {
> > >                  len = skb->len;
> > >                  off = 0;
> > >                  if (skb_bpf_strparser(skb)) {
> > > @@ -684,7 +684,6 @@ static void sk_psock_backlog(struct work_struct *work)
> > >                          len -= ret;
> > >                  } while (len);
> > > 
> > > -               skb = skb_dequeue(&psock->ingress_skb);
> > >                  if (!ingress) {
> > >                          kfree_skb(skb);
> > >                  }
> > > 
> > > Not clear what exactly happened, needs more debugging.
> > >
> 
> I can only reproduce this on bpf-next so specific to
> the vsock use case?
> 
> > 
> > Use the skb address obtained from skb_peek() in sk_psock_backlog() as the key,
> > 4 stack traces are obtained.
> > 
> > 
> > trace 0, the skb is queued to the target socket ingress queue:
> > 
> > [  120.042016] sk_psock_skb_ingress_enqueue+0xf5/0x160
> > [  120.045052] sk_psock_backlog+0x206/0x400
> > [  120.047366] process_one_work+0x292/0x560
> > [  120.049657] worker_thread+0x53/0x3e0
> > [  120.051698] kthread+0x102/0x130
> > [  120.053497] ret_from_fork+0x34/0x50
> > [  120.055528] ret_from_fork_asm+0x1b/0x30
> > 
> > 
> > trace 1, the skb is consumed by the user:
> > 
> > [  120.061537] consume_skb+0x47/0x100
> > [  120.063394] sk_msg_recvmsg+0x268/0x3e0
> > [  120.065458] unix_bpf_recvmsg+0x16c/0x610
> > [  120.067584] unix_stream_recvmsg+0x66/0xa0
> > [  120.069946] sock_recvmsg+0xc4/0xd0
> > [  120.072063] __sys_recvfrom+0xaf/0x120
> > [  120.073933] __x64_sys_recvfrom+0x29/0x30
> > [  120.076052] do_syscall_64+0x3f/0x90
> > [  120.077986] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > 
> > trace 2, the vsock socket is closed by the user, and a new skb with
> > the same address is allocated in vsock_release:
> > 
> > [  120.084296] __alloc_skb+0xe3/0x180
> > [  120.086335] virtio_transport_alloc_skb+0x3b/0x2c0
> > [  120.089174] virtio_transport_send_pkt_info+0x118/0x230
> > [  120.092191] virtio_transport_release+0x29d/0x400
> > [  120.094845] __vsock_release+0x3c/0x1e0
> > [  120.096905] vsock_release+0x18/0x30
> > [  120.098899] __sock_release+0x3d/0xc0
> > [  120.100885] sock_close+0x18/0x20
> > [  120.102606] __fput+0x108/0x2b0
> > [  120.104636] task_work_run+0x5d/0xa0
> > [  120.106876] exit_to_user_mode_prepare+0x18c/0x190
> > [  120.109619] syscall_exit_to_user_mode+0x1d/0x50
> > [  120.112049] do_syscall_64+0x4c/0x90
> > [  120.114115] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > 
> > trace 3, sk_psock_backlog() calls skb_dequeue() to unlink the skb, since
> > this skb is now actually a new skb allocated in vsock_release, its prev
> > and next fields are both set to NULL, NULL deref occurs.
> > 
> > [  120.120619] skb_dequeue+0x92/0xa0
> > [  120.122350] sk_psock_backlog+0x305/0x400
> > [  120.124512] process_one_work+0x292/0x560
> > [  120.126771] worker_thread+0x53/0x3e0
> > [  120.128843] kthread+0x102/0x130
> > [  120.130772] ret_from_fork+0x34/0x50
> > 
> > To fix it, it seems reasonable to replace skb_peek() with skb_dequeue()
> > in sk_psock_backlog(), since we can't prevent the skb from being appended
> > to an ingress queue and consumed by user, as shown in trace 1 and trace 2.
> 
> The trouble with skb_dequeue is it breaks other checks that check
> the backlog queue length. It really is nice to have a single len
> check that determines if backlog is necessary or not.
> 
> If we revert something we likely need to go back to holding the
> sock lock in backlog to ensure a reader can't eat the skb while
> We still have a reference to it. It wasn't an issue for us because
> its the exception case.
> 
> Trying to come up with some nice fix now.

Something like this it fixes the splat, but need to think if it
introduces anything or some better way to do this. Basic idea
is to bump user->refcnt because we have two references to the
skb and want to ensure we really only kfree_skb() the skb
after both references are dropped.

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index a0659fc29bcc..6c31eefbd777 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -612,12 +612,18 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
                               u32 off, u32 len, bool ingress)
 {
+       int err = 0;
+
        if (!ingress) {
                if (!sock_writeable(psock->sk))
                        return -EAGAIN;
                return skb_send_sock(psock->sk, skb, off, len);
        }
-       return sk_psock_skb_ingress(psock, skb, off, len);
+       skb_get(skb);
+       err = sk_psock_skb_ingress(psock, skb, off, len);
+       if (err < 0)
+               kfree_skb(skb);
+       return err;
 }
 
 static void sk_psock_skb_state(struct sk_psock *psock,
@@ -685,9 +691,7 @@ static void sk_psock_backlog(struct work_struct *work)
                } while (len);
 
                skb = skb_dequeue(&psock->ingress_skb);
-               if (!ingress) {
-                       kfree_skb(skb);
-               }
+               kfree_skb(skb);
        }
 end:
        mutex_unlock(&psock->work_mutex);