netdev - Re: REGRESSION: RIP: 0010:skb_release

Message-ID: <CAK8fFZ5j_T1NzoOEfqE1HYhAEhD04smR4OT2bnMEAr+2+6C5RQ@mail.gmail.com>
Date: Tue, 26 Mar 2024 14:25:53 +0100
From: Jaroslav Pulchart <jaroslav.pulchart@...ddata.com>
To: Jason Wang <jasowang@...hat.com>
Cc: Igor Raits <igor@...ddata.com>, Stefan Hajnoczi <stefanha@...hat.com>, kvm@...r.kernel.org, 
	virtualization@...ts.linux.dev, netdev@...r.kernel.org, 
	Stefano Garzarella <sgarzare@...hat.com>, "Michael S. Tsirkin" <mst@...hat.com>
Subject: Re: REGRESSION: RIP: 0010:skb_release_data+0xb8/0x1e0 in vhost/tun

>
> On Mon, Mar 25, 2024 at 4:44 PM Igor Raits <igor@...ddata.com> wrote:
> >
> > Hello,
> >
> > On Fri, Mar 22, 2024 at 12:19 PM Igor Raits <igor@...ddata.com> wrote:
> > >
> > > Hi Jason,
> > >
> > > On Fri, Mar 22, 2024 at 9:39 AM Igor Raits <igor@...ddata.com> wrote:
> > > >
> > > > Hi Jason,
> > > >
> > > > On Fri, Mar 22, 2024 at 6:31 AM Jason Wang <jasowang@...hat.com> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 5:44 PM Igor Raits <igor@...ddata.com> wrote:
> > > > > >
> > > > > > Hello Jason & others,
> > > > > >
> > > > > > On Wed, Mar 20, 2024 at 10:33 AM Jason Wang <jasowang@...hat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Mar 19, 2024 at 9:15 PM Igor Raits <igor@...ddata.com> wrote:
> > > > > > > >
> > > > > > > > Hello Stefan,
> > > > > > > >
> > > > > > > > On Tue, Mar 19, 2024 at 2:12 PM Stefan Hajnoczi <stefanha@...hat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Mar 19, 2024 at 10:00:08AM +0100, Igor Raits wrote:
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > > We have started to observe kernel crashes on 6.7.y kernels (so far we
> > > > > > > > > > > have hit the issue 5 times on 6.7.5 and 6.7.10). On 6.6.9, which we
> > > > > > > > > > > run on some nodes of the cluster, it looks stable. Please see the stack
> > > > > > > > > > > trace below. If you need more information, please let me know.
> > > > > > > > > >
> > > > > > > > > > We do not have a consistent reproducer but when we put some bigger
> > > > > > > > > > network load on a VM, the hypervisor's kernel crashes.
> > > > > > > > > >
> > > > > > > > > > Help is much appreciated! We are happy to test any patches.
> > > > > > > > >
> > > > > > > > > CCing Michael Tsirkin and Jason Wang for vhost_net.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [62254.167584] stack segment: 0000 [#1] PREEMPT SMP NOPTI
> > > > > > > > > > [62254.173450] CPU: 63 PID: 11939 Comm: vhost-11890 Tainted: G
> > > > > > > > > >    E      6.7.10-1.gdc.el9.x86_64 #1
> > > > > > > > >
> > > > > > > > > Are there any patches in this kernel?
> > > > > > > >
> > > > > > > > > Only one, unrelated to this part: removal of the pr_err("EEVDF scheduling
> > > > > > > > > fail, picking leftmost\n"); line (reported somewhere a few months ago;
> > > > > > > > > removing it was the suggested workaround until a proper solution comes).
> > > > > > >
> > > > > > > Btw, a bisection would help as well.
> > > > > >
> > > > > > > In the end it seems like we don't really have a "stable" setup, so
> > > > > > > bisection looks to be useless, but we did find a few things in the meantime:
> > > > > >
> > > > > > 1. On 6.6.9 it crashes either with unexpected GSO type or usercopy:
> > > > > > Kernel memory exposure attempt detected from SLUB object
> > > > > > 'skbuff_head_cache'
> > > > >
> > > > > Do you have a full calltrace for this?
> > > >
> > > > I have shared it in one of the messages in this thread.
> > > > https://marc.info/?l=linux-virtualization&m=171085443512001&w=2
> > > >
> > > > > > 2. On 6.7.5, 6.7.10 and 6.8.1 it crashes with RIP:
> > > > > > 0010:skb_release_data+0xb8/0x1e0
> > > > >
> > > > > And for this?
> > > >
> > > > https://marc.info/?l=linux-netdev&m=171083870801761&w=2
> > > >
> > > > > > 3. It does NOT crash on 6.8.1 when VM does not have multi-queue setup
> > > > > >
> > > > > > Looks like the multi-queue setup (we have 2 interfaces × 3 virtio
> > > > > > queues for each) is causing problems as if we set only one queue for
> > > > > > each interface the issue is gone.
> > > > > > Maybe there is some race condition in __pfx_vhost_task_fn+0x10/0x10 or
> > > > > > somewhere around?
> > > > >
> > > > > I can't tell now, but it seems not because if we have 3 queue pairs we
> > > > > will have 3 vhost threads.
> > > > >
> > > > > > We have noticed that there are 3 of such functions
> > > > > > in the stacktrace that gave us hints about what we could try…
> > > > >
> > > > > Let's try to enable SLUB_DEBUG and KASAN to see if we can get
> > > > > something interesting.
> > > >
> > > > We were able to reproduce it even with 1 vhost queue... And now we
> > > > have slub_debug + kasan so I hopefully have more useful data for you
> > > > now.
> > > > I have attached it for better readability.
> > >
> > > Looks like we have found a "stable" kernel and that is 6.1.32. The
> > > 6.3.y is broken and we are testing 6.2.y now.
> > > > My guess is that it is related to "virtio/vsock: replace virtio_vsock_pkt
> > > > with sk_buff", which was done around that time, but we are going to test,
> > > > bisect and let you know more.
> >
> > So we have been trying to bisect it but it is basically impossible for
> > us to do so as the ICE driver was quite broken for most of the release
> > cycle so we have no networking on 99% of the builds and we can't test
> > such a setup.
> > More specifically, the bug was introduced between 6.2 and 6.3 but we
> > could not get much further. The last good commit we were able to test
> > was f18f9845f2f10d3d1fc63e4ad16ee52d2d9292fa and then after 20 commits
> > where we had no networking we gave up.
> >
> > If you have some suspicious commit(s) we could revert - happy to test.
>
> Here is the list of changes since f18f9845f2f10d3d1fc63e4ad16ee52d2d9292fa:
>
> cbfbfe3aee71 tun: prevent negative ifindex
> b2f8323364ab tun: add __exit annotations to module exit func tun_cleanup()
> 6231e47b6fad tun: avoid high-order page allocation for packet header
> 4d016ae42efb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> 59eeb2329405 drivers: net: prevent tun_build_skb() to exceed the
> packet size limit
> 35b1b1fd9638 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> ce7c7fef1473 net: tun: change tun_alloc_skb() to allow bigger paged allocations
> 9bc3047374d5 net: tun_chr_open(): set sk_uid from current_fsuid()
> 82b2bc279467 tun: Fix memory leak for detached NAPI queue.
> 6e98b09da931 Merge tag 'net-next-6.4' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> de4f5fed3f23 iov_iter: add iter_iovec() helper
> 438b406055cd tun: flag the device as supporting FMODE_NOWAIT
> de4287336794 Daniel Borkmann says:
> a096ccca6e50 tun: tun_chr_open(): correctly initialize socket uid
> 66c0e13ad236 drivers: net: turn on XDP features
>
> The commits that touch the datapath are:
>
> 6231e47b6fad tun: avoid high-order page allocation for packet header
> 59eeb2329405 drivers: net: prevent tun_build_skb() to exceed the
> packet size limit
> ce7c7fef1473 net: tun: change tun_alloc_skb() to allow bigger paged allocations
> 82b2bc279467 tun: Fix memory leak for detached NAPI queue.
> de4f5fed3f23 iov_iter: add iter_iovec() helper
>
> I assume you didn't use NAPI mode, so 82b2bc279467 (tun: Fix memory
> leak for detached NAPI queue) is not relevant here.
>
> Any of the rest could be the bad commit, if the bug is caused by a
> change in tun itself.
>
> Btw, I vaguely remember that KASAN reports who did the allocation and
> who did the free. But that does not seem to be in your KASAN log.
>
> Thanks
>
> >
> > Thanks again.
> >
>

Hello,

We have one observation: whether the error occurs depends on the
ring buffer size of the physical network cards. We have two Intel E810
cards whose two interfaces (em1 + p3p1, ice driver) are bonded into a
single bond0. The bond0 is then connected to the VMs via tun interfaces,
using either a Linux bridge or OVS (both switching solutions show the
same problem). The VMs are qemu-kvm instances using vhost/virtio-net.

We see:
1/ The issue is triggered almost instantaneously when the tx/rx ring
buffer is set to 2048 (our default):
ethtool -G em1 rx 2048 tx 2048
ethtool -G p3p1 rx 2048 tx 2048

2/ A similar issue is triggered when the tx/rx ring buffer is set to
4096: the host does not crash immediately, but a trace is shown soon
afterwards, and later the host comes under memory pressure and crashes.
ethtool -G em1 rx 4096 tx 4096
ethtool -G p3p1 rx 4096 tx 4096
See the attached ring_4096.kasan.txt (vanilla 6.8.1 with KASAN enabled)
and ring_4096.txt (vanilla 6.8.1 without KASAN).

3/ The system is stable, or we just cannot trigger the issue, if the
ring buffer is >= 6144:
ethtool -G em1 rx 7120 tx 7120
ethtool -G p3p1 rx 7120 tx 7120

Could it be influenced by the rate of dropped packets in the ring buffer?

# for i in em1 p3p1; do ethtool -S ${i} | grep dropped.nic; done
     rx_dropped.nic: 158225
     rx_dropped.nic: 74285
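
The counters above are cumulative totals, so they do not show a rate by
themselves. A minimal sketch of how the drop rate could be sampled is
below. The helper names rx_dropped_nic and drop_rate and the 10-second
interval are hypothetical and only for illustration; the parsing assumes
the "rx_dropped.nic: N" line format shown above.

```shell
# Hypothetical helper: read `ethtool -S` output on stdin and print the
# value of the rx_dropped.nic counter (format assumed from the output above).
rx_dropped_nic() {
    awk '$1 == "rx_dropped.nic:" { print $2; exit }'
}

# Hypothetical helper: integer drops per second between two samples.
# Usage: drop_rate BEFORE AFTER SECONDS
drop_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Usage on the host (interface names from this report, interval arbitrary):
# for i in em1 p3p1; do
#     a=$(ethtool -S "$i" | rx_dropped_nic); sleep 10
#     b=$(ethtool -S "$i" | rx_dropped_nic)
#     echo "$i: $(drop_rate "$a" "$b" 10) drops/s"
# done
```

Comparing this rate at ring sizes 2048, 4096 and 7120 would show whether
the crashes correlate with drops at the NIC.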

Best,
Jaroslav Pulchart

View attachment "ring_4096.kasan.txt" of type "text/plain" (4877 bytes)

View attachment "ring_4096.txt" of type "text/plain" (2335 bytes)