Message-ID: <CAGxU2F5Qy=vMD0z9_HTN2K9wyt+6EH-Yr0N9VqR4OT4O1asqZg@mail.gmail.com>
Date: Thu, 24 Jul 2025 14:52:08 +0200
From: Stefano Garzarella <sgarzare@...hat.com>
To: Breno Leitao <leitao@...ian.org>
Cc: Will Deacon <will@...nel.org>, "Michael S. Tsirkin" <mst@...hat.com>, jasowang@...hat.com, 
	eperezma@...hat.com, linux-arm-kernel@...ts.infradead.org, 
	kvm@...r.kernel.org, Stefan Hajnoczi <stefanha@...hat.com>, netdev@...r.kernel.org
Subject: Re: vhost: linux-next: crash at vhost_dev_cleanup()

On Thu, 24 Jul 2025 at 14:48, Breno Leitao <leitao@...ian.org> wrote:
>
> On Thu, Jul 24, 2025 at 09:44:38AM +0100, Will Deacon wrote:
> > > > On Thu, 24 Jul 2025 at 09:48, Michael S. Tsirkin <mst@...hat.com> wrote:
> > > > >
> > > > > On Wed, Jul 23, 2025 at 08:04:42AM -0700, Breno Leitao wrote:
> > > > > > Hello,
> > > > > >
> > > > > > I've seen a crash in linux-next for a while on my arm64 server, and
> > > > > > I decided to report it.
> > > > > >
> > > > > > While running stress-ng on linux-next, I see the crash below.
> > > > > >
> > > > > > This is happening on a kernel configured with some debug options
> > > > > > (KASAN, LOCKDEP and KMEMLEAK).
> > > > > >
> > > > > > Basically running stress-ng in a loop would crash the host in 15-20
> > > > > > minutes:
> > > > > >       # while (true); do stress-ng -r 10 -t 10; done
> > > > > >
> > > > > > From the early warning "virt_to_phys used for non-linear address",
> > > >
> > > > mmm, we recently added nonlinear SKBs support in vhost-vsock [1],
> > > > @Will can this issue be related?
> > >
> > > Good point.
> > >
> > > Breno, if bisecting is too much trouble, would you mind testing the commits
> > > c76f3c4364fe523cd2782269eab92529c86217aa
> > > and
> > > c7991b44d7b44f9270dec63acd0b2965d29aab43
> > > and telling us if this reproduces?
> >
> > That's definitely worth doing, but we should be careful not to confuse
> > the "non-linear address" from the warning (which refers to virtual
> > addresses that lie outside of the linear mapping of memory, e.g. in the
> > vmalloc space) and "non-linear SKBs" which refer to SKBs with fragment
> > pages.
>
> I've tested both commits above and see the crash on both, so the
> problem reproduces in both cases. The only difference I noted is that
> I haven't seen the warning before the crash.
>
>
> Log against c76f3c4364fe ("vhost/vsock: Avoid allocating
> arbitrarily-sized SKBs")
>
>          Unable to handle kernel paging request at virtual address 0000001fc0000048
>          Mem abort info:
>            ESR = 0x0000000096000005
>            EC = 0x25: DABT (current EL), IL = 32 bits
>            SET = 0, FnV = 0
>            EA = 0, S1PTW = 0
>            FSC = 0x05: level 1 translation fault
>          Data abort info:
>            ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>            CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>            GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>          user pgtable: 64k pages, 48-bit VAs, pgdp=0000000cdcf2da00
>          [0000001fc0000048] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
>          Internal error: Oops: 0000000096000005 [#1]  SMP
>          Modules linked in: vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic unix_diag vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf nvidia_c
>          CPU: 34 UID: 0 PID: 1727297 Comm: stress-ng-dev Kdump: loaded Not tainted 6.16.0-rc6-upstream-00027-gc76f3c4364fe #19 NONE
>          pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>          pc : kfree+0x48/0x2a8
>          lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
>          sp : ffff80013a0cfcd0
>          x29: ffff80013a0cfcd0 x28: ffff0008fd0b6240 x27: 0000000000000000
>          x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
>          x23: 00000000040e001f x22: ffffffffffffffff x21: ffff00014f1d4ac0
>          x20: 0000000000000001 x19: ffff00014f1d0000 x18: 0000000000000000
>          x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>          x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
>          x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
>          x8 : 0000001fc0000040 x7 : 0000000000000000 x6 : 0000000000000000
>          x5 : ffff000141931840 x4 : 0000000000000000 x3 : 0000000000000008
>          x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 0000000000010000
>          Call trace:
>           kfree+0x48/0x2a8 (P)
>           vhost_dev_cleanup+0x138/0x2b8 [vhost]
>           vhost_net_release+0xa0/0x1a8 [vhost_net]

But here it is vhost_net, so I'm confused now.
Do you see the same (vhost_net) also on 9798752 ("Add linux-next
specific files for 20250721") ?

The initial report contained only vhost_vsock traces, IIUC, so I
suspect something in the vhost core.

Thanks,
Stefano

>           __fput+0xfc/0x2f0
>           fput_close_sync+0x38/0xc8
>           __arm64_sys_close+0xb4/0x108
>           invoke_syscall+0x4c/0xd0
>           do_el0_svc+0x80/0xb0
>           el0_svc+0x3c/0xd0
>           el0t_64_sync_handler+0x70/0x100
>           el0t_64_sync+0x170/0x178
>          Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509)
>
> Log against c7991b44d7b4 ("vsock/virtio: Allocate nonlinear SKBs for
> handling large transmit buffers")
>
>         Unable to handle kernel paging request at virtual address 0010502f8f8f4f08
>         Mem abort info:
>           ESR = 0x0000000096000004
>           EC = 0x25: DABT (current EL), IL = 32 bits
>           SET = 0, FnV = 0
>           EA = 0, S1PTW = 0
>           FSC = 0x04: level 0 translation fault
>         Data abort info:
>           ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
>           CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>           GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>         [0010502f8f8f4f08] address between user and kernel address ranges
>         Internal error: Oops: 0000000096000004 [#1]  SMP
>         Modules linked in: vhost_vsock vfio_iommu_type1 vfio md4 crc32_cryptoapi ghash_generic vhost_net tun vhost vhost_iotlb tap mpls_gso mpls_iptunnel mpls_router fou sch_fq ghes_edac tls tcp_diag inet_diag act_gact cls_bpf ipmi_s
>         CPU: 47 UID: 0 PID: 1239699 Comm: stress-ng-dev Kdump: loaded Tainted: G        W           6.16.0-rc6-upstream-00035-gc7991b44d7b4 #18 NONE
>         Tainted: [W]=WARN
>         pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>         pc : kfree+0x48/0x2a8
>         lr : vhost_dev_cleanup+0x138/0x2b8 [vhost]
>         sp : ffff80016c0cfcd0
>         x29: ffff80016c0cfcd0 x28: ffff001ad6210d80 x27: 0000000000000000
>         x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
>         x23: 00000000040e001f x22: ffffffffffffffff x21: ffff001bb76f00c0
>         x20: 0000000000000000 x19: ffff001bb76f0000 x18: 0000000000000000
>         x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
>         x14: 000000000000001f x13: 000000000000000f x12: 0000000000000001
>         x11: 0000000000000000 x10: 0000000000000402 x9 : ffffffdfc0000000
>         x8 : 0010502f8f8f4f00 x7 : 0000000000000000 x6 : 0000000000000000
>         x5 : ffff00012e7e2128 x4 : 0000000000000000 x3 : 0000000000000008
>         x2 : ffffffffffffffff x1 : ffffffffffffffff x0 : 41403f3e3d3c3b3a
>         Call trace:
>          kfree+0x48/0x2a8 (P)
>          vhost_dev_cleanup+0x138/0x2b8 [vhost]
>          vhost_net_release+0xa0/0x1a8 [vhost_net]
>          __fput+0xfc/0x2f0
>          fput_close_sync+0x38/0xc8
>          __arm64_sys_close+0xb4/0x108
>          invoke_syscall+0x4c/0xd0
>          do_el0_svc+0x80/0xb0
>          el0_svc+0x3c/0xd0
>          el0t_64_sync_handler+0x70/0x100
>          el0t_64_sync+0x170/0x178
>         Code: 8b080008 f2dffbe9 d350fd08 8b081928 (f9400509)
>
>
> > Breno -- when you say you've been seeing this "for a while", what's the
> > earliest kernel you know you saw it on?
>
> Looking at my logs, the oldest kernel I saw it on was net-next from
> 20250717, which was around the time I decided to test net-next in
> preparation for 6.17, so not very helpful. Sorry.
>

