Message-ID: <3b92595e-c426-4b90-8905-8ba75e7f722a@redhat.com>
Date: Thu, 24 Jul 2025 13:29:03 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Jason Wang <jasowang@...hat.com>
Cc: Jakub Kicinski <kuba@...nel.org>, Zigit Zo <zuozhijie@...edance.com>,
"Michael S. Tsirkin" <mst@...hat.com>, Xuan Zhuo
<xuanzhuo@...ux.alibaba.com>, Eugenio Pérez
<eperezma@...hat.com>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: virtio_close() stuck on napi_disable_locked()

On 7/24/25 12:53 PM, Jason Wang wrote:
> On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@...hat.com> wrote:
>> On 7/23/25 7:14 AM, Jason Wang wrote:
>>> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@...nel.org> wrote:
>>>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>>>> virtio_net driver getting stuck at close time.
>>>>>
>>>>> A sample splat is available here:
>>>>>
>>>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>>>
>>>>> AFAICS the issue happens only on debug builds.
>>>>>
>>>>> I'm wildly guessing at something similar to the issue addressed by
>>>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>>>> but I could not spot anything obvious.
>>>>>
>>>>> Could you please have a look?
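
As extra context on the mechanics: napi_disable() and napi_disable_locked()
take ownership of the NAPI instance, leaving NAPI_STATE_SCHED set until the
matching napi_enable(). A second disable of the same instance without an
intervening enable therefore spins forever. A purely illustrative sketch of
that anti-pattern (hypothetical code, not the actual virtio_net paths):

#include <linux/netdevice.h>

/*
 * Hypothetical example, not taken from virtio_net: the generic way a
 * driver ends up stuck in napi_disable().
 */
static void buggy_teardown(struct napi_struct *napi)
{
	napi_disable(napi);	/* owns the NAPI: NAPI_STATE_SCHED stays set */

	/*
	 * If a racing path (e.g. deferred work) also disables this NAPI,
	 * or the first disable is never balanced by napi_enable()...
	 */
	napi_disable(napi);	/* ...this spins in usleep_range() forever */
}
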
>>>>
>>>> It only hits in around 1 in 5 runs.
>>>
>>> I tried to reproduce this locally but failed. Where can I see the qemu
>>> command line for the VM?
>>>
>>>> Likely some pre-existing race, but
>>>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>>>> recursived rtnl_lock() during probe()") was merged.
>>>
>>> Probably, but I didn't see a direct connection with that commit. It
>>> looks like the root cause is napi_disable() looping forever for some
>>> reason, as Paolo said.
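
To make the "looping forever" part concrete, napi_disable_locked() waits
for the scheduling bits to clear before claiming the NAPI, roughly like
the paraphrased sketch below (simplified from memory, not the verbatim
net/core/dev.c source):

void napi_disable_locked(struct napi_struct *n)
{
	unsigned long val, new;

	set_bit(NAPI_STATE_DISABLE, &n->state);

	val = READ_ONCE(n->state);
	do {
		/* sleep-wait until nobody owns the NAPI... */
		while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
			usleep_range(20, 200);
			val = READ_ONCE(n->state);
		}
		/* ...then take ownership ourselves */
		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
	} while (!try_cmpxchg(&n->state, &val, new));
}

If NAPI_STATE_SCHED never clears, because a poll never completes or
because of an unbalanced disable, the inner while() above is exactly
where the splat shows the close path sleeping.
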
>>>
>>>> It never hit before.
>>>> If we can't find a quick fix I think we should revert be5dcaed694e for
>>>> now, so that it doesn't end up regressing 6.16 final.
>>
>> I tried hard to reproduce the issue locally, to validate a possible
>> revert before pushing it, but so far I have failed quite miserably.
>>
>
> I've also tried to follow the NIPA instructions to set up 2 virtio
> devices and connect the relevant taps to a bridge on the host, but
> after several hours I have failed to reproduce it locally.
>
> Is there a log of the NIPA test execution from which we can learn more
> information, like:
>
> 1) full qemu command line
I guess it could depend on the vng version; here I'm getting:
qemu-system-x86_64 -name virtme-ng -m 1G
    -chardev socket,id=charvirtfs5,path=/tmp/virtmebyfqshp5
    -device vhost-user-fs-device,chardev=charvirtfs5,tag=ROOTFS
    -object memory-backend-memfd,id=mem,size=1G,share=on
    -numa node,memdev=mem
    -machine accel=kvm:tcg -M microvm,accel=kvm,pcie=on,rtc=on -cpu host
    -parallel none -net none -echr 1
    -chardev file,path=/proc/self/fd/2,id=dmesg
    -device virtio-serial-device -device virtconsole,chardev=dmesg
    -chardev stdio,id=console,signal=off,mux=on
    -serial chardev:console -mon chardev=console -vga none -display none
    -smp 4 -kernel ./arch/x86/boot/bzImage
    -append virtme_hostname=virtme-ng nr_open=1073741816
        virtme_link_mods=/data/net-next/.virtme_mods/lib/modules/0.0.0
        virtme_rw_overlay0=/etc virtme_rw_overlay1=/lib
        virtme_rw_overlay2=/home virtme_rw_overlay3=/opt
        virtme_rw_overlay4=/srv virtme_rw_overlay5=/usr
        virtme_rw_overlay6=/var virtme_rw_overlay7=/tmp console=hvc0
        earlyprintk=serial,ttyS0,115200 virtme_console=ttyS0
        psmouse.proto=exps "virtme_stty_con=rows 32 cols 136 iutf8"
        TERM=xterm-256color virtme_chdir=data/net-next virtme_root_user=1
        rootfstype=virtiofs root=ROOTFS raid=noautodetect ro debug
        init=/usr/lib/python3.13/site-packages/virtme/guest/bin/virtme-ng-init
    -device virtio-net-pci,netdev=n0,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
    -netdev tap,id=n0,ifname=tap0,vhost=on,script=no,downscript=no,queues=8
    -device virtio-net-pci,netdev=n1,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
    -netdev tap,id=n1,ifname=tap1,vhost=on,script=no,downscript=no,queues=8

I guess the significant part is '-smp 4 -m 1G'. The networking bits are
taken verbatim from the wiki configuration.
> 2) host kernel version
I can give a reasonably sure answer only to this point: the kernel is the
'current' net-next tree, with the 'current' net tree merged in, plus all
the patches currently pending on patchwork.
For a given test iteration, NIPA provides a snapshot of the patches merged
in; e.g. for the tests run on 2025/07/24 at 00:00, see:
https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-hw-2025-07-24--00-00.html
> 3) Qemu version
Should be a stock, recent Ubuntu build.
/P