netdev - Re: virtio_close() stuck on napi_disable

Message-ID: <CACGkMEsnKwYqRi_=s4Uy8x5b2M8WXXzmPV3tOf1Qh-7MG-KNDQ@mail.gmail.com>
Date: Wed, 23 Jul 2025 13:14:38 +0800
From: Jason Wang <jasowang@...hat.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Paolo Abeni <pabeni@...hat.com>, Zigit Zo <zuozhijie@...edance.com>, 
	"Michael S. Tsirkin" <mst@...hat.com>, Xuan Zhuo <xuanzhuo@...ux.alibaba.com>, 
	Eugenio Pérez <eperezma@...hat.com>, 
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: virtio_close() stuck on napi_disable_locked()

On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@...nel.org> wrote:
>
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> > Hi,
> >
> > The NIPA CI is reporting some hangs in the stats.py test caused by the
> > virtio_net driver getting stuck at close time.
> >
> > A sample splat is available here:
> >
> > https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >
> > AFAICS the issue happens only on debug builds.
> >
> > My wild guess is something similar to the issue addressed by
> > commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> > but I could not spot anything obvious.
> >
> > Could you please have a look?
>
> It only hits in around 1 in 5 runs.

I tried to reproduce this locally but failed. Where can I see the qemu
command line for the VM?

> Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged.

Probably, but I don't see a direct connection with that commit. It
looks like the root cause is napi_disable() looping forever for some
reason, as Paolo said.

> It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.
>

Thanks