Message-ID: <20250725024454.690517-1-kuba@kernel.org>
Date: Thu, 24 Jul 2025 19:44:54 -0700
From: Jakub Kicinski <kuba@...nel.org>
To: davem@...emloft.net
Cc: netdev@...r.kernel.org,
	edumazet@...gle.com,
	pabeni@...hat.com,
	andrew+netdev@...n.ch,
	horms@...nel.org,
	Jakub Kicinski <kuba@...nel.org>,
	Jason Wang <jasowang@...hat.com>,
	Zigit Zo <zuozhijie@...edance.com>,
	"Michael S. Tsirkin" <mst@...hat.com>,
	Xuan Zhuo <xuanzhuo@...ux.alibaba.com>,
	Eugenio Pérez <eperezma@...hat.com>,
	leitao@...ian.org,
	sdf@...ichev.me
Subject: [PATCH net] netpoll: prevent hanging NAPI when netcons gets enabled

Paolo spotted hangs in NIPA when running driver tests against virtio.
The tests hang in virtnet_close() -> virtnet_napi_tx_disable().

The problem is only reproducible when running several of our tests
in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \
netpoll_basic.py stats.py"). The initial suspicion was that this is
a simple case of double-disabling NAPI, but instrumenting the code
reveals:

 Deadlocked on NAPI ffff888007cd82c0 (virtnet_poll_tx):
   state: 0x37, disabled: false, owner: 0, listed: false, weight: 64

The NAPI was not in fact disabled, and its owner is 0 (rather than
-1), so the NAPI "thinks" it's scheduled for CPU 0, yet it is not
listed (list_empty(&n->poll_list) is true). It seems odd that normal
NAPI processing would wedge itself like this.
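
For reference, the instrumentation was along these lines (a rough
sketch, not the exact debug patch; napi_hang_dump() is a made-up
name, but the fields are the ones in struct napi_struct that back
the dump above):

	static void napi_hang_dump(struct napi_struct *n)
	{
		pr_err("Deadlocked on NAPI %px (%ps):\n", n, n->poll);
		pr_err("  state: 0x%lx, owner: %d, listed: %s, weight: %d\n",
		       n->state,
		       /* CPU the NAPI was scheduled on, -1 when idle */
		       READ_ONCE(n->list_owner),
		       list_empty(&n->poll_list) ? "false" : "true",
		       n->weight);
	}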

My suspicion is that netpoll gets enabled while NAPI is polling and
also grabs the NAPI instance. This confuses napi_complete_done():

  [netpoll]                                   [normal NAPI]
                                        napi_poll()
                                          have = netpoll_poll_lock()
                                            rcu_access_pointer(dev->npinfo)
                                              return NULL # no netpoll
                                          __napi_poll()
                                            ->poll(->weight)
  poll_napi()
    cmpxchg(->poll_owner, -1, cpu)
      poll_one_napi()
        set_bit(NAPI_STATE_NPSVC, ->state)
                                              napi_complete_done()
                                                if (NAPIF_STATE_NPSVC)
                                                  return false
                                           # exit without clearing SCHED

This seems very unlikely, but perhaps virtio has some interactions
with the hypervisor in the NAPI ->poll callback that make the race
window large?
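
For context, the early return in napi_complete_done() is roughly
(paraphrasing net/core/dev.c):

	if (unlikely(n->state & (NAPIF_STATE_NPSVC |
				 NAPIF_STATE_IN_BUSY_POLL)))
		return false;

Once netpoll sets NPSVC, the normal poller's completion becomes a
no-op: SCHED stays set while the NAPI sits on no poll list, so
nothing will ever poll it again or clear SCHED, and a later
napi_disable() (e.g. from virtnet_close()) spins forever.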

The best I could do to prove the theory was to add the following
warning in napi_poll() (just before netpoll_poll_unlock()) and
trigger it:

      WARN_ONCE(!have && rcu_access_pointer(n->dev->npinfo) &&
                napi_is_scheduled(n) && list_empty(&n->poll_list),
                "NAPI race with netpoll %px", n);

If this warning hits, the next virtnet_close() will hang.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Paolo Abeni <pabeni@...hat.com>
Link: https://lore.kernel.org/c5a93ed1-9abe-4880-a3bb-8d1678018b1d@redhat.com
Signed-off-by: Jakub Kicinski <kuba@...nel.org>
---
Looks like this is not a new bug; rather, Breno's tests now put
enough pressure on netpoll + virtio to trigger it.
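
The fix works because netpoll_poll_lock() keys off dev->npinfo;
paraphrasing include/linux/netpoll.h:

	static inline void *netpoll_poll_lock(struct napi_struct *napi)
	{
		struct net_device *dev = napi->dev;

		if (dev && rcu_access_pointer(dev->npinfo)) {
			int owner = smp_processor_id();

			while (cmpxchg(&napi->poll_owner, -1, owner) != -1)
				cpu_relax();

			return napi;
		}
		return NULL;
	}

A NAPI poll which sampled npinfo as NULL runs with no netpoll lock
held, so after netpoll_setup() publishes npinfo we have to wait for
such pollers to finish (the synchronize_rcu() in the patch) before
letting netconsole inject netpoll polls.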

CC: Jason Wang <jasowang@...hat.com>
CC: Zigit Zo <zuozhijie@...edance.com>
CC: "Michael S. Tsirkin" <mst@...hat.com>
CC: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
CC: Eugenio Pérez <eperezma@...hat.com>

CC: leitao@...ian.org
CC: sdf@...ichev.me
---
 drivers/net/netconsole.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index e3722de08ea9..9bc748ff5752 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -256,6 +256,24 @@ static struct netconsole_target *alloc_and_init(void)
 	return nt;
 }
 
+static int netconsole_setup_and_enable(struct netconsole_target *nt)
+{
+	int ret;
+
+	ret = netpoll_setup(&nt->np);
+	if (ret)
+		return ret;
+
+	/* Make sure all NAPI polls which started before dev->npinfo
+	 * was visible have exited before we start calling NAPI poll.
+	 * NAPI skips locking if dev->npinfo is NULL.
+	 */
+	synchronize_rcu();
+	nt->enabled = true;
+
+	return 0;
+}
+
 /* Clean up every target in the cleanup_list and move the clean targets back to
  * the main target_list.
  */
@@ -574,11 +592,10 @@ static ssize_t enabled_store(struct config_item *item,
 		 */
 		netconsole_print_banner(&nt->np);
 
-		ret = netpoll_setup(&nt->np);
+		ret = netconsole_setup_and_enable(nt);
 		if (ret)
 			goto out_unlock;
 
-		nt->enabled = true;
 		pr_info("network logging started\n");
 	} else {	/* false */
 		/* We need to disable the netconsole before cleaning it up
@@ -1889,7 +1906,7 @@ static struct netconsole_target *alloc_param_target(char *target_config,
 	if (err)
 		goto fail;
 
-	err = netpoll_setup(&nt->np);
+	err = netconsole_setup_and_enable(nt);
 	if (err) {
 		pr_err("Not enabling netconsole for %s%d. Netpoll setup failed\n",
 		       NETCONSOLE_PARAM_TARGET_PREFIX, cmdline_count);
@@ -1898,8 +1915,6 @@ static struct netconsole_target *alloc_param_target(char *target_config,
 			 * otherwise, keep the target in the list, but disabled.
 			 */
 			goto fail;
-	} else {
-		nt->enabled = true;
 	}
 	populate_configfs_item(nt, cmdline_count);
 
-- 
2.50.1

