lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20251102032105.772670-1-marmarek@invisiblethingslab.com>
Date: Sun,  2 Nov 2025 04:20:12 +0100
From: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
To: linux-kernel@...r.kernel.org
Cc: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>,
	Juergen Gross <jgross@...e.com>,
	Stefano Stabellini <sstabellini@...nel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>,
	Peng Jiang <jiang.peng9@....com.cn>,
	Qiu-ji Chen <chenqiuji666@...il.com>,
	Jason Andryuk <jason.andryuk@....com>,
	xen-devel@...ts.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Subject: [PATCH] xen/xenbus: better handle backend crash

When the backend domain crashes, coordinated device cleanup is not
possible (as it involves waiting for the backend state change). In that
case, toolstack forcefully removes frontend xenstore entries.
xenbus_dev_changed() handles this case, and triggers device cleanup.
It's possible that toolstack manages to connect new device in that
place, before xenbus_dev_changed() notices the old one is missing. If
that happens, new one won't be probed and will forever remain in
XenbusStateInitialising.

Fix this by checking backend-id and if it changes, consider it
unplug+plug operation. It's important that cleanup on such unplug
doesn't modify xenstore entries (especially the "state" key) as it
belong to the new device to be probed - changing it would derail
establishing connection to the new backend (most likely, closing the
device before it was even connected). Handle this case by setting new
xenbus_device->vanished flag to true, and check it before changing state
entry.

And even if xenbus_dev_changed() correctly detects the device was
forcefully removed, the cleanup handling is still racy. Since this whole
handling doesn't happend in a single xenstore transaction, it's possible
that toolstack might put a new device there already. Avoid re-creating
the state key (which in the case of loosing the race would actually
close newly attached device).

The problem does not apply to frontend domain crash, as this case
involves coordinated cleanup.

Problem originally reported at
https://lore.kernel.org/xen-devel/aOZvivyZ9YhVWDLN@mail-itl/T/#t,
including reproduction steps.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
---
I considered re-using one of existing fields instead of a new
xenbus_device->vanished, but I wasn't sure if that would work better.
Setting xenbus_device->nodename to NULL would prevent few other places
using it (including some log messages). Setting xenbus_device->otherend
might have less unintentional impact, but logically it doesn't feel
correct.

With this patch applied, I cannot reproduce the issue anymore - neither
with the simplified reproducer script, nor with the full test suite.
---
 drivers/xen/xenbus/xenbus_client.c |  2 ++
 drivers/xen/xenbus/xenbus_probe.c  | 25 +++++++++++++++++++++++++
 include/xen/xenbus.h               |  1 +
 3 files changed, 28 insertions(+)

diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index e73ec225d4a61..ce2f49d9aa4ad 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -275,6 +275,8 @@ __xenbus_switch_state(struct xenbus_device *dev,
  */
 int xenbus_switch_state(struct xenbus_device *dev, enum xenbus_state state)
 {
+	if (dev->vanished)
+		return 0;
 	return __xenbus_switch_state(dev, state, 0);
 }
 
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 86fe6e7790566..3c3e56b544976 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -444,6 +444,9 @@ static void xenbus_cleanup_devices(const char *path, struct bus_type *bus)
 		info.dev = NULL;
 		bus_for_each_dev(bus, NULL, &info, cleanup_dev);
 		if (info.dev) {
+			dev_warn(&info.dev->dev,
+			         "device forcefully removed from xenstore\n");
+			info.dev->vanished = true;
 			device_unregister(&info.dev->dev);
 			put_device(&info.dev->dev);
 		}
@@ -659,6 +662,28 @@ void xenbus_dev_changed(const char *node, struct xen_bus_type *bus)
 		return;
 
 	dev = xenbus_device_find(root, &bus->bus);
+	/* Backend domain crash results in not coordinated frontend removal,
+	 * without going through XenbusStateClosing. Check if the device
+	 * wasn't replaced to point at another backend in the meantime.
+	 */
+	if (dev && !strncmp(node, "device/", sizeof("device/")-1)) {
+		int backend_id;
+		int err = xenbus_gather(XBT_NIL, root,
+				        "backend-id", "%i", &backend_id,
+					NULL);
+		if (!err && backend_id != dev->otherend_id) {
+			/* It isn't the same device, assume the old one
+			 * vanished and new one needs to be probed.
+			 */
+			dev_warn(&dev->dev,
+				 "backend-id mismatch (%d != %d), reconnecting\n",
+				 backend_id, dev->otherend_id);
+			dev->vanished = true;
+			device_unregister(&dev->dev);
+			put_device(&dev->dev);
+			dev = NULL;
+		}
+	}
 	if (!dev)
 		xenbus_probe_node(bus, type, root);
 	else
diff --git a/include/xen/xenbus.h b/include/xen/xenbus.h
index 7dab04cf4a36c..43a5335f1d5a3 100644
--- a/include/xen/xenbus.h
+++ b/include/xen/xenbus.h
@@ -87,6 +87,7 @@ struct xenbus_device {
 	struct completion down;
 	struct work_struct work;
 	struct semaphore reclaim_sem;
+	bool vanished;
 
 	/* Event channel based statistics and settings. */
 	atomic_t event_channels;
-- 
2.51.0


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ