Message-ID: <e6ab32d7-b1eb-428b-95e8-a90f7b3be39c@suse.com>
Date: Mon, 17 Nov 2025 12:06:47 +0100
From: Jürgen Groß <jgross@...e.com>
To: Marek Marczykowski-Górecki
<marmarek@...isiblethingslab.com>, linux-kernel@...r.kernel.org
Cc: Stefano Stabellini <sstabellini@...nel.org>,
Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>,
Peng Jiang <jiang.peng9@....com.cn>, Qiu-ji Chen <chenqiuji666@...il.com>,
Jason Andryuk <jason.andryuk@....com>,
"moderated list:XEN HYPERVISOR INTERFACE" <xen-devel@...ts.xenproject.org>
Subject: Re: [PATCH] xen/xenbus: better handle backend crash
On 02.11.25 04:20, Marek Marczykowski-Górecki wrote:
> When the backend domain crashes, coordinated device cleanup is not
> possible (as it involves waiting for the backend state change). In that
> case, toolstack forcefully removes frontend xenstore entries.
> xenbus_dev_changed() handles this case, and triggers device cleanup.
> It's possible that the toolstack manages to connect a new device in that
> place before xenbus_dev_changed() notices the old one is missing. If
> that happens, the new one won't be probed and will remain in
> XenbusStateInitialising forever.
>
> Fix this by checking the backend-id and, if it changed, treating it as an
> unplug+plug operation. It's important that cleanup on such an unplug
> doesn't modify xenstore entries (especially the "state" key), as they
> belong to the new device to be probed - changing them would derail
> establishing the connection to the new backend (most likely closing the
> device before it was even connected). Handle this case by setting the new
> xenbus_device->vanished flag to true, and checking it before changing the
> state entry.
>
> And even if xenbus_dev_changed() correctly detects that the device was
> forcefully removed, the cleanup handling is still racy. Since this whole
> handling doesn't happen in a single xenstore transaction, it's possible
> that the toolstack has already put a new device there. Avoid re-creating
> the state key (which, in the case of losing the race, would actually
> close the newly attached device).
>
> The problem does not apply to frontend domain crash, as this case
> involves coordinated cleanup.
>
> Problem originally reported at
> https://lore.kernel.org/xen-devel/aOZvivyZ9YhVWDLN@mail-itl/T/#t,
> including reproduction steps.
>
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@...isiblethingslab.com>
Sorry I didn't get to this earlier.

My main problem with this patch is that it is basically just papering over
a more general problem.

You are just making the problem much less probable, not impossible. In case
the new driver domain has the same domid as the old one, you can still hit
the same race.
The clean way to handle this would be to add a unique id in Xenstore to each
device on the backend side, which the frontend side can then test for a
match. In case it doesn't match, an old device with the same kind and devid
can be cleaned up.
The unique id would obviously need to be set by the Xen tools inside the
transaction writing the initial backend Xenstore nodes, as doing that from
the backend would add another potential ambiguity: the driver domain could
choose the same unique id as the previous one did.
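For illustration, the resulting backend layout could look like this ("unique-id" as a node name and the concrete values are assumptions for this sketch, not an existing interface):

```
/local/domain/<backend-domid>/backend/vif/<frontend-domid>/0/
    state = "1"
    frontend = "/local/domain/<frontend-domid>/device/vif/0"
    unique-id = "c0ffee-0001"   # written by the toolstack in the same
                                # transaction as the other backend nodes
```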
The question is whether something like your patch should be used as a
fallback in case there is no unique id on the backend side of the device
because the Xen version is too old.

Juergen