[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACePvbVs8haJQSNjpfc35GYZB5Bk7svLup4ddJwq40bNz8a8Vw@mail.gmail.com>
Date: Thu, 2 Oct 2025 23:57:16 -0700
From: Chris Li <chrisl@...nel.org>
To: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, "Rafael J. Wysocki" <rafael@...nel.org>,
Danilo Krummrich <dakr@...nel.org>, Len Brown <lenb@...nel.org>,
Pasha Tatashin <pasha.tatashin@...een.com>, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org, linux-acpi@...r.kernel.org,
David Matlack <dmatlack@...gle.com>, Pasha Tatashin <tatashin@...gle.com>,
Jason Miu <jasonmiu@...gle.com>, Vipin Sharma <vipinsh@...gle.com>,
Saeed Mahameed <saeedm@...dia.com>, Adithya Jayachandran <ajayachandra@...dia.com>,
Parav Pandit <parav@...dia.com>, William Tu <witu@...dia.com>, Mike Rapoport <rppt@...nel.org>,
Jason Gunthorpe <jgg@...pe.ca>, Leon Romanovsky <leon@...nel.org>
Subject: Re: [PATCH v2 02/10] PCI/LUO: Create requested liveupdate device list
On Tue, Sep 30, 2025 at 8:30 AM Greg Kroah-Hartman
<gregkh@...uxfoundation.org> wrote:
>
> On Tue, Sep 16, 2025 at 12:45:10AM -0700, Chris Li wrote:
> > #define pr_fmt(fmt) "PCI liveupdate: " fmt
> > +#define dev_fmt(fmt) "PCI liveupdate: " fmt
>
> Please no. Use the default dev_ formatting so that people can correct
> track the devices spitting out messages here.
Ack. I received another feedback request to add this private dev_fmt
prefix. I can remove that.
>
> > +#include <linux/types.h>
> > #include <linux/liveupdate.h>
> > +#include "pci.h"
> >
> > #define PCI_SUBSYSTEM_NAME "pci"
>
> I still don't know why this is needed, why?
Oh, this is requested by the LUO subsystem registration interface.
Pasha can comment more on the LUO subsystem API design. Each subsystem
will use a name to do the subsystem registration and lookup.
https://lore.kernel.org/linux-mm/20250929010321.3462457-10-pasha.tatashin@soleen.com/
>
> >
> > +static void stack_push_buses(struct list_head *stack, struct list_head *buses)
> > +{
> > + struct pci_bus *bus;
> > +
> > + list_for_each_entry(bus, buses, node)
> > + list_move_tail(&bus->dev.lu.lu_next, stack);
> > +}
> > +
> > +static void liveupdate_add_dev(struct device *dev, struct list_head *head)
> > +{
> > + dev_info(dev, "collect liveupdate device: flags %x\n", dev->lu.flags);
>
> Debugging code can go away please.
I was considering this as part of the standard kernel print out for
dmesg logging purposes, more than debugging. This is only triggered
when a liveupdate device is requested. A very small handful of the PCI
devices will be liveupdate. The same set of devices will expect to be
restored at the new kernel kexec. If the set of devices mismatch, that
is a bug. If it is not in the dmesg, it is very hard to find out the
set of devices was mismatched.
I consider it similar to when booting up the kernel some storage
driver will report founding storage device "/dev/sda" etc. Those are
not debugging prints.
Please let me know if that is not justifiable or there is another
mechanism to do the above logging.
> > + list_move_tail(&dev->lu.lu_next, head);
> > +}
> > +
> > +static int collect_bus_devices_reverse(struct pci_bus *bus, struct list_head *head)
> > +{
> > + struct pci_dev *pdev;
> > + int count = 0;
> > +
> > + list_for_each_entry_reverse(pdev, &bus->devices, bus_list) {
>
> Why are you allowed to walk the pci bus list here? Shouldn't there be
> some type of core function to do that?
Core function you mean the device core? This is the PCI liveupdate
core function.
>
> And why in reverse?
Very good question. Reverse is to allow the later created device to
mark the earlier created device as dependent. For example, the VF can
mark the PF as a dependent device.
> > + if (pdev->dev.lu.flags & LU_BUSMASTER && pdev->dev.parent)
> > + pdev->dev.parent->lu.flags |= LU_BUSMASTER_BRIDGE;
> > + if (pdev->dev.lu.flags) {
> > + liveupdate_add_dev(&pdev->dev, head);
> > + count++;
>
> No locking?
Locking in the parent calling function.
>
> > + }
> > + }
> > + return count;
>
> What prevents this value from changing right after you return it?
The liveupdate device collection is static per prepare() call back.
Each PCI device will need to go through the same set of callbacks:
prepare(), freeze() and after kexec, finish(). It will be a bug if
some device calls freeze() without prepare(). For each liveupdate
session, the number of devices partiticate liveupdate is fixed. e.g.
if a new VM tries to add another GPU device into the liveupdate set
after the prepare stage, it is not allowed. This list of liveupdate
devices will remain fixed during the liveupdate session.
Please see Pasha's LUO V4 patch for more detail of the LUO state and
callback when transitioning the state.
https://lore.kernel.org/linux-mm/20250929010321.3462457-1-pasha.tatashin@soleen.com/
> > +}
> > +
> > +static int build_liveupdate_devices(struct list_head *head)
> > +{
> > + LIST_HEAD(bus_stack);
> > + int count = 0;
> > +
> > + stack_push_buses(&bus_stack, &pci_root_buses);
> > +
> > + while (!list_empty(&bus_stack)) {
> > + struct device *busdev;
> > + struct pci_bus *bus;
> > +
> > + busdev = list_last_entry(&bus_stack, struct device, lu.lu_next);
> > + bus = to_pci_bus(busdev);
> > + if (!busdev->lu.visited && !list_empty(&bus->children)) {
> > + stack_push_buses(&bus_stack, &bus->children);
> > + busdev->lu.visited = 1;
> > + continue;
> > + }
> > +
> > + count += collect_bus_devices_reverse(bus, head);
> > + busdev->lu.visited = 0;
> > + list_del_init(&busdev->lu.lu_next);
> > + }
> > + return count;
>
> A comment here about what you are trying to do with walking the list of
This is the postraversal of the PCI bus. Make sure to visit the child
bus before the parent bus, so that the child bus can mark the parent
bus as dependent. e.g. If we want to preserve the bus master bit on a
leaf node PCI device, all the parent bridge up to the root bridge will
need to preserve the bus master bit as well. Otherwise the device will
not be able to DMA if the parent bridge does not have a busmaster.
> devices. Somehow. Are you sure that's right? It feels backwards, and
I am confident this is right. This liveupdate device list collection
is the core value of the LUO/PCI series. This way it only needs to
walk the tree just once. Rather than the two passes, first pass marks
the parent recursively to the root, then the second pass to walk the
tree to collect the bus/device that is marked.
Let me know if you think that is a bug somewhere. This code has been
running in our internal liveupdate kernel for vfio and DMA liveupdate
for over a month now.
> the lack of any locking makes me very nervous. How is this integrating
> into the normal driver model lists?
It does not integrate in the normal driver model lists. It is a new
list. Because the list needs to be fixed in a liveupdate session, it
is simpler to use a new list than walking the existing driver model
list to find out which device has the liveupdate flags. Because the VM
can add a GPU then remove the GPU before reaching to the kernel
liveupdate kexec, that also impacts the depended device (bridge/PF).
The liveupdate requested devices set is dynamic before prepare().
Creating a new list in prepare() is simpler.
> > +}
> > +
> > +static void cleanup_liveupdate_devices(struct list_head *head)
> > +{
> > + struct device *d, *n;
> > +
> > + list_for_each_entry_safe(d, n, head, lu.lu_next) {
> > + d->lu.flags &= ~LU_DEPENDED;
> > + list_del_init(&d->lu.lu_next);
> > + }
> > +}
>
> What does "cleanup" mean?
Cleanup means removing the dependent device flags, which is derivative
from the requested PCI devices. The devices are also removed from the
liveupdate device list because it is not in the liveupdate session any
more.
This clean up happens when transition from the "finish" to "normal"
state due to finish(), or from "prepare" to "normal" due to cancel().
>
> > +
> > static int pci_liveupdate_prepare(void *arg, u64 *data)
> > {
> > + LIST_HEAD(requested_devices);
> > +
> > pr_info("prepare data[%llx]\n", *data);
>
> Addresses written to the kernel log?
Ack, my bad. Will fix that, as point out by Jason as well.
>
> > +
> > + pci_lock_rescan_remove();
> > + down_write(&pci_bus_sem);
> > +
> > + build_liveupdate_devices(&requested_devices);
>
> Ah, you lock here. Document the heck out of this and put the proper
In this function or in build_liveupdate_devices() as well?
> build macros in there so we know what is going on.
build macros you mean the assert some lock was held that kind of the macro?
Let me know if you have other types of build macros in mind.
>
> > + cleanup_liveupdate_devices(&requested_devices);
> > +
> > + up_write(&pci_bus_sem);
>
> Why is it a write? You aren't modifying the list, are you?
I am modifying the requested_devices list but not the normal PCI
device driver model list. I think I can change to up_read().
>
> > + pci_unlock_rescan_remove();
> > return 0;
> > }
> >
> > diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> > index e8318fd5f6ed537a1b236a3a0f054161d5710abd..0e9ef387182856771d857181d88f376632b46f0d 100644
> > --- a/drivers/pci/pcie/portdrv.c
> > +++ b/drivers/pci/pcie/portdrv.c
> > @@ -304,6 +304,7 @@ static int pcie_device_init(struct pci_dev *pdev, int service, int irq)
> > device = &pcie->device;
> > device->bus = &pcie_port_bus_type;
> > device->release = release_pcie_device; /* callback to free pcie dev */
> > + dev_liveupdate_init(device);
>
> Why here?
Because the device is just allocated. I need to initialize the
device->lu.list pointers. Otherwise list debug will complain of NULL
pointers when added to the liveupdate device list.
>
> > dev_set_name(device, "%s:pcie%03x",
> > pci_name(pdev),
> > get_descriptor_id(pci_pcie_type(pdev), service));
> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index 4b8693ec9e4c67fc1655e0057b3b96b4098e6630..dddd7ebc03d1a6e6ee456e0bf02ab9833a819509 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -614,6 +614,7 @@ static struct pci_bus *pci_alloc_bus(struct pci_bus *parent)
> > INIT_LIST_HEAD(&b->devices);
> > INIT_LIST_HEAD(&b->slots);
> > INIT_LIST_HEAD(&b->resources);
> > + dev_liveupdate_init(&b->dev);
>
> Same, why here? Shouldn't the driver core be doing this all for you
> automatically? Are you going to make each bus do this manually?
No, the PCI enumeration happened way before the driver core was
registering the device. We already need to add the device to the
liveupdate device list during the PCI enumeration. That is before the
driver is bound and probed.
Yes, it needs to happen when the bus is allocated and initialized.
Earlier than driver init.
> > b->max_bus_speed = PCI_SPEED_UNKNOWN;
> > b->cur_bus_speed = PCI_SPEED_UNKNOWN;
> > #ifdef CONFIG_PCI_DOMAINS_GENERIC
> > @@ -1985,6 +1986,7 @@ int pci_setup_device(struct pci_dev *dev)
> > dev->sysdata = dev->bus->sysdata;
> > dev->dev.parent = dev->bus->bridge;
> > dev->dev.bus = &pci_bus_type;
> > + dev_liveupdate_init(&dev->dev);
>
> Looks like you are :(
Yes, I need it initialized earlier. Suggestions are welcome. I haven't
found a way to insert the dev_liveupdate_init() into some device init
function. The existing device init function was called too late.
> Do it in one place please.
Which place? If there is such a function called by all different
flavors of device and initialized early enough, I am happy to move
there. There is none as far as I can tell.
> > dev->hdr_type = hdr_type & 0x7f;
> > dev->multifunction = !!(hdr_type & 0x80);
> > dev->error_state = pci_channel_io_normal;
> > @@ -3184,7 +3186,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
> > return NULL;
> >
> > bridge->dev.parent = parent;
> > -
> > + dev_liveupdate_init(&bridge->dev);
>
> Again, one place.
Any suggestions where to move to. dev_liveupdate_init() is the one
place to perform the work. Just need to have multiple entrances. I
can't find an alternative yet.
> > list_splice_init(resources, &bridge->windows);
> > bridge->sysdata = sysdata;
> > bridge->busnr = bus;
> > diff --git a/include/linux/dev_liveupdate.h b/include/linux/dev_liveupdate.h
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..72297cba08a999e89f7bc0997dabdbe14e0aa12c
> > --- /dev/null
> > +++ b/include/linux/dev_liveupdate.h
> > @@ -0,0 +1,44 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Pasha Tatashin <pasha.tatashin@...een.com>
> > + * Chris Li <chrisl@...nel.org>
> > + */
> > +#ifndef _LINUX_DEV_LIVEUPDATE_H
> > +#define _LINUX_DEV_LIVEUPDATE_H
> > +
> > +#include <linux/liveupdate.h>
> > +
> > +#ifdef CONFIG_LIVEUPDATE
> > +
> > +enum liveupdate_flag {
> > + LU_BUSMASTER = 1 << 0,
> > + LU_BUSMASTER_BRIDGE = 2 << 0,
>
> BIT() please.
Ack. Will do.
>
> > +};
> > +
> > +#define LU_REQUESTED (LU_BUSMASTER)
> > +#define LU_DEPENDED (LU_BUSMASTER_BRIDGE)
>
> Why 2 names for the same thing?
LU_DEPENDED is for all dependent devices, the derivatives device gets
pulled into the liveupdate device set. When a liveupdate session gets
canceled. Those derivatives device sets need to be cleaned up. There
will be more liveupdate feature flags (e.g. LU_DMA, LU_SRIOV) added to
the requested and dependent flags. These two are the aggregate flags
for all requested features and dependent features.
> > +
> > +/**
> > + * struct dev_liveupdate - Device state for live update operations
> > + * @lu_next: List head for linking the device into live update
> > + * related lists (e.g., a list of devices participating
> > + * in a live update sequence).
> > + * @flags: Indicate what liveupdate feature does the device
> > + * participtate.
> > + * @visited: Only used by the bus devices when travese the PCI buses
> > + * to build the liveupdate devices list. Set if the child
> > + * buses have been pushed into the pending stack.
> > + *
> > + * This structure holds the state information required for performing
> > + * live update operations on a device. It is embedded within a struct device.
> > + */
> > +struct dev_liveupdate {
> > + struct list_head lu_next;
>
> Another list?
Yes, as explained earlier, a fixed list for the liveupdate session.
>
> > + enum liveupdate_flag flags;
> > + bool visited:1;
>
> You shouldn't need this, you "know" you only touch one device at a time
> when walking a bus, don't try to manually keep track of it on your own.
No, I do need this due to the postravesal visit of the bus. I need to
know if this is the first time I visit this bus, if it is, walk its
children bus first. Else means the second time you visit this bus, all
the children bus has been visited, now add this bus to the liveupdate
list if it has non zero liveupdate flags.
I can do recursive bus walking without using additional bits to
indicate if this is the first time I visit the bus. But recursive tree
walking in the kernel is considered bad due to the stack usage.
> And again, why is the pci core doing this, the driver core should be
> doing all of this, PLEASE do not bury driver-model-core-changes down in
The driver core does not have the knowledge of doing this, e.g. the PF
and VF relationship. The reason liveupdate struct was added to the
device struct is because device struct is embedded in pci_dev,
pci_host_bridge, pci_bus. That is the three structs I care about for
the liveupdate.
The alternative is adding livedupate struct in the above three structs
separately without touching device struct. The patch will be bigger if
possible. I recall having some problem due to the bus->bridge being a
device struct rather than pci_dev or pci_host_bridge. I can try that
again if you think that is better.
> a "PCI" patch. That will make the driver core maintainers very grumpy
> when they run across stuff like this (as it did here...)
Driver core I assume you mean the core around "struct device". As I
explained earlier, it needs PCI special knowledge outside of the
common driver core. Suggestion welcome how to unset you less :-)
> > +};
> > +
> > +#endif /* CONFIG_LIVEUPDATE */
> > +#endif /* _LINUX_DEV_LIVEUPDATE_H */
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 4940db137fffff4ceacf819b32433a0f4898b125..e0b35c723239f1254a3b6152f433e0412cd3fb34 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -21,6 +21,7 @@
> > #include <linux/lockdep.h>
> > #include <linux/compiler.h>
> > #include <linux/types.h>
> > +#include <linux/dev_liveupdate.h>
>
> Look, driver core changes. Please do this all in stuff that is NOT for
> just PCI.
But I only have PCI devices that are supported. Should I also consider
having an entry point in the device and then PCI as one of the
register device subsystems knows about livedupate and get called that
way?
Another way is just move livedupate all into PCI related struct only.
> > #include <linux/mutex.h>
> > #include <linux/pm.h>
> > #include <linux/atomic.h>
> > @@ -508,6 +509,7 @@ struct device_physical_location {
> > * @pm_domain: Provide callbacks that are executed during system suspend,
> > * hibernation, system resume and during runtime PM transitions
> > * along with subsystem-level and driver-level callbacks.
> > + * @lu: Live update state.
>
> You have more letters, please use them. "lu" is too short.
>
> > * @em_pd: device's energy model performance domain
> > * @pins: For device pin management.
> > * See Documentation/driver-api/pin-control.rst for details.
> > @@ -603,6 +605,10 @@ struct device {
> > struct dev_pm_info power;
> > struct dev_pm_domain *pm_domain;
> >
> > +#ifdef CONFIG_LIVEUPDATE
> > + struct dev_liveupdate lu;
> > +#endif
>
> Why not a pointer?
To avoid allocating additional memory failure during the repaire()
callback try to set a dependent device as liveupdate. Actually now
prepare() can be cancelled. I can make this as a pointer and
dynamically allocate the lu struct as well if that is prefered.
>
> > +
> > #ifdef CONFIG_ENERGY_MODEL
> > struct em_perf_domain *em_pd;
> > #endif
> > @@ -1168,4 +1174,13 @@ void device_link_wait_removal(void);
> > #define MODULE_ALIAS_CHARDEV_MAJOR(major) \
> > MODULE_ALIAS("char-major-" __stringify(major) "-*")
> >
> > +#ifdef CONFIG_LIVEUPDATE
> > +static inline void dev_liveupdate_init(struct device *dev)
> > +{
> > + INIT_LIST_HEAD(&dev->lu.lu_next);
>
> Why does this have to be in device.h? The driver core should do this
> for you (as I say above).
I need a more specific pointer which driver core function can do for
me. PCI device enumeration happens pretty early, that is before
registering the device.
Thanks for the long detailed feedback. I am still working on my way to
catch up on my email.
Chris
Powered by blists - more mailing lists