Message-ID: <87mtm2loml.fsf@redhat.com>
Date: Wed, 17 Nov 2021 17:42:58 +0100
From: Cornelia Huck <cohuck@...hat.com>
To: Yishai Hadas <yishaih@...dia.com>, alex.williamson@...hat.com,
bhelgaas@...gle.com, jgg@...dia.com, saeedm@...dia.com
Cc: linux-pci@...r.kernel.org, kvm@...r.kernel.org,
netdev@...r.kernel.org, kuba@...nel.org, leonro@...dia.com,
kwankhede@...dia.com, mgurtovoy@...dia.com, yishaih@...dia.com,
maorg@...dia.com
Subject: vfio migration discussions (was: [PATCH V2 mlx5-next 00/14] Add
mlx5 live migration driver)
Ok, here are the contents (as of 2021-11-17 16:30 UTC) of the etherpad at
https://etherpad.opendev.org/p/VFIOMigrationDiscussions -- in the hope
of providing a better starting point for further discussion (I know that
discussions are still ongoing in other parts of this thread; but
frankly, I'm getting a headache trying to follow them, and I think it
would be beneficial to concentrate on the fundamental questions
first...)
VFIO migration: current state and open questions
Current status
* Linux
* uAPI has been merged with a8a24f3f6e38 ("vfio: UAPI for migration
interface for device state") in 5.8
* no kernel user of the uAPI merged
* apparently several out-of-tree drivers exist
* support for mlx5 currently on the list (latest:
https://lore.kernel.org/all/20211027095658.144468-1-yishaih@nvidia.com/
with discussion still happening on older versions)
* support for HiSilicon ACC devices is on the list too. Adds support
for HiSilicon crypto accelerator VF device live migration. These are
simple DMA queue based PCIe integrated endpoint devices. No support
for P2P and doesn't use DMA for migration. <latest:
https://lore.kernel.org/lkml/20210915095037.1149-1-shameerali.kolothum.thodi@huawei.com/>
* QEMU
* basic support added in 5.2, some fixes later
* support for vfio-pci only so far, still experimental
("x-enable-migration") as of 6.2
* Only tested with out of tree drivers
* other software?
Problems/open questions
* Are the status bits currently defined in the uAPI
(_RESUMING/_SAVING/_RUNNING) sufficient to express all states we need?
* What does clearing _RUNNING imply? In particular, does it mean that
the device is frozen, or are some operations still permitted?
* various points brought up: P2P, SET_IRQS, ... <please summarize :)>:
* P2P DMA support between devices requires an additional HW control
state where the device can receive but not transmit DMA
* No definition of what HW needs to preserve when RESUMING toggles
off - (eg today SET_IRQS must work, what else?).
* In general, how do IRQs work with future non-trapping IMS?
* Dealing with pending IOMMU PRI faults during migration
* Problems identified with the !RUNNING state:
* When there are multiple devices in a user context (VM), we can't
atomically move all devices to the !_RUNNING state concurrently.
* Suggests the current uAPI has a usage restriction for
environments that do not make use of peer-to-peer DMA (ie. we
can't have a device generating DMA to a p2p target that cannot
accept it - especially if error response from target can generate
host fatal conditions)
* Possible userspace implications:
* VMs could be limited to a single device to guarantee that no
p2p exists - non-vfio devices generating physical p2p DMA in the
future is a concern
* Hypervisor may skip creating p2p DMA mappings, creating a
choice whether the VM supports migration or p2p
* Jason proposed a new NDMA (no-dma) state that seems to match the
mlx5 implementation of "quiesce" vs "freeze" states, where NDMA
would indicate the device cannot generate DMA or interrupts such
that once userspace places all devices into the (NDMA | RUNNING)
state the environment is fully quiesced. A flag or capability on
the migration region could indicate support for this feature.
* Alex proposed that this could be equally resolved within the
current device states if !RUNNING becomes the quiescent point
where the device stops generating DMA and interrupts, with a
requirement that the user moves all devices to !RUNNING before
collecting device migration data (as indicated by reading
pending_bytes) or else risk corrupting the migration data, which
the device could indicate via an errno in the migration process.
A flag or capability would still be required to indicate this
support.
* Jason does not favor this approach, objecting that the mode
transition is implicit, and still needs qemu changes anyhow
* In general, what operations or accesses is the user restricted
from performing on the device while !RUNNING?
* Jason has proposed very restricted access (essentially none
beyond the migration region itself), including no MMIO access
<20211028234750.GP2744544@...dia.com> This essentially imposes a
device transition to an intermediate state between SAVING and
RUNNING.
* Alex requested a formal uAPI update defining what accesses are
allowed, including which regions and ioctls.
* The existing uAPI does not require any such transition to a
"null" state or TBD new device state bit. QEMU currently
expects access to config space and the ability to call SET_IRQS
and create mmaps while in the RESUMING state, without the
RUNNING bit set. Restoring MSI-X interrupt configuration
necessarily requires MMIO access to the device.
* Jason suggested a new device state bit and user protocol to
account for this, where the device is in a !RUNNING and
!RESUMING state, but to some degree becomes manipulable via device
regions and ioctls. No compatibility mechanism proposed.
* Alex suggested that this is potentially supportable via a spec
clarification that requires the device migration data to be
written to completion before userspace performs other region or
ioctl access to the device. (mlx5's driver is designed to not
inspect the migration blob itself, so it can't detect the
"end". The migration blob is finished when mlx5 sees RESUMING
clear.)
* PRI into the guest (guest user process SVA) has a sequencing
problem with RUNNING - cannot migrate a vIOMMU in the middle of a
page fault, must stop and flush faults before stopping vCPUs
* The uAPI could benefit from some more detailed documentation
(e.g. how to use it, what to do in edge cases, ...) outside of the
header file.
* Trying to use the mlx5 support currently on the list has unearthed
some problems in QEMU <please summarize :)>
* Discussion regarding dirty tracking and how much it should be
controlled by user space still ongoing
* General questions:
* How much do we want to change the uAPI and/or the documentation to
accommodate what QEMU has implemented so far?
* How much do we want to change QEMU?
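As an illustration of the multi-device ordering problem listed above, here is
a minimal toy model (not kernel or QEMU code). It assumes that a device can
accept peer DMA only while _RUNNING is set, and that the proposed NDMA state
stops a device from initiating DMA while still letting it accept it. The
RUNNING value mirrors the v1 uAPI bit; the NDMA value and all function names
are hypothetical and exist only for this sketch.

```python
# RUNNING mirrors VFIO_DEVICE_STATE_RUNNING in the v1 uAPI.
RUNNING = 1 << 0
# HYPOTHETICAL: the proposed NDMA bit; this value is invented for the sketch.
NDMA = 1 << 3


def can_initiate_dma(state):
    """Assumption: a device generates DMA only while RUNNING and not NDMA."""
    return bool(state & RUNNING) and not (state & NDMA)


def can_accept_dma(state):
    """Assumption: a device can be a p2p target only while RUNNING."""
    return bool(state & RUNNING)


def p2p_hazard(states):
    """True if some device may still send DMA while a peer cannot accept it."""
    return (any(can_initiate_dma(s) for s in states)
            and not all(can_accept_dma(s) for s in states))


def stop_one_by_one(states):
    """Clear RUNNING device by device; return each intermediate snapshot."""
    states = list(states)
    snapshots = []
    for i in range(len(states)):
        states[i] &= ~RUNNING
        snapshots.append(tuple(states))
    return snapshots


def stop_with_ndma(states):
    """First set NDMA on every device, then clear RUNNING on every device."""
    states = list(states)
    snapshots = []
    for i in range(len(states)):
        states[i] |= NDMA
        snapshots.append(tuple(states))
    for i in range(len(states)):
        states[i] &= ~RUNNING
        snapshots.append(tuple(states))
    return snapshots


if __name__ == "__main__":
    vm = [RUNNING, RUNNING]
    print("one by one:", [p2p_hazard(s) for s in stop_one_by_one(vm)])
    print("with NDMA: ", [p2p_hazard(s) for s in stop_with_ndma(vm)])
```

In this model, stopping two devices one at a time passes through a snapshot
where the second device can still send DMA to an already stopped peer, while
the two-phase NDMA sequence never does. It says nothing about interrupts, PRI
faults, or what real hardware can implement.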
Possible solutions
* uAPI
* fine as is, or
* needs some clarifications, or
* needs rework, which might mean a v2
* QEMU
* fine as is (modulo bugfixes), or
* needs some rework, but not impacting the uAPI, or
* needs some rework, which also needs some changes in the uAPI
* Suggested approach:
* Work on the documentation, and try to come up with some more
HW-centric docs
* Depending on that, decide how many changes we want/need to do in
QEMU