Message-ID: <87mtm2loml.fsf@redhat.com>
Date:   Wed, 17 Nov 2021 17:42:58 +0100
From:   Cornelia Huck <cohuck@...hat.com>
To:     Yishai Hadas <yishaih@...dia.com>, alex.williamson@...hat.com,
        bhelgaas@...gle.com, jgg@...dia.com, saeedm@...dia.com
Cc:     linux-pci@...r.kernel.org, kvm@...r.kernel.org,
        netdev@...r.kernel.org, kuba@...nel.org, leonro@...dia.com,
        kwankhede@...dia.com, mgurtovoy@...dia.com, yishaih@...dia.com,
        maorg@...dia.com
Subject: vfio migration discussions (was: [PATCH V2 mlx5-next 00/14] Add
 mlx5 live migration driver)

Ok, here's the contents (as of 2021-11-17 16:30 UTC) of the etherpad at
https://etherpad.opendev.org/p/VFIOMigrationDiscussions -- in the hope
of providing a better starting point for further discussion (I know that
discussions are still ongoing in other parts of this thread; but
frankly, I'm getting a headache trying to follow them, and I think it
would be beneficial to concentrate on the fundamental questions
first...)

VFIO migration: current state and open questions

Current status
  * Linux
    * uAPI has been merged with a8a24f3f6e38 ("vfio: UAPI for migration
    interface for device state") in 5.8
      * no kernel user of the uAPI merged
      * Several out of tree drivers apparently
    * support for mlx5 currently on the list (latest:
    https://lore.kernel.org/all/20211027095658.144468-1-yishaih@nvidia.com/
    with discussion still happening on older versions)
    * support for HiSilicon ACC devices is on the list too. Adds support
    for HiSilicon crypto accelerator VF device live migration. These are
    simple DMA queue based PCIe integrated endpoint devices. No support
    for P2P and doesn't use DMA for migration. <latest:
    https://lore.kernel.org/lkml/20210915095037.1149-1-shameerali.kolothum.thodi@huawei.com/>
  * QEMU
    * basic support added in 5.2, some fixes later
      * support for vfio-pci only so far, still experimental
      ("x-enable-migration") as of 6.2
      * Only tested with out of tree drivers
  * other software?
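
For readers who do not have the header at hand, the device_state bits
discussed below can be sketched as follows. The constants are meant to
mirror the v5.8 definitions in include/uapi/linux/vfio.h; the helper
describe_state() and the short labels it returns are illustrative
assumptions, not part of the uAPI.

```python
# Bit values intended to mirror device_state in include/uapi/linux/vfio.h
# (v5.8 migration uAPI); check the header before relying on them.
VFIO_DEVICE_STATE_STOP = 0
VFIO_DEVICE_STATE_RUNNING = 1 << 0
VFIO_DEVICE_STATE_SAVING = 1 << 1
VFIO_DEVICE_STATE_RESUMING = 1 << 2
VFIO_DEVICE_STATE_MASK = (VFIO_DEVICE_STATE_RUNNING
                          | VFIO_DEVICE_STATE_SAVING
                          | VFIO_DEVICE_STATE_RESUMING)


def describe_state(state):
    """Hypothetical helper: map a device_state value to a short label."""
    bits = state & VFIO_DEVICE_STATE_MASK
    labels = {
        VFIO_DEVICE_STATE_STOP: "stopped",
        VFIO_DEVICE_STATE_RUNNING: "running",
        VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING: "pre-copy",
        VFIO_DEVICE_STATE_SAVING: "stop-and-copy",
        VFIO_DEVICE_STATE_RESUMING: "resuming",
    }
    # Every other combination (e.g. _RESUMING together with _RUNNING or
    # _SAVING) is not a usable state under the current definition.
    return labels.get(bits, "invalid")
```

With only three bits, five usable combinations exist; any additional
quiescent state (such as the NDMA proposal below) would need either a
new bit or a redefinition of an existing combination.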

Problems/open questions
  * Are the status bits currently defined in the uAPI
  (_RESUMING/_SAVING/_RUNNING) sufficient to express all states we need?
  * What does clearing _RUNNING imply? In particular, does it mean that
  the device is frozen, or are some operations still permitted?
  * various points brought up: P2P, SET_IRQS, ... <please summarize :)>:
    * P2P DMA support between devices requires an additional HW control
    state where the device can receive but not transmit DMA
    * No definition of what HW needs to preserve when RESUMING toggles
    off (e.g. today SET_IRQS must work; what else?).
    * In general, how do IRQs work with future non-trapping IMS?
    * Dealing with pending IOMMU PRI faults during migration
  * Problems identified with the !RUNNING state:
    * When there are multiple devices in a user context (VM), we can't
    atomically move all devices to the !_RUNNING state concurrently.
      * Suggests the current uAPI has a usage restriction for
      environments that do not make use of peer-to-peer DMA (i.e. we
      can't have a device generating DMA to a p2p target that cannot
      accept it - especially if an error response from the target can
      generate host fatal conditions)
      * Possible userspace implications:
        * VMs could be limited to a single device to guarantee that no
        p2p exists - non-vfio devices generating physical p2p DMA in the
        future is a concern
        * Hypervisor may skip creating p2p DMA mappings, creating a
        choice whether the VM supports migration or p2p
      * Jason proposed a new NDMA (no-dma) state that seems to match the
      mlx5 implementation of "quiesce" vs "freeze" states, where NDMA
      would indicate the device cannot generate DMA or interrupts such
      that once userspace places all devices into the (NDMA | RUNNING)
      state the environment is fully quiesced.  A flag or capability on
      the migration region could indicate support for this feature.
      * Alex proposed that this could be equally resolved within the
      current device states if !RUNNING becomes the quiescent point
      where the device stops generating DMA and interrupts, with a
      requirement that the user moves all devices to !RUNNING before
      collecting device migration data (as indicated by reading
      pending_bytes) or else risk corrupting the migration data, which
      the device could indicate via an errno in the migration process.
      A flag or capability would still be required to indicate this
      support.
        * Jason does not favor this approach, objecting that the mode
        transition is implicit, and still needs qemu changes anyhow
    * In general, what operations or accesses is the user restricted
    from performing on the device while !RUNNING?
      * Jason has proposed very restricted access (essentially none
      beyond the migration region itself), including no MMIO access
      <20211028234750.GP2744544@...dia.com>. This essentially imposes
      a device transition to an intermediate state between SAVING and
      RUNNING.
        * Alex requested a formal uAPI update defining what accesses are
        allowed, including which regions and ioctls.
        * The existing uAPI does not require any such transition to a
        "null" state or TBD new device state bit.  QEMU currently
        expects access to config space and the ability to call SET_IRQS
        and create mmaps while in the RESUMING state, without the
        RUNNING bit set.  Restoring MSI-X interrupt configuration
        necessarily requires MMIO access to the device.
        * Jason suggested a new device state bit and user protocol to
        account for this, where the device is in a !RUNNING and
        !RESUMING state, but to some degree becomes manipulable via
        device regions and ioctls.  No compatibility mechanism proposed.
        * Alex suggested that this is potentially supportable via a spec
        clarification that requires the device migration data to be
        written to completion before userspace performs other region or
        ioctl access to the device. (mlx5's driver is designed to not
        inspect the migration blob itself, so it can't detect the
        "end". The migration blob is finished when mlx5 sees RESUMING
        clear.)
    * PRI into the guest (guest user process SVA) has a sequencing
    problem with RUNNING - a vIOMMU cannot be migrated in the middle of
    a page fault, so faults must be stopped and flushed before stopping
    vCPUs
  * The uAPI could benefit from some more detailed documentation
  (e.g. how to use it, what to do in edge cases, ...) outside of the
  header file.
  * Trying to use the mlx5 support currently on the list has unearthed
  some problems in QEMU <please summarize :)>
  * Discussion regarding dirty tracking and how much it should be
  controlled by user space still ongoing
  * General questions:
    * How much do we want to change the uAPI and/or the documentation to
    accommodate what QEMU has implemented so far?
    * How much do we want to change QEMU?
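
To make the multi-device problem above more concrete, here is a toy
model (not kernel or QEMU code) comparing a single-phase stop with the
two-phase sequence implied by the NDMA proposal. Everything in it is an
assumption made for illustration: the NDMA bit value, the Device class,
and the rule that a device accepts p2p DMA only while RUNNING is set.

```python
RUNNING = 1 << 0
NDMA = 1 << 3  # hypothetical bit; no value was proposed in the discussion


class Device:
    """Toy device holding only a state word (illustrative, not uAPI)."""

    def __init__(self, name):
        self.name = name
        self.state = RUNNING

    def can_initiate_dma(self):
        # Assumption: NDMA silences outgoing DMA while the device stays RUNNING.
        return bool(self.state & RUNNING) and not (self.state & NDMA)

    def can_accept_dma(self):
        # Assumption: a !RUNNING device cannot accept p2p DMA.
        return bool(self.state & RUNNING)


def p2p_hazard(devices):
    """True if some device may still issue DMA while a peer cannot accept it."""
    return any(a.can_initiate_dma() and not b.can_accept_dma()
               for a in devices for b in devices if a is not b)


def stop_one_phase(devices):
    """Clear RUNNING device by device; report whether a hazard window opened."""
    hazard = False
    for dev in devices:
        dev.state &= ~RUNNING
        hazard = hazard or p2p_hazard(devices)
    return hazard


def stop_two_phase(devices):
    """Set NDMA on every device first, then clear RUNNING on every device."""
    hazard = False
    for dev in devices:
        dev.state |= NDMA
        hazard = hazard or p2p_hazard(devices)
    for dev in devices:
        dev.state &= ~RUNNING
        hazard = hazard or p2p_hazard(devices)
    return hazard
```

In this model the single-phase stop opens a window as soon as the first
of two devices is stopped, while the two-phase sequence never does; the
model says nothing about interrupts, PRI, or what a real device does
with in-flight transactions.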

Possible solutions
  * uAPI
    * fine as is, or
    * needs some clarifications, or
    * needs rework, which might mean a v2
  * QEMU
    * fine as is (modulo bugfixes), or
    * needs some rework, but not impacting the uAPI, or
    * needs some rework, which also needs some changes in the uAPI
  * Suggested approach:
    * Work on the documentation, and try to come up with some more
    HW-centric docs
    * Depending on that, decide how many changes we want/need to do in
    QEMU