lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211115232921.GV2105516@nvidia.com>
Date:   Mon, 15 Nov 2021 19:29:21 -0400
From:   Jason Gunthorpe <jgg@...dia.com>
To:     Alex Williamson <alex.williamson@...hat.com>
Cc:     Shameerali Kolothum Thodi <shameerali.kolothum.thodi@...wei.com>,
        Cornelia Huck <cohuck@...hat.com>,
        Yishai Hadas <yishaih@...dia.com>, bhelgaas@...gle.com,
        saeedm@...dia.com, linux-pci@...r.kernel.org, kvm@...r.kernel.org,
        netdev@...r.kernel.org, kuba@...nel.org, leonro@...dia.com,
        kwankhede@...dia.com, mgurtovoy@...dia.com, maorg@...dia.com,
        "Dr. David Alan Gilbert" <dgilbert@...hat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver
 for mlx5 devices

On Fri, Nov 05, 2021 at 09:31:45AM -0600, Alex Williamson wrote:
> On Fri, 5 Nov 2021 10:24:04 -0300
> Jason Gunthorpe <jgg@...dia.com> wrote:
> 
> > On Wed, Nov 03, 2021 at 12:04:11PM -0600, Alex Williamson wrote:
> > 
> > > We agreed that it's easier to add a feature than a restriction in a
> > > uAPI, so how do we resolve that some future device may require a new
> > > state in order to apply the SET_IRQS configuration?  
> > 
> > I would say don't support those devices. If there is even a hint that
> > they could maybe exist then we should fix it now. Once the uapi is set
> > and documented we should expect device makers to consider it when
> > building their devices.
> > 
> > As for SET_IRQs, I have been looking at making documentation and I
> > don't like the way the documentation has to be wrriten because of
> > this.
> > 
> > What I see as an understandable, clear, documentation is:
> > 
> >  - SAVING set - no device touches allowed beyond migration operations
> >    and reset via XX
> 
> I'd suggest defining reset via ioctl only.
> 
> >    Must be set with !RUNNING
> 
> Not sure what this means.  Pre-copy requires SAVING and RUNNING
> together, is this only suggesting that to get the final device state we
> need to do so in a !RUNNING state?

Sorry, I did not think about pre-copy here, mlx5 doesn't do it so I'm
not as familiar

> >  - RESUMING set - same as SAVING
> 
> I take it then that we're defining a new protocol if we can't do
> SET_IRQS here.

We've been working on some documentation and one of the challenges
turns out that all the PCI device state owned by other subsystems (eg
the PCI core, the interrupt code, power management, etc) must be kept
in sync. No matter what RESUMING cannot just async change device state
that the kernel assumes it is controlling.

So, in practice, this necessarily requires forbidding the device from
touching the MSI table, and other stuff, during RESUMING.

Further, since we can't just halt all the other kernel subsystems
during SAVING/RESUMING the device must be able to accept touches in
those areas, for completely unrelated reasons, (eg a MSI addr/data
being changed) safely.

Seems like no need to change SET_IRQs.


> >  - NDMA set - full device touches
> >    Device may not issue DMA or interrupts (??)
> >    Device may not dirty pages
> 
> Is this achievable?  We can't bound the time where incoming DMA is
> possible, devices don't have infinite buffers.

It is a necessary evil for migration. 

The device cannot know how long it will be suspended for and must
cope. With networking discarded packets can be resent, but the reality
is that real deployments need a QOS that the device will not be paused
for too long otherwise the peers may declare the node dead.

> > Not entirely, to support P2P going from RESUMING directly to RUNNING
> > is not possible. There must be an in between state that all devices
> > reach before they go to RUNNING. It seems P2P cannot be bolted into
> > the existing qmeu flow with a kernel only change?
> 
> Perhaps, yes.

We have also been looking at dirty tracking and we are wondering how
that should work. (Dirty tracking will be another followup)

If we look at mlx5, it will have built in dirty tracking, and when
used with a newer IOMMUs there is also system dirty tracking
available.

I think userspace should decide if it wants to use mlx5 built in or
the system IOMMU to do dirty tracking.

Presumably the system IOMMU is turned on via
VFIO_IOMMU_DIRTY_PAGES_FLAG_START, but what controls if the mlx5
mechanism should be used or not?

mlx5 also has no way to return the dirty log. If the system IOMMU is
not used then VFIO_IOMMU_DIRTY_PAGES_FLAG_START should not be done,
however that is what controls all the logic under the two GET_BITMAP
APIs. (even if fixed I don't really like the idea of the IOMMU
extracting this data from the migration driver in the context of
iommufd)

Further how does mlx5 even report that it has dirty tracking?

Was there some plan here we are missing?

In light of all this I'm wondering if device dirty tracking should
exist as new ioctls on the device FD and reserve the type1 code to
only work the IOMMU dirty tracking.

Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ