[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20211105093145.386d0e89.alex.williamson@redhat.com>
Date: Fri, 5 Nov 2021 09:31:45 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@...wei.com>,
Cornelia Huck <cohuck@...hat.com>,
Yishai Hadas <yishaih@...dia.com>, bhelgaas@...gle.com,
saeedm@...dia.com, linux-pci@...r.kernel.org, kvm@...r.kernel.org,
netdev@...r.kernel.org, kuba@...nel.org, leonro@...dia.com,
kwankhede@...dia.com, mgurtovoy@...dia.com, maorg@...dia.com,
"Dr. David Alan Gilbert" <dgilbert@...hat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver
for mlx5 devices
On Fri, 5 Nov 2021 10:24:04 -0300
Jason Gunthorpe <jgg@...dia.com> wrote:
> On Wed, Nov 03, 2021 at 12:04:11PM -0600, Alex Williamson wrote:
>
> > We agreed that it's easier to add a feature than a restriction in a
> > uAPI, so how do we resolve that some future device may require a new
> > state in order to apply the SET_IRQS configuration?
>
> I would say don't support those devices. If there is even a hint that
> they could maybe exist then we should fix it now. Once the uapi is set
> and documented we should expect device makers to consider it when
> building their devices.
>
> As for SET_IRQs, I have been looking at making documentation and I
> don't like the way the documentation has to be wrriten because of
> this.
>
> What I see as an understandable, clear, documentation is:
>
> - SAVING set - no device touches allowed beyond migration operations
> and reset via XX
I'd suggest defining reset via ioctl only.
> Must be set with !RUNNING
Not sure what this means. Pre-copy requires SAVING and RUNNING
together, is this only suggesting that to get the final device state we
need to do so in a !RUNNING state?
> - RESUMING set - same as SAVING
I take it then that we're defining a new protocol if we can't do
SET_IRQS here.
> - RUNNING cleared - limited device touches in this list: SET_IRQs, XX
> config, XX.
> Device may assume no touches outside the above. (ie no MMIO)
> Implies NDMA
SET_IRQS is MMIO, is the distinction userspace vs kernel?
> - NDMA set - full device touches
> Device may not issue DMA or interrupts (??)
> Device may not dirty pages
Is this achievable? We can't bound the time where incoming DMA is
possible, devices don't have infinite buffers.
> - RUNNING set - full functionality
> * In no state may a device generate an error TLP, device
> hang/integrity failure or kernel intergity failure, no matter
> what userspace does.
> The device is permitted to corrupt the migration/VM or SEGV
> userspace if userspace doesn't follow the rules.
>
> (we are trying to figure out what the XX's are right now, would
> appreciate any help)
>
> This is something I think we could expect a HW engineering team to
> follow and implement in devices. It doesn't complicate things.
>
> Overall, at this moment, I would prioritize documentation clarity over
> strict compatability with qemu, because people have to follow this
> documentation and make their devices long into the future. If the
> documentation is convoluted for compatibility reasons HW people are
> more likely to get it wrong. When HW people get it wrong they are more
> likely to ask for "quirks" in the uAPI to fix their mistakes.
I might still suggest a v2 migration sub-type, we'll just immediately
deprecate the original as we have no users and QEMU would modify all
support to find only the new sub-type as code is updated. "v1" never
really materialized, but we can avoid future confusion if it's never
produced by in-tree drivers and never consume by mainstream userspace.
> The pending_bytes P2P idea is also quite complicated to document as
> now we have to describe an HW state not in terms of a NDMA control
> bit, but in terms of a bunch of implicit operations in a protocol. Not
> so nice.
>
> So, here is what I propose. Let us work on some documentation and come
> up with the sort of HW centric docs like above and we can then decide
> if we want to make the qemu changes it will imply, or not. We'll
> include the P2P stuff, as we see it, so it shows a whole picture.
>
> I think that will help everyone participate fully in the discussion.
Good plan.
> > If we're going to move forward with the existing uAPI, then we're going
> > to need to start factoring compatibility into our discussions of
> > missing states and protocols. For example, requiring that the device
> > is "quiesced" when the _RUNNING bit is cleared and "frozen" when
> > pending_bytes is read has certain compatibility advantages versus
> > defining a new state bit.
>
> Not entirely, to support P2P going from RESUMING directly to RUNNING
> is not possible. There must be an in between state that all devices
> reach before they go to RUNNING. It seems P2P cannot be bolted into
> the existing qmeu flow with a kernel only change?
Perhaps, yes.
> > clarifications were trying for within the existing uAPI rather than
> > toss out new device states and protocols at every turn for the sake of
> > API purity. The rate at which we're proposing new states and required
> > transitions without a plan for the uAPI is not where I want to be for
> > adding the driver that could lock us in to a supported uAPI. Thanks,
>
> Well, to be fair, the other cases I suggested new stats was when you
> asked about features we don't have at all today (like post-copy). I
> think adding new states is a very reasonable way to approach adding
> new features. As long as new features can be supported with new states
> we have a forward compatability story.
That has a viable upgrade path, I'm onboard with that. A device that
imposes it can't do SET_IRQS while RESUMING when we have no required
state in between RESUMING and RUNNING are the sorts of issues that I'm
going to get hung up on. I take it from the above that you're building
that state transition requirement into the uAPI now. Thanks,
Alex
Powered by blists - more mailing lists