[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210929182642.GV964074@nvidia.com>
Date: Wed, 29 Sep 2021 15:26:42 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: Leon Romanovsky <leon@...nel.org>,
Doug Ledford <dledford@...hat.com>,
Yishai Hadas <yishaih@...dia.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Kirti Wankhede <kwankhede@...dia.com>, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
linux-rdma@...r.kernel.org, netdev@...r.kernel.org,
Saeed Mahameed <saeedm@...dia.com>,
Cornelia Huck <cohuck@...hat.com>
Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state
transition validity
On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:16:02 -0300
> Jason Gunthorpe <jgg@...dia.com> wrote:
>
> > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > > On Tue, 28 Sep 2021 16:35:50 -0300
> > > Jason Gunthorpe <jgg@...pe.ca> wrote:
> > >
> > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > > >
> > > > > In defining the device state, we tried to steer away from defining it
> > > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > > that could be used to support that API to leave us some degree of
> > > > > independence that QEMU implementation might evolve.
> > > >
> > > > That is certainly a different perspective, it would have been
> > > > better to not express this idea as a FSM in that case...
> > > >
> > > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > > combination of (un)freeze, (un)quiesce and so on so each state
> > > > reflects a defined operation of the device?
> > >
> > > I'd expect so, for instance the implementation of entering the _STOP
> > > state presumes a previous state that where the device is apparently
> > > already quiesced. That doesn't support a direct _RUNNING -> _STOP
> > > transition where I argued in the linked threads that those states
> > > should be reachable from any other state. Thanks,
> >
> > If we focus on mlx5 there are two device 'flags' to manage:
> > - Device cannot issue DMAs
> > - Device internal state cannot change (ie cannot receive DMAs)
> >
> > This is necessary to co-ordinate across multiple devices that might be
> > doing peer to peer DMA between them. The whole multi-device complex
> > should be moved to "cannot issue DMA's" then the whole complex would
> > go to "state cannot change" and be serialized.
>
> Are you anticipating p2p from outside the VM? The typical scenario
> here would be that p2p occurs only intra-VM, so all the devices would
> stop issuing DMA (modulo trying to quiesce devices simultaneously).
Inside the VM.
Your 'modulo trying to quiesce devices simultaneously' is correct -
this is a real issue that needs to be solved.
If we put one device in a state where it's internal state is immutable
it can no longer accept DMA messages from the other devices. So there
are two states in the HW model - do not generate DMAs and finally the
immutable internal state where even external DMAs are refused.
> > The expected sequence at the device is thus
> >
> > Resuming
> > full stop -> does not issue DMAs -> full operation
> > Suspend
> > full operation -> does not issue DMAs -> full stop
> >
> > Further the device has two actions
> > - Trigger serializating the device state
> > - Trigger de-serializing the device state
> >
> > So, what is the behavior upon each state:
> >
> > * 000b => Device Stopped, not saving or resuming
> > Does not issue DMAs
> > Internal state cannot change
> >
> > * 001b => Device running, which is the default state
> > Neither flags
> >
> > * 010b => Stop the device & save the device state, stop-and-copy state
> > Does not issue DMAs
> > Internal state cannot change
> >
> > * 011b => Device running and save the device state, pre-copy state
> > Neither flags
> > (future, DMA tracking turned on)
> >
> > * 100b => Device stopped and the device state is resuming
> > Does not issue DMAs
> > Internal state cannot change
>
> cannot change... other that as loaded via migration region.
Yes
> The expected protocol is that if the user write to the device_state
> register returns an errno, the user reevaluates the device_state to
> determine if the desired transition is unavailable (previous state
> value is returned) or generated a fault (error state value
> returned).
Hmm, interesting, mlx5 should be doing this as well. Eg resuming with
corrupt state should fail and cannot be recovered except via reset.
> The 101b state indicates _RUNNING while _RESUMING, which is simply not
> a mode that has been spec'd at this time as it would require some
> mechanism for the device to fault in state on demand.
So lets error on these requests since we don't know what state to put
the device into.
> > The two actions:
> > trigger serializing the device state
> > Done when asked to go to 010b ?
>
> When the _SAVING bit is set. The exact mechanics depends on the size
> and volatility of the device state. A GPU might begin in pre-copy
> (011b) to transmit chunks of framebuffer data, recording hashes of
> blocks read by the user to avoid re-sending them during the
> stop-and-copy (010b) phase.
Here I am talking specifically about mlx5 which does not have a state
capture in pre-copy. So mlx5 should capture state on 010b only, and
the 011b is a NOP.
> > trigger de-serializing the device state
> > Done when transition from 100b -> 000b ?
>
> 100b -> 000b is not a required transition, generally this would be 100b
> -> 001b, ie. end state of _RUNNING vs _STOP.
Sorry, I typo'd it, yes to _RUNNING
> I think the requirement is that de-serialization is complete when the
> _RESUMING bit is cleared. Whether the driver chooses to de-serialize
> piece-wise as each block of data is written to the device or in bulk
> from a buffer is left to the implementation. In either case, the
> driver can fail the transition to !_RESUMING if the state is incomplete
> or otherwise corrupt. It would again be the driver's discretion if
> the device enters the error state or remains in _RESUMING. If the user
> has no valid state with which to exit the _RESUMING phase, a device
> reset should return the device to _RUNNING with a default initial state.
That makes sense enough.
> > There is a missing state "Stop Active Transactions" which would be
> > only "does not issue DMAs". I've seen a proposal to add that.
>
> This would be to get all devices to stop issuing DMA while internal
> state can be modified to avoid the synchronization issue of trying to
> stop devices concurrently?
Yes, as above
> For PCI devices we obviously have the bus master bit to manage that,
> but I could see how a migration extension for such support (perhaps
> even just wired through to BM for PCI) could be useful. Thanks,
I'm nervous to override the BM bit for something like this, the BM bit
isn't a gentle "please coherently stop what you are doing" it is a
hanbrake the OS pulls to ensure any PCI device becomes quiet.
Thanks,
Jason
Powered by blists - more mailing lists