Message-ID: <BN9PR11MB527618B02E6BAF6EF03D2E0C8C3C9@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Wed, 23 Feb 2022 02:02:07 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: Alex Williamson <alex.williamson@...hat.com>,
Jason Gunthorpe <jgg@...dia.com>
CC: Yishai Hadas <yishaih@...dia.com>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"saeedm@...dia.com" <saeedm@...dia.com>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"kuba@...nel.org" <kuba@...nel.org>,
"leonro@...dia.com" <leonro@...dia.com>,
"kwankhede@...dia.com" <kwankhede@...dia.com>,
"mgurtovoy@...dia.com" <mgurtovoy@...dia.com>,
"maorg@...dia.com" <maorg@...dia.com>,
"cohuck@...hat.com" <cohuck@...hat.com>,
"Raj, Ashok" <ashok.raj@...el.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>
Subject: RE: [PATCH V8 mlx5-next 09/15] vfio: Define device migration protocol v2

> From: Alex Williamson <alex.williamson@...hat.com>
> Sent: Wednesday, February 23, 2022 9:10 AM
> > > > + * The kernel migration driver must fully transition the device to the new state
> > > > + * value before the operation returns to the user.
> > >
> > > The above statement certainly doesn't preclude asynchronous
> > > availability of data on the stream FD, but it does demand that the
> > > device state transition itself is synchronous and cannot be
> > > shortcut. If the state transition itself exceeds migration SLAs, we're
> > > in a pickle. Thanks,
> >
> > Even if the commands were async, it is not easy to believe a device
> > can instantaneously abort an arc when a timer hits and return to full
> > operation. For instance, mlx5 can't do this.
> >
> > The vCPU cannot be restarted to try to meet the SLA until a command
> > going back to RUNNING returns.
> >
> > If we want to have a SLA feature it feels better to pass in the
> > deadline time as part of the set state ioctl and the driver can then
> > internally do something appropriate and not have to figure out how to
> > juggle an external abort. The driver would be expected to return fully
> > completed from STOP or return back to RUNNING before the deadline.
> >
> > For instance mlx5 could possibly implement this by checking the
> > migration size and doing some maths before deciding if it should
> > commit to its unabortable device command.
> >
> > I have a feeling supporting SLA means devices are going to have to
> > report latencies for various arcs and work in a more classical
> > realtime deadline oriented way overall. Estimating the transfer
> > latency and size is another factor too.
> >
> > Overall, this SLA topic looks quite big to me, and I think a full
> > solution will come with many facets. We are also quite interested in
> > dirty rate limiting, for instance.
>
> So if/when we were to support this, we might use a different SET_STATE
> feature ioctl that allows the user to specify a deadline and we'd use
> feature probing or a flag on the migration feature for userspace to
> discover this? I'd be ok with that, I just want to make sure we have
> agreeable options to support it. Thanks,
>
Or use a different device_feature ioctl that lets the user set a deadline
for each arc before changing the device state, and then reuse the
existing SET_STATE semantics, with the migration driver doing the
estimation internally based on the pre-configured constraints...
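
To make the idea concrete, here is a rough userspace sketch, untested
against any real driver. Every name in it (mig_dev, mig_arc,
mig_set_arc_deadline, mig_set_state) is hypothetical and not part of the
VFIO uAPI, the cost model is a made-up linear estimate, and the
-ETIMEDOUT return value is only an assumption about how a refusal might
be reported:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical arcs; these names are illustrative, not VFIO uAPI. */
enum mig_arc { ARC_RUNNING_TO_STOP, ARC_STOP_TO_RUNNING, ARC_COUNT };

/* Hypothetical per-device bookkeeping kept by a migration driver. */
struct mig_dev {
    uint64_t deadline_ns[ARC_COUNT]; /* 0 means no deadline configured */
    uint64_t state_bytes;            /* estimated size of device state */
    uint64_t bytes_per_us;           /* estimated transfer throughput */
    uint64_t fixed_cost_ns;          /* fixed latency of the device command */
};

/* Stand-in for a device_feature call that pre-configures one arc's deadline. */
static int mig_set_arc_deadline(struct mig_dev *d, enum mig_arc arc,
                                uint64_t deadline_ns)
{
    if ((unsigned int)arc >= ARC_COUNT)
        return -EINVAL;
    d->deadline_ns[arc] = deadline_ns;
    return 0;
}

/* Driver-internal estimate; ignores overflow for very large state sizes. */
static uint64_t mig_estimate_ns(const struct mig_dev *d)
{
    assert(d->bytes_per_us > 0);
    return d->fixed_cost_ns + d->state_bytes * 1000 / d->bytes_per_us;
}

/*
 * Stand-in for SET_STATE with unchanged arguments: refuse up front if the
 * estimate exceeds the pre-configured deadline, otherwise commit.
 */
static int mig_set_state(struct mig_dev *d, enum mig_arc arc)
{
    uint64_t deadline;

    if ((unsigned int)arc >= ARC_COUNT)
        return -EINVAL;
    deadline = d->deadline_ns[arc];
    if (deadline != 0 && mig_estimate_ns(d) > deadline)
        return -ETIMEDOUT; /* assumed error code: device stays in old state */
    /* A real driver would issue its unabortable device command here. */
    return 0;
}
```

The point of the split is that the state-change call keeps its current
shape: userspace that never configures a deadline sees no difference,
and the driver makes the go/no-go decision before committing to the
device command rather than trying to abort it midway.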
Thanks
Kevin