lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 20 Jun 2023 09:31:04 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: "Tian, Kevin" <kevin.tian@...el.com>
Cc: Brett Creeley <brett.creeley@....com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"yishaih@...dia.com" <yishaih@...dia.com>,
	"shameerali.kolothum.thodi@...wei.com" <shameerali.kolothum.thodi@...wei.com>,
	"shannon.nelson@....com" <shannon.nelson@....com>
Subject: Re: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support

On Tue, Jun 20, 2023 at 02:02:44AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@...dia.com>
> > Sent: Monday, June 19, 2023 8:47 PM
> > 
> > On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:
> > 
> > > Ideally the VMM has an estimation how long a VM can be paused based on
> > > SLA, to-be-migrated state size, available network bandwidth, etc. and that
> > > hint should be passed to the kernel so any state transition which may
> > violate
> > > that expectation can fail quickly to break the migration process and put the
> > > VM back to the running state.
> > >
> > > Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> > 
> > It is handled through the vfio_device_feature_mig_data_size mechanism..
> 
> that is only for estimation of copied data.
> 
> IMHO the stop time when the VM is paused includes both the time of
> stopping the device and the time of migrating the VM state.
> 
> For a software-emulated device the time of stopping the device is negligible.
> 
> But certainly for assigned device the worst-case hard-coded 5s timeout as
> done in this patch will kill whatever reasonable 'VM dead time' SLA (usually
> in milliseconds) which CSPs try to meet purely based on the size of copied
> data.

There is not alot that can be done here, the stop time cannot be
predicted in advance on these devices - the system relies on the
device having a reasonable time window.

> Wouldn't a user-specified stop-device timeout be required to at least allow
> breaking migration early according to the desired SLA?

Not really, the device is going to still execute the stop regardless
of the timeout, and when it does the VM will be broken.

With a FW approach like this it is pretty stuck, we need the FW to
remain in sync as the highest priority.

> > We want new devices to get their architecture right, they need to
> > support P2P. Didn't we talk about this already and Brett was going to
> > fix it?
> 
> Looks it's not fixed since RUNNING_P2P->STOP is a nop in this patch.

That could be OK, it needs a comment explaining why it is OK

Jason

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ