lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Tue, 20 Jun 2023 02:02:44 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Brett Creeley <brett.creeley@....com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"yishaih@...dia.com" <yishaih@...dia.com>,
	"shameerali.kolothum.thodi@...wei.com"
	<shameerali.kolothum.thodi@...wei.com>, "shannon.nelson@....com"
	<shannon.nelson@....com>
Subject: RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support

> From: Jason Gunthorpe <jgg@...dia.com>
> Sent: Monday, June 19, 2023 8:47 PM
> 
> On Fri, Jun 16, 2023 at 08:06:21AM +0000, Tian, Kevin wrote:
> 
> > Ideally the VMM has an estimation how long a VM can be paused based on
> > SLA, to-be-migrated state size, available network bandwidth, etc. and that
> > hint should be passed to the kernel so any state transition which may
> violate
> > that expectation can fail quickly to break the migration process and put the
> > VM back to the running state.
> >
> > Jason/Shameer, is there similar concern in mlx/hisilicon drivers?
> 
> It is handled through the vfio_device_feature_mig_data_size mechanism..

that is only for estimation of copied data.

IMHO the stop time when the VM is paused includes both the time of
stopping the device and the time of migrating the VM state.

For a software-emulated device the time of stopping the device is negligible.

But certainly for assigned device the worst-case hard-coded 5s timeout as
done in this patch will kill whatever reasonable 'VM dead time' SLA (usually
in milliseconds) which CSPs try to meet purely based on the size of copied
data.

Wouldn't a user-specified stop-device timeout be required to at least allow
breaking migration early according to the desired SLA?

> 
> > > +	if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && next ==
> > > VFIO_DEVICE_STATE_STOP)
> > > +		return NULL;
> >
> > I'm not sure whether P2P is actually supported here. By definition
> > P2P means the device is stopped but still responds to p2p request
> > from other devices. If you look at mlx example it uses different
> > cmds between RUNNING->RUNNING_P2P and RUNNING_P2P->STOP.
> >
> > But in your case seems you simply move what is required in STOP
> > into P2P. Probably you can just remove the support of P2P like
> > hisilicon does.
> 
> We want new devices to get their architecture right, they need to
> support P2P. Didn't we talk about this already and Brett was going to
> fix it?
> 

Looks it's not fixed since RUNNING_P2P->STOP is a nop in this patch.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ