Date: Mon, 26 Jun 2023 07:31:31 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: Jason Gunthorpe <jgg@...dia.com>
CC: Brett Creeley <brett.creeley@....com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
	"yishaih@...dia.com" <yishaih@...dia.com>,
	"shameerali.kolothum.thodi@...wei.com"
	<shameerali.kolothum.thodi@...wei.com>, "shannon.nelson@....com"
	<shannon.nelson@....com>
Subject: RE: [PATCH v10 vfio 4/7] vfio/pds: Add VFIO live migration support

> From: Jason Gunthorpe <jgg@...dia.com>
> Sent: Wednesday, June 21, 2023 9:27 PM
> 
> On Wed, Jun 21, 2023 at 06:49:12AM +0000, Tian, Kevin wrote:
> 
> > What is the criteria for 'reasonable'? How does CSPs judge that such
> > device can guarantee a *reliable* reasonable window so live migration
> > can be enabled in the production environment?
> 
> The CSP needs to work with the device vendor to understand how it fits
> into their system, I don't see how we can externalize this kind of
> detail in a general way.
> 
> > I'm afraid that we are hiding a non-deterministic factor in current protocol.
> 
> Yes
> 
> > But still I don't think it's a good situation where the user has ZERO
> > knowledge about the non-negligible time in the stopping path...
> 
> In any sane device design this will be a small period of time. These
> timeouts should be to protect against a device that has gone wild.
> 

Is there any example of how 'small' it will be (e.g. <1ms)?

Should we define a *reasonable* threshold in the VFIO community, and
require any new variant driver to provide the information needed to
judge it against that threshold?

If the worst-case stop time (assuming the device doesn't go wild) may
exceed the threshold then it's time to consider whether a new interface
is required to communicate such constraint to userspace.

The reason I keep discussing it is that IMHO achieving negligible
stop time is a very challenging task for many accelerators. For example,
IDXD can be stopped only after completing all pending requests. While
it allows software to configure the max pending work size (and a
reasonable setting could meet both the migration SLA and the performance
SLA), the worst-case draining latency could be in the tens of
milliseconds, which cannot be ignored by the VMM.
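
To make the arithmetic concrete, here is a rough sketch of how such a
worst-case drain bound might be estimated and compared against a
threshold. Everything in it is an assumption for illustration: the
function names, the simple "pending bytes / sustained throughput" model,
and the numbers are hypothetical and are not taken from IDXD or from any
VFIO interface.

```python
def worst_case_drain_ms(max_pending_descriptors: int,
                        max_bytes_per_descriptor: int,
                        throughput_bytes_per_s: int) -> float:
    """Upper bound on drain time, assuming the device must finish every
    pending descriptor at a fixed sustained throughput (hypothetical model)."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    pending_bytes = max_pending_descriptors * max_bytes_per_descriptor
    return pending_bytes * 1000 / throughput_bytes_per_s


def exceeds_threshold(drain_ms: float, threshold_ms: float) -> bool:
    """True if the estimated stop time is too large to be ignored."""
    return drain_ms > threshold_ms


# Hypothetical numbers: 100 pending descriptors of 1 MB each, drained at
# 10 GB/s, judged against a hypothetical 1 ms "negligible" threshold.
drain = worst_case_drain_ms(100, 1_000_000, 10_000_000_000)
needs_new_interface = exceeds_threshold(drain, threshold_ms=1.0)
```

Under this model the bound scales linearly with the configured max
pending work size, which is why that setting trades off against both
SLAs; real devices have further factors that this sketch ignores.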

Or do you think it's still better left to the CSP working with the
device vendor even in this case, given that the worst-case latency could
be affected by many factors and hence is not something a kernel driver
can accurately estimate?

Thanks
Kevin
