linux-kernel - RE: [PATCH 1/2] Drivers: hv: vmbus: Wait for offers during boot

Message-ID:
 <SN6PR02MB4157D30B50EF5B58BDCBD423D44A2@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Mon, 28 Oct 2024 15:21:05 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Dexuan Cui <decui@...rosoft.com>, Naman Jain
	<namjain@...ux.microsoft.com>, KY Srinivasan <kys@...rosoft.com>, Haiyang
 Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>
CC: "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, John Starks
	<John.Starks@...rosoft.com>, "jacob.pan@...ux.microsoft.com"
	<jacob.pan@...ux.microsoft.com>, Easwar Hariharan
	<eahariha@...ux.microsoft.com>
Subject: RE: [PATCH 1/2] Drivers: hv: vmbus: Wait for offers during boot

From: Dexuan Cui <decui@...rosoft.com> Sent: Friday, October 25, 2024 11:19 AM
> 
> > From: Michael Kelley <mhklinux@...look.com>
> > Sent: Tuesday, October 22, 2024 11:04 AM
> > [...]
> > I wasn't aware of the VF handling. Where does the guest notify the
> > host that it is planning to hibernate? I went looking for such code, but
> > couldn't immediately find it.  Is it in the netvsc driver? Is this the
> > sequence?
> >
> > 1) The guest notifies the host of the hibernate
> > 2) The host sends a RESCIND_CHANNELOFFER message for each VF
> >     in the VM
> > 3) The guest waits for all VF rescind processing to complete, and
> >     also must ensure that no new VFs get added in the meantime
> > 4) Then the guest proceeds with the hibernation, knowing that there
> >     are no open channels for VF devices
> 
> When a hibernated VM resumes on a different host, it looks like the host team
> thinks that it's difficult to remember the VMBus device Instance GUID for the
> VF, and use the same GUID on the new host. When the new host uses a new
> Instance GUID for the VF, a Windows VM panics, and a Linux VM prints a
> warning and IIRC loses the ability to hibernate again due to a check in the
> VMBus driver.
> 
> So, as a workaround, the host team decides to remove the VF(s) before
> asking the VM to hibernate. The sequence of a "host-initiated VM hibernation"
> is:
> 1) a user clicks the "Hibernation" button on the portal (or uses the equivalent
> cmd line or API).
> 
> 2) Internally, the host temporarily disables AccelNet for the vNICs, i.e. sending
> PCI_EJECT and RESCIND_CHANNELOFFER for each VF.
> 
> 3) The guest responds accordingly, including sending PCI_EJECTION_COMPLETE
> and CHANNELMSG_RELID_RELEASED.
> 
> 4) Once the host knows that AccelNet has been disabled for the VM, the host
> sends a "please hibernate" message to the guest via the Shutdown IC.
> 
> 5) The guest proceeds with the hibernation, knowing that there are no open
> channels for VF devices and assuming no new VF will be offered during the
> hibernation operation.
> 
> 6) When the VM finishes hibernation and powers off, the host internally enables
> AccelNet for the VM so that when the VM resumes on a new host, the new host
> can offer a VF with a different VMBus device instance GUID.
> 
> The above is for a "host-initiated VM hibernation".
> 
> Currently, the Azure team doesn't support a "VM-initiated hibernation", where
> the host has no opportunity to temporarily disable AccelNet. Maybe
> "VM-initiated hibernation" can be supported when MANA-Direct is used (i.e.
> no more NetVSC NICs and there are only MANA VF NICs): in this scenario, I
> suppose the host must remember the MANA VF's VMBus device Instance GUID
> and use the same GUID on the new host.
> 
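
The ordering constraints in the host-initiated sequence quoted above can
be sketched as a small state machine. This is only an illustrative model
under stated assumptions: the class, state, and method names below are
hypothetical and are not kernel, VMBus, or Hyper-V APIs. It shows only
that the "please hibernate" request (step 4) is valid once every VF
removal has been acknowledged (step 3), and that AccelNet is re-enabled
only after the VM has powered off (step 6).

```python
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    VF_REMOVAL_PENDING = auto()   # host sent PCI_EJECT + RESCIND_CHANNELOFFER
    VFS_REMOVED = auto()          # guest acknowledged every VF removal
    HIBERNATE_REQUESTED = auto()  # host sent the request via the Shutdown IC
    HIBERNATED = auto()           # guest finished hibernation and powered off


class HostInitiatedHibernation:
    """Toy model of the quoted sequence; all names here are hypothetical."""

    def __init__(self, vf_count):
        self.open_vfs = vf_count
        self.accelnet_enabled = True
        self.state = State.RUNNING

    def _expect(self, state):
        if self.state is not state:
            raise RuntimeError(f"expected {state.name}, got {self.state.name}")

    def host_disable_accelnet(self):
        # Step 2: host ejects and rescinds each VF.
        self._expect(State.RUNNING)
        self.accelnet_enabled = False
        self.state = (State.VF_REMOVAL_PENDING if self.open_vfs
                      else State.VFS_REMOVED)

    def guest_ack_vf_removal(self):
        # Step 3: guest sends PCI_EJECTION_COMPLETE and
        # CHANNELMSG_RELID_RELEASED for one VF.
        self._expect(State.VF_REMOVAL_PENDING)
        self.open_vfs -= 1
        if self.open_vfs == 0:
            self.state = State.VFS_REMOVED

    def host_request_hibernate(self):
        # Step 4: only valid once no VF channels remain open.
        self._expect(State.VFS_REMOVED)
        self.state = State.HIBERNATE_REQUESTED

    def guest_hibernate(self):
        # Step 5: guest hibernates, assuming no new VF is offered meanwhile.
        self._expect(State.HIBERNATE_REQUESTED)
        self.state = State.HIBERNATED

    def host_reenable_accelnet(self):
        # Step 6: the next host may offer a VF with a different instance GUID.
        self._expect(State.HIBERNATED)
        self.accelnet_enabled = True
```

A VM-initiated hibernation would correspond to calling guest_hibernate()
directly from RUNNING, which this model rejects because the host never
had the chance to remove the VFs first.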

Thanks for the information, Dexuan! I'm thinking about hibernation
a bit more, and perhaps will write a Linux kernel documentation topic
under Documentation/virt/hyperv that covers the full set of scenarios.
The Hyper-V interactions and assumptions are more complex than I
had realized. Getting them formally documented should be helpful in
the long run.

Michael 

> > > The behavior we want is for the guest to hot remove the MLX device
> > > driver on resume, even if the MLX device was still present at suspend,
> > > so that the host does not need this special pre-hibernate behavior. This
> > > patch series may not be sufficient to ensure this, though. It just moves
> > > things in the right direction, by handling the all-offers-delivered
> > > message.
> 
> I'm not sure if it's a good idea to add new code to try to remove a
> stale MLX VF since the scenario should not exist on Azure nowadays
> (currently the host temporarily disables AccelNet during hibernation so there
> should be no stale MLX VF upon resume.)
> 
> On a local Hyper-V host, after a VM hibernates, we can manually disable
> AccelNet (i.e. NIC SR-IOV) for the VM, and the VM will see a stale unresponsive
> MLX VF upon resume. It would be tricky to clean up the VF gracefully:
> we would have to wait for the resume callback in the Mellanox VF driver
> to time out on the unresponsive VF (this can take 1 minute) and clean up the
> related VMBus pass-through device backing the VF; what happens if a
> host-initiated or VM-initiated hibernation is triggered during the 1 minute?
> I suspect there may be some tricky race condition issues, e.g. we may
> need to figure out how to synchronize the .resume with the .remove callbacks
> of the MLX driver.
> 
> I think the general assumption is that the VM's configuration should not
> change at all across hibernation, but it looks like this assumption turns
> out to be false under some conditions from time to time... I wish the
> assumption could always be true with OpenHCL.