linux-kernel - RE: [PATCH 1/2] Drivers: hv: vmbus: Wait for offers during boot


Message-ID:
 <SA1PR21MB1317FAD9C895C085F8872899BF4F2@SA1PR21MB1317.namprd21.prod.outlook.com>
Date: Fri, 25 Oct 2024 18:18:59 +0000
From: Dexuan Cui <decui@...rosoft.com>
To: Michael Kelley <mhklinux@...look.com>, Naman Jain
	<namjain@...ux.microsoft.com>, KY Srinivasan <kys@...rosoft.com>, Haiyang
 Zhang <haiyangz@...rosoft.com>, Wei Liu <wei.liu@...nel.org>
CC: "linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, John Starks
	<John.Starks@...rosoft.com>, "jacob.pan@...ux.microsoft.com"
	<jacob.pan@...ux.microsoft.com>, Easwar Hariharan
	<eahariha@...ux.microsoft.com>
Subject: RE: [PATCH 1/2] Drivers: hv: vmbus: Wait for offers during boot

> From: Michael Kelley <mhklinux@...look.com>
> Sent: Tuesday, October 22, 2024 11:04 AM
> [...]
> I wasn't aware of the VF handling. Where does the guest notify the
> host that it is planning to hibernate? I went looking for such code, but
> couldn't immediately find it.  Is it in the netvsc driver? Is this the
> sequence?
> 
> 1) The guest notifies the host of the hibernate
> 2) The host sends a RESCIND_CHANNELOFFER message for each VF
>     in the VM
> 3) The guest waits for all VF rescind processing to complete, and
>     also must ensure that no new VFs get added in the meantime
> 4) Then the guest proceeds with the hibernation, knowing that there
>     are no open channels for VF devices

When a hibernated VM resumes on a different host, it looks like the host team
finds it difficult to remember the VMBus device Instance GUID of the VF and
to reuse the same GUID on the new host. When the new host uses a new
Instance GUID for the VF, a Windows VM panics, and a Linux VM prints a
warning and IIRC loses the ability to hibernate again due to a check in the
VMBus driver.
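
To make the mismatch concrete, here is a toy Python model (not the VMBus
driver's actual check). It assumes each device can be identified by a
(type GUID, Instance GUID) pair; the function name and the GUID values are
made up for the illustration.

```python
import uuid


def compare_offers(saved, offered):
    """Compare device identities saved at hibernation with those offered on resume.

    Each argument is an iterable of (type_guid, instance_guid) pairs.
    """
    saved, offered = set(saved), set(offered)
    return {"missing": saved - offered, "unexpected": offered - saved}


# Placeholder GUIDs, invented for this example only.
VF_TYPE = uuid.UUID(int=1)
before = {(VF_TYPE, uuid.UUID(int=100))}  # VF present when the VM hibernated
after = {(VF_TYPE, uuid.UUID(int=200))}   # new host offers a new Instance GUID

diff = compare_offers(before, after)
```

Here the old VF shows up as "missing" and the new one as "unexpected", even
though both are the same kind of device.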

So, as a workaround, the host team decided to remove the VF(s) before
asking the VM to hibernate. The sequence of a "host-initiated VM hibernation"
is:
1) A user clicks the "Hibernation" button on the portal (or uses the equivalent
command line or API).

2) Internally, the host temporarily disables AccelNet for the vNICs, i.e. it
sends PCI_EJECT and RESCIND_CHANNELOFFER for each VF.

3) The guest responds accordingly, including sending PCI_EJECTION_COMPLETE
and CHANNELMSG_RELID_RELEASED.

4) Once the host knows that AccelNet has been disabled for the VM, the host
sends a "please hibernate" message to the guest via the Shutdown IC.

5) The guest proceeds with the hibernation, knowing that there are no open
channels for VF devices and assuming no new VF will be offered during the
hibernation operation.

6) When the VM finishes hibernation and powers off, the host internally enables
AccelNet for the VM so that when the VM resumes on a new host, the new host
can offer a VF with a different VMBus device instance GUID.

The above is for a "host-initiated VM hibernation".
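
Steps 2) to 5) can be sketched as a small Python model. This is only an
illustration of the ordering, not host or kernel code: the class and function
names and the "SHUTDOWN_IC_HIBERNATE" label are hypothetical, and the
interleaving of messages within a step is simplified.

```python
from dataclasses import dataclass, field


@dataclass
class Guest:
    """Toy guest that tracks which VF channels it still has open."""
    open_vfs: set = field(default_factory=set)
    hibernated: bool = False

    def handle_vf_removal(self, vf):
        # Step 3: tear down the VF and acknowledge to the host.
        self.open_vfs.discard(vf)
        return ["PCI_EJECTION_COMPLETE", "CHANNELMSG_RELID_RELEASED"]

    def handle_hibernate_request(self):
        # Step 5: the guest relies on there being no open VF channels.
        if self.open_vfs:
            raise RuntimeError("hibernate requested with VFs still present")
        self.hibernated = True


def host_initiated_hibernation(guest):
    """Return the ordered (sender, message, vf) trace for steps 2-5."""
    trace = []
    for vf in sorted(guest.open_vfs):
        # Step 2: the host disables AccelNet by removing each VF.
        trace.append(("host", "PCI_EJECT", vf))
        trace.append(("host", "RESCIND_CHANNELOFFER", vf))
        for msg in guest.handle_vf_removal(vf):
            trace.append(("guest", msg, vf))
    # Step 4: only now does the host ask the guest to hibernate.
    trace.append(("host", "SHUTDOWN_IC_HIBERNATE", None))
    guest.handle_hibernate_request()
    return trace
```

The point of the model is the ordering constraint: the hibernate request is
the last message, sent only after every VF has been acknowledged as removed.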

Currently, the Azure team doesn't support a "VM-initiated hibernation", where
the host has no opportunity to temporarily disable AccelNet. Maybe 
"VM-initiated hibernation" can be supported when MANA-Direct is used (i.e.
no more NetVSC NICs and there are only MANA VF NICs): in this scenario, I
suppose the host must remember the MANA VF's VMBus device Instance GUID
and use the same GUID on the new host.

> > The behavior we want is for the guest to hot remove the MLX device
> > driver on resume, even if the MLX device was still present at suspend,
> > so that the host does not need this special pre-hibernate behavior. This
> > patch series may not be sufficient to ensure this, though. It just moves
> > things in the right direction, by handling the all-offers-delivered
> > message.

I'm not sure if it's a good idea to add new code to try to remove a
stale MLX VF since the scenario should not exist on Azure nowadays
(currently the host temporarily disables AccelNet during hibernation so there
should be no stale MLX VF upon resume).

On a local Hyper-V host, after a VM hibernates, we can manually disable
AccelNet (i.e. NIC SR-IOV) for the VM, and the VM will see a stale unresponsive
MLX VF upon resume. It would be tricky to clean up the VF gracefully:
we would have to wait for the resume callback in the Mellanox VF driver
to time out on the unresponsive VF (this can take 1 minute) and clean up the
related VMBus pass-through device backing the VF; what happens if a
host-initiated or VM-initiated hibernation is triggered during that minute?
I suspect there may be some tricky race condition issues, e.g. we may 
need to figure out how to synchronize the .resume with the .remove callbacks
of the MLX driver.
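
As a rough illustration of the kind of synchronization in question, here is a
Python toy model (not the MLX driver, and not a proposed fix) in which remove
waits until an in-flight resume has returned. All names are hypothetical.

```python
import threading


class VfDevice:
    """Toy stand-in for a VF device with resume and remove callbacks."""

    def __init__(self):
        self._cond = threading.Condition()
        self._resuming = False
        self.removed = False
        self.log = []

    def begin_resume(self):
        # resume starts talking to the (possibly unresponsive) VF.
        with self._cond:
            self._resuming = True

    def finish_resume(self, ok):
        # resume returns, either successfully or after its timeout.
        with self._cond:
            self._resuming = False
            self.log.append("resume:ok" if ok else "resume:timeout")
            self._cond.notify_all()

    def remove(self):
        # remove must not tear the device down under a running resume.
        with self._cond:
            while self._resuming:
                self._cond.wait()
            self.removed = True
            self.log.append("remove")
```

This only orders remove after resume. It says nothing about a hibernation
request that arrives while remove is still waiting, which is the harder part
of the question above.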

I think the general assumption is that the VM's configuration should not
change at all across hibernation, but it looks like this assumption turns out
to be false under some conditions from time to time... I hope the assumption
can always hold with OpenHCL.

Thanks,
Dexuan