linux-kernel - RE: [patch 13/31] x86/fpu: Move KVMs FPU swapping to FPU core

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BYAPR11MB3256A20F6BB9218BDB5B7988A9B89@BYAPR11MB3256.namprd11.prod.outlook.com>
Date:   Thu, 14 Oct 2021 11:21:37 +0000
From:   "Liu, Jing2" <jing2.liu@...el.com>
To:     Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>
CC:     "x86@...nel.org" <x86@...nel.org>,
        "Bae, Chang Seok" <chang.seok.bae@...el.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "Arjan van de Ven" <arjan@...ux.intel.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "Nakajima, Jun" <jun.nakajima@...el.com>,
        Jing Liu <jing2.liu@...ux.intel.com>,
        "seanjc@...gle.com" <seanjc@...gle.com>,
        "Cooper, Andrew" <andrew.cooper3@...rix.com>
Subject: RE: [patch 13/31] x86/fpu: Move KVMs FPU swapping to FPU core

On 10/14/2021 5:01 PM, Paolo Bonzini wrote:

> On 14/10/21 10:02, Liu, Jing2 wrote:
> >> In principle I don't like it very much; it would be nicer to say "you
> >> enable it for QEMU itself via arch_prctl(ARCH_SET_STATE_ENABLE), and
> >> for the guests via ioctl(KVM_SET_CPUID2)".  But I can see why you
> >> want to keep things simple, so it's not a strong objection at all.
> >
> > Does this mean that KVM allocate 3 buffers via
> > 1) Qemu's request, instead of via 2) guest XCR0 trap?
> 
> Based on the input from Andy and Thomas, the new way would be like this:
> 
> 1) host_fpu must always be checked for reallocation in kvm_load_guest_fpu
> (or in the FPU functions that it calls, that depends on the rest of Thomas's
> patches).  That's because arch_prctl can enable AMX for QEMU at any point
> after KVM_CREATE_VCPU.

For Qemu's XFD, I'd like to confirm that:
Since the arch_prctl() onlys add current->group_leader->thread.fpu's  state_perm,
__state_size, (current->thread.fpu.* is not changed), thus in
kvm_load_guest_fpu, host_fpu->xfd is always 1. That is to say, Qemu's arch_prctl()
doesn't change any copies of XFD.

> 
> 2) every use of vcpu->arch.guest_supported_xcr0 is changed to only include
> those dynamic-feature bits that were enabled via arch_prctl.
> That is, something like:
> 
> static u64 kvm_guest_supported_cr0(struct kvm_vcpu *vcpu) {
> 	return vcpu->arch.guest_supported_xcr0 &
> 		(~xfeatures_mask_user_dynamic | \
> 		 current->thread.fpu.dynamic_state_perm);
> }
> 
> 3) Even with passthrough disabled, the guest can run with XFD set to
> vcpu->arch.guest_xfd (and likewise for XFD_ERR) which is much simpler
> than trapping #NM.  The traps for writing XCR0 and XFD are used to allocate
> dynamic state for guest_fpu, and start the passthrough of XFD and XFD_ERR.
> What we need is:
> 
> - if a dynamic state has XCR0[n]=0, bit n will never be set in XFD_ERR and the
> state will never be dirtied by the guest.
> 
> - if a dynamic state has XCR0[n]=1, but all enabled dynamic states have
> XFD[n]=1, the guest is not able to dirty any dynamic XSAVE state, because
> they all have either XCR0[n]=0 or XFD[n]=1.  An attempt to do so will cause an
> #NM trap and set the bit in XFD_ERR.
> 
> - if a dynamic state has XCR0[n]=1 and XFD[n]=0, the state for bit n is
> allocated in guest_fpu, and it can also disable the vmexits for XFD and
> XFD_ERR.
> 

Got it, the principle is once XCR0[n]=1 and XFD[n]=0, then guest is allowed
to use the dynamic XSAVE state, thus KVM must prepare all things well
before. This probably happens shortly after guest #NM.

Only one thing: it seems we assume that vcpu->arch.xfd is guest runtime
value. And before guest initializes XFD, KVM provides
vcpu->arch.xfd[18]=1, right? But the spec asks XFD reset value as zero.
If so, between guest init XCR0 to 1 and init XFD to 1, it's XCR0[n]=1 and
XFD[n]=0. If a guest never init XFD and directly use dynamic state...

Or do we want to provide guest a XFD[18]=1 value at the very beginning?

> Therefore:
> 
> - if passthrough is disabled, the XCR0 and XFD write traps can check
> guest_xcr0 & ~guest_xfd.  If it includes a dynamic state bit, dynamic state is
> allocated for all bits enabled in guest_xcr0 and passthrough is started; this
> should happen shortly after the guest gets its first #NM trap for AMX.
> 
> - if passthrough is enabled, the XCR0 write trap must still ensure that
> dynamic state is allocated for all bits enabled in guest_xcr0.
> 
> So something like this pseudocode is called by both XCR0 and XFD writes:
> 
> int kvm_alloc_fpu_dynamic_features(struct kvm_vcpu *vcpu) {
> 	u64 allowed_dynamic = current->thread.fpu.dynamic_state_perm;
> 	u64 enabled_dynamic =
> 		vcpu->arch.xcr0 & xfeatures_mask_user_dynamic;
> 
> 	/* All dynamic features have to be arch_prctl'd first.  */
> 	WARN_ON_ONCE(enabled_dynamic & ~allowed_dynamic);
> 
> 	if (!vcpu->arch.xfd_passthrough) {
> 		/* All dynamic states will #NM?  Wait and see.  */
> 		if ((enabled_dynamic & ~vcpu->arch.xfd) == 0)
Here, when guest init XCR0 to 1, vcpu->arch.xfd should be 1
otherwise XCR0 trap makes passthrough and allocates buffer, which
is not what we want.

> 			return 0;
> 
> 		kvm_x86_ops.enable_xfd_passthrough(vcpu);
> 	}
> 
> 	/* current->thread.fpu was already handled by arch_prctl.  */
It seems so far, arch_prctl does not change current->thread.fpu,
only #NM handler itself does it. We here alloc current too.

Thanks,
Jing
> 	return fpu_alloc_features(vcpu->guest_fpu,
> 		vcpu->guest_fpu.dynamic_state_perm | enabled_dynamic); }
> 
> Paolo