Message-ID: <BN9PR11MB52769CDA8F6FA1FB4E766B308C459@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Thu, 30 Dec 2021 07:04:51 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: "Christopherson, Sean" <seanjc@...gle.com>
CC: "Liu, Jing2" <jing2.liu@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kselftest@...r.kernel.org" <linux-kselftest@...r.kernel.org>,
"tglx@...utronix.de" <tglx@...utronix.de>,
"mingo@...hat.com" <mingo@...hat.com>,
"bp@...en8.de" <bp@...en8.de>,
"dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"corbet@....net" <corbet@....net>,
"shuah@...nel.org" <shuah@...nel.org>,
"Nakajima, Jun" <jun.nakajima@...el.com>,
"jing2.liu@...ux.intel.com" <jing2.liu@...ux.intel.com>,
"Zeng, Guang" <guang.zeng@...el.com>,
"Wang, Wei W" <wei.w.wang@...el.com>,
"Zhong, Yang" <yang.zhong@...el.com>
Subject: RE: [PATCH v3 22/22] kvm: x86: Disable interception for IA32_XFD on
demand
> From: Tian, Kevin
> Sent: Thursday, December 30, 2021 9:28 AM
>
> > As posted, there is zero chance that the patches correctly handle #NM in L2,
> > because nested_vmx_l0_wants_exit() doesn't prevent an #NM from being
> > forwarded to L1. E.g. if L0 (KVM) and L1 both intercept #NM,
> > handle_exception_nm() will queue a #NM for injection and then synthesize a
> > nested VM-Exit, i.e. the #NM from L2 will get injected into L1 after the
> > nested exit.
>
> Yes, it's incorrect behavior.
>
> >
> > That also means handle_exception_nmi_irqoff() => handle_exception_nm() is
> > fatally broken on non-XFD hardware, as it will attempt
> > RDMSR(MSR_IA32_XFD_ERR) if L1 intercepts #NM, since
> > handle_exception_nmi_irqoff() will run before __vmx_handle_exit() =>
> > nested_vmx_reflect_vmexit() checks whether L0 or L1 should handle the exit.
>
> Ditto. Thanks for pointing out those facts that we obviously overlooked.
>
> So handle_exception_nm() should neither blindly read xfd_err (the #NM might
> be caused by L1 interception on a non-xfd platform) nor blindly queue a #NM
> exception (triggered in L2) to an L1 which intercepts #NM (and thus expects
> a vm-exit instead).
>
> The former suggests that reading xfd_err should be made conditional, similar
> to what we did in vmx_update_exception_bitmap(). The latter suggests that
> queueing the actual exception is better postponed until __vmx_handle_exit().
>
> We are looking at this change now.
>
> And once the #NM handler works correctly for interception by either L0 or
> L1, I'm not sure whether we still want to special-case L1 vs. L2 in
> vmx_update_exception_bitmap(), since in the end L0 wants to intercept #NM to
> save xfd_err for both L1 and L2 (if L2 is exposed with the xfd capability by
> L1).
>
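On the open above about special-casing L1 vs. L2 in
vmx_update_exception_bitmap(), a rough sketch of the non-special-cased form
could be as below (illustrative only, not actual kernel code; the
xfd_no_write_intercept flag is the one used in the snippets later in this
mail):

	/*
	 * Sketch: once xfd write interception has been disabled, L0 must
	 * intercept #NM to snapshot XFD_ERR, no matter whether vmcs01 or
	 * vmcs02 is being updated.
	 */
	if (vcpu->arch.xfd_no_write_intercept)
		eb |= (1u << NM_VECTOR);

i.e. the interception decision depends only on the xfd state, not on which
level the vCPU is running at.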
The new change is like below:

static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
{
	/*
	 * Save xfd_err to guest_fpu before interrupts are enabled, so the
	 * guest value is not clobbered by host activity before the guest
	 * has a chance to consume it.
	 *
	 * Since trapping #NM is only started once xfd write interception
	 * has been disabled, use that flag to guard the saving operation.
	 * This implies a no-op for a non-xfd #NM due to L1 interception.
	 *
	 * Queueing the exception is done in vmx_handle_exit.
	 */
	if (vcpu->arch.xfd_no_write_intercept)
		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}
In the final series it will first check vcpu->arch.guest_fpu.fpstate->xfd
before the disable-interception patch is applied, and then becomes the above
form, similar to your suggestion on vmx_update_exception_bitmap(). Whether to
check the msr_bitmap vs. an extra flag is an orthogonal open.
Then:

static int handle_exception_nmi(struct kvm_vcpu *vcpu)
{
	...
	if (is_machine_check(intr_info) || is_nmi(intr_info))
		return 1; /* handled by handle_exception_nmi_irqoff() */

	/*
	 * Queue the exception here instead of in handle_nm_fault_irqoff().
	 * This ensures the nested_vmx check is not skipped, so the vmexit
	 * can be reflected to L1 (when it intercepts #NM) before reaching
	 * this point.
	 */
	if (is_nm_fault(intr_info)) {
		kvm_queue_exception(vcpu, NM_VECTOR);
		return 1;
	}
	...
}
Then, regarding testing the non-AMX nested #NM usage, it might be difficult
to trigger it from a modern OS. As commented in the Linux #NM handler, it's
expected only for XFD, or for math emulation when the fpu is missing. So we
plan to run a selftest in L1 which sets CR0.TS and then touches the fpu
registers, and for the L1 kernel we will run two binaries, with one trapping
#NM and the other not.
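The guest side of such a selftest could be a simple sequence like below (a
rough sketch only; get_cr0()/set_cr0() and GUEST_DONE() are assumed
kvm-selftest helpers, and the exact names may differ):

static void guest_code(void)
{
	/* Set CR0.TS so the next x87/SSE instruction raises #NM */
	set_cr0(get_cr0() | X86_CR0_TS);

	/* Touch fpu state to trigger the #NM */
	asm volatile("fninit");

	GUEST_DONE();
}

Running this under the two L1 configurations should then exercise both the
L1-intercepts-#NM path and the reflect-to-L1 path.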
Please let us know if you have a better suggestion for the testing here.
Thanks
Kevin