Message-ID: <aEBQchT0cpCKkmQ6@google.com>
Date: Wed, 4 Jun 2025 06:56:02 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Borislav Petkov <bp@...en8.de>, Xin Li <xin@...or.com>, Dapeng Mi <dapeng1.mi@...ux.intel.com>
Subject: Re: [PATCH 08/28] KVM: nSVM: Use dedicated array of MSRPM offsets to
merge L0 and L1 bitmaps
On Wed, Jun 04, 2025, Chao Gao wrote:
> >diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> >index 89a77f0f1cc8..e53020939e60 100644
> >--- a/arch/x86/kvm/svm/nested.c
> >+++ b/arch/x86/kvm/svm/nested.c
> >@@ -184,6 +184,64 @@ void recalc_intercepts(struct vcpu_svm *svm)
> > }
> > }
> >
> >+static int nested_svm_msrpm_merge_offsets[9] __ro_after_init;
>
> I understand how the array size (i.e., 9) was determined :). But, adding a
> comment explaining this would be quite helpful
Yeah, I'll write a comment explaining what all is going on.
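Something along these lines, assuming I've done the math right (two
intercept bits per MSR means each 4-byte chunk of the MSRPM covers 16
MSRs):

  /*
   * The MSRPM is an array of 4-byte chunks, with two intercept bits
   * per MSR, i.e. 16 MSRs per chunk.  Several of the MSRs above share
   * a chunk, e.g. STAR, LSTAR, CSTAR, SYSCALL_MASK, and EFER all land
   * in the same chunk, as do the SYSENTER MSRs and the LASTBRANCH/
   * LASTINT MSRs.  All told, the list spans exactly 9 unique chunks,
   * hence the size of the offsets array.
   */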
> >+static int nested_svm_nr_msrpm_merge_offsets __ro_after_init;
> >+
> >+int __init nested_svm_init_msrpm_merge_offsets(void)
> >+{
> >+ const u32 merge_msrs[] = {
> >+ MSR_STAR,
> >+ MSR_IA32_SYSENTER_CS,
> >+ MSR_IA32_SYSENTER_EIP,
> >+ MSR_IA32_SYSENTER_ESP,
> >+ #ifdef CONFIG_X86_64
> >+ MSR_GS_BASE,
> >+ MSR_FS_BASE,
> >+ MSR_KERNEL_GS_BASE,
> >+ MSR_LSTAR,
> >+ MSR_CSTAR,
> >+ MSR_SYSCALL_MASK,
> >+ #endif
> >+ MSR_IA32_SPEC_CTRL,
> >+ MSR_IA32_PRED_CMD,
> >+ MSR_IA32_FLUSH_CMD,
>
> MSR_IA32_DEBUGCTLMSR is missing, but it's benign since it shares the same
> offset as MSR_IA32_LAST* below.
Gah. Once all is said and done, it's not supposed to be in this list because it's
only passed through for SEV-ES guests, but my intent was to keep it in this patch,
and then remove it along with XSS, EFER, PAT, GHCB, and TSC_AUX in the next.
This made me realize that merging in chunks has a novel flaw: if there is an MSR
that KVM *doesn't* want to give L2 access to, then KVM needs to ensure its offset
isn't processed, i.e. that there isn't a "collision" with another MSR. I don't
think it's a major concern, because the x2APIC MSRs are nicely isolated, and off
the top of my head I can't think of any MSRs that fall into that bucket. But it's
something worth calling out in a comment, at least.
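To illustrate, with 16 MSRs per 4-byte chunk:

  MSR_IA32_DEBUGCTLMSR      = 0x1d9, 0x1d9 / 16 == 29
  MSR_IA32_LASTBRANCHFROMIP = 0x1db, 0x1db / 16 == 29

i.e. dropping DEBUGCTL from the list doesn't keep its chunk from being
processed, as the LASTBRANCH/LASTINT MSRs drag that chunk in anyway.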
> I'm a bit concerned that we might overlook adding new MSRs to this array in the
> future, which could lead to tricky bugs. But I have no idea how to avoid this.
Me either. One option would be to use wrapper macros for the interception helpers
to fill an array at compile time (similar to how kernel exception fixup works),
but (a) it requires modifications to the linker scripts to generate the arrays,
(b) is quite confusing/complex, and (c) it doesn't actually solve the problem,
it just inverts the problem. Because as above, there are MSRs we *don't* want
to expose to L2, and so we'd need to add yet more code to filter those out.
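For posterity, a rough sketch of the section trick (all names are
hypothetical, and the necessary linker script additions are omitted):

  /* Record each passed-through MSR in a dedicated section. */
  #define SVM_PASSTHROUGH_MSR(msr)                                \
          static const u32 __used                                 \
          __section("svm_passthrough_msrs")                       \
          __svm_passthrough_##msr = (msr)

  /* Section bounds, emitted by the (modified) linker script. */
  extern const u32 __start_svm_passthrough_msrs[];
  extern const u32 __stop_svm_passthrough_msrs[];

  /* At init, walk the section instead of a hand-maintained array. */
  static int __init svm_init_passthrough_offsets(void)
  {
          const u32 *msr;

          for (msr = __start_svm_passthrough_msrs;
               msr < __stop_svm_passthrough_msrs; msr++)
                  /* compute and de-dup svm_msrpm_offset(*msr) */;

          return 0;
  }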
And the failure mode for the inverted case would be worse, because if we missed
an MSR, KVM would incorrectly give L2 access to an MSR. Whereas with the current
approach, a missed MSR simply means L2 gets a slower path; but functionally, it's
fine (and it has to be fine, because KVM can't force L1 to disable interception).
> Removing this array and iterating over direct_access_msrs[] directly is an
> option but it contradicts this series as one of its purposes is to remove
> direct_access_msrs[].
Using direct_access_msrs[] wouldn't solve the problem either, because nothing
ensures _that_ array is up-to-date either.
The best idea I have is to add a test that verifies the MSRs that are supposed
to be passed through actually are passed through. It's still effectively manual
checking, but it would require us to screw up twice, i.e. forget to update both
the array and the test. The problem is that there's no easy and foolproof way to
verify that an MSR is passed through in a selftest.
E.g. it would be possible to precisely detect L2 => L0 MSR exits via a BPF program,
but incorporating a BPF program into a KVM selftest just to detect exits isn't
something I'm keen on doing (or maintaining).
Using the "exits" stat isn't foolproof due to NMIs (IRQs can be accounted for via
"irq_exits", and to a lesser extent page faults (especially if shadow paging is
in use).
If KVM provided an "msr_exits" stat, it would be trivial to verify interception
via a selftest, but I can't quite convince myself that MSR exits are interesting
enough to warrant their own stat.
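If it did exist, the selftest side would boil down to something like the
below (the stat is hypothetical, and the exact helper spelling is a
guess):

  /* Snapshot the (hypothetical) "msr_exits" stat before running L2. */
  uint64_t exits = vcpu_get_stat(vcpu, msr_exits);

  /* Run L2 and have it RDMSR/WRMSR the supposedly passed-through MSR. */
  vcpu_run(vcpu);

  TEST_ASSERT(vcpu_get_stat(vcpu, msr_exits) == exits,
              "MSR access should NOT exit to L0 when passed through");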
> >+ MSR_IA32_LASTBRANCHFROMIP,
> >+ MSR_IA32_LASTBRANCHTOIP,
> >+ MSR_IA32_LASTINTFROMIP,
> >+ MSR_IA32_LASTINTTOIP,
> >+
> >+ MSR_IA32_XSS,
> >+ MSR_EFER,
> >+ MSR_IA32_CR_PAT,
> >+ MSR_AMD64_SEV_ES_GHCB,
> >+ MSR_TSC_AUX,
> >+ };
>
>
> >
> > if (kvm_vcpu_read_guest(vcpu, offset, &value, 4))
> >diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> >index 1c70293400bc..84dd1f220986 100644
> >--- a/arch/x86/kvm/svm/svm.c
> >+++ b/arch/x86/kvm/svm/svm.c
> >@@ -5689,6 +5689,10 @@ static int __init svm_init(void)
> > if (!kvm_is_svm_supported())
> > return -EOPNOTSUPP;
> >
> >+ r = nested_svm_init_msrpm_merge_offsets();
> >+ if (r)
> >+ return r;
> >+
>
> If the offset array is used for nested virtualization only, how about guarding
> this with nested virtualization? For example, in svm_hardware_setup():
Good idea, I'll do that in the next version.
> if (nested) {
> r = nested_svm_init_msrpm_merge_offsets();
> if (r)
> goto err;
>
> pr_info("Nested Virtualization enabled\n");
> kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE);
> }
>
>
> > r = kvm_x86_vendor_init(&svm_init_ops);
> > if (r)
> > return r;
> >diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> >index 909b9af6b3c1..0a8041d70994 100644
> >--- a/arch/x86/kvm/svm/svm.h
> >+++ b/arch/x86/kvm/svm/svm.h
> >@@ -686,6 +686,8 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
> > return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
> > }
> >
> >+int __init nested_svm_init_msrpm_merge_offsets(void);
> >+
> > int enter_svm_guest_mode(struct kvm_vcpu *vcpu,
> > u64 vmcb_gpa, struct vmcb *vmcb12, bool from_vmrun);
> > void svm_leave_nested(struct kvm_vcpu *vcpu);
> >--
> >2.49.0.1204.g71687c7c1d-goog
> >