linux-kernel - Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b1ddb439-9e28-4a58-ba86-0395bfc081e0@redhat.com>
Date: Wed, 13 Nov 2024 18:59:53 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Doug Covelli <doug.covelli@...adcom.com>
Cc: Zack Rusin <zack.rusin@...adcom.com>,
 Sean Christopherson <seanjc@...gle.com>, kvm <kvm@...r.kernel.org>,
 Jonathan Corbet <corbet@....net>, Thomas Gleixner <tglx@...utronix.de>,
 Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 the arch/x86 maintainers <x86@...nel.org>, "H. Peter Anvin" <hpa@...or.com>,
 Shuah Khan <shuah@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
 Arnaldo Carvalho de Melo <acme@...hat.com>,
 Isaku Yamahata <isaku.yamahata@...el.com>, Joel Stanley <joel@....id.au>,
 Linux Doc Mailing List <linux-doc@...r.kernel.org>,
 "Kernel Mailing List, Linux" <linux-kernel@...r.kernel.org>,
 linux-kselftest <linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific
 hypercalls

On 11/13/24 17:24, Doug Covelli wrote:
>> No worries, you're not hijacking :) The only reason is that it would
>> be more code for a seldom used feature and anyway with worse performance.
>> (To be clear, CR8 based accesses are allowed, but stores cause an exit
>> in order to check the new TPR against IRR. That's because KVM's API
>> does not have an equivalent of the TPR threshold as you point out below).
> 
> I have not really looked at the code but it seems like it could also
> simplify things as CR8 would be handled more uniformly regardless of
> who is virtualizing the local APIC.

Not much because CR8 basically does not exist at all (it's just a byte
in memory) with userspace APIC.  So it's not easy to make it simpler, even
though it's less uniform.

That said, there is an optimization: you only get KVM_EXIT_SET_TPR if
CR8 decreases.

>>> Also I could not find these documented anywhere but with MSFT's APIC our monitor
>>> relies on extensions for trapping certain events such as INIT/SIPI plus LINT0
>>> and SVR writes:
>>>
>>> UINT64 X64ApicInitSipiExitTrap    : 1; // WHvRunVpExitReasonX64ApicInitSipiTrap
>>> UINT64 X64ApicWriteLint0ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>>> UINT64 X64ApicWriteLint1ExitTrap  : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>>> UINT64 X64ApicWriteSvrExitTrap    : 1; // WHvRunVpExitReasonX64ApicWriteTrap
>>
>> There's no need for this in KVM's in-kernel APIC model. INIT and
>> SIPI are handled in the hypervisor and you can get the current
>> state of APs via KVM_GET_MPSTATE. LINT0 and LINT1 are injected
>> with KVM_INTERRUPT and KVM_NMI respectively, and they obey IF/PPR
>> and NMI blocking respectively, plus the interrupt shadow; so
>> there's no need for userspace to know when LINT0/LINT1 themselves
>> change. The spurious interrupt vector register is also handled
>> completely in kernel.
> 
> I realize that KVM can handle LINT0/SVR updates themselves but our
> interrupt subsystem relies on knowing the current values of these
> registers even when not virtualizing the local APIC.  I suppose we
> could use KVM_GET_LAPIC to sync things up on demand but that seems
> like it might nor be great from a performance point of view.

Ah no, you're right---you want to track the CPU that has ExtINT enabled
and send KVM_INTERRUPT to that one, I guess?  And you need the spurious
vector registers because writes can set the mask bit in LINTx, but
essentially you want to trap LINT0 changes.

Something like this (missing the KVM_ENABLE_CAP and KVM_CHECK_EXTENSION
code) is good, feel free to include it in your v2 (Co-developed-by
and Signed-off-by me):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5fb29ca3263b..b7dd89c99613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -122,6 +122,7 @@
  #define KVM_REQ_HV_TLB_FLUSH \
  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
+#define KVM_REQ_REPORT_LINT0_ACCESS	KVM_ARCH_REQ(35)
  
  #define CR0_RESERVED_BITS                                               \
  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -775,6 +776,7 @@ struct kvm_vcpu_arch {
  	u64 smi_count;
  	bool at_instruction_boundary;
  	bool tpr_access_reporting;
+	bool lint0_access_reporting;
  	bool xfd_no_write_intercept;
  	u64 ia32_xss;
  	u64 microcode_version;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 88dc43660d23..0e070f447aa2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1561,6 +1561,21 @@ static u32 apic_get_tmcct(struct kvm_lapic *apic)
  			      apic->divide_count));
  }
  
+static void __report_lint0_access(struct kvm_lapic *apic, u32 value)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	struct kvm_run *run = vcpu->run;
+
+	kvm_make_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu);
+	run->lint0_access.value = value;
+}
+
+static inline void report_lint0_access(struct kvm_lapic *apic, u32 value)
+{
+	if (apic->vcpu->arch.lint0_access_reporting)
+		__report_lint0_access(apic, value);
+}
+
  static void __report_tpr_access(struct kvm_lapic *apic, bool write)
  {
  	struct kvm_vcpu *vcpu = apic->vcpu;
@@ -2312,8 +2327,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
  			int i;
  
  			for (i = 0; i < apic->nr_lvt_entries; i++) {
-				kvm_lapic_set_reg(apic, APIC_LVTx(i),
-					kvm_lapic_get_reg(apic, APIC_LVTx(i)) | APIC_LVT_MASKED);
+				u32 old = kvm_lapic_get_reg(apic, APIC_LVTx(i));
+				kvm_lapic_set_reg(apic, APIC_LVTx(i), old | APIC_LVT_MASKED);
+				if (i == 0 && !(old & APIC_LVT_MASKED))
+					report_lint0_access(apic, old | APIC_LVT_MASKED);
  			}
  			apic_update_lvtt(apic);
  			atomic_set(&apic->lapic_timer.pending, 0);
@@ -2352,6 +2369,8 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
  		if (!kvm_apic_sw_enabled(apic))
  			val |= APIC_LVT_MASKED;
  		val &= apic_lvt_mask[index];
+		if (index == 0 && val != kvm_lapic_get_reg(apic, reg))
+			report_lint0_access(apic, val);
  		kvm_lapic_set_reg(apic, reg, val);
  		break;
  	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d0d3dc3b7ef6..2b039b372c3f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10879,6 +10879,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  			kvm_vcpu_flush_tlb_guest(vcpu);
  #endif
  
+		if (kvm_check_request(KVM_REQ_REPORT_LINT0_ACCESS, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_LINT0_ACCESS;
+			r = 0;
+			goto out;
+		}
  		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
  			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
  			r = 0;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..ec97727f9de4 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -178,6 +178,7 @@ struct kvm_xen_exit {
  #define KVM_EXIT_NOTIFY           37
  #define KVM_EXIT_LOONGARCH_IOCSR  38
  #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_LINT0_ACCESS     40
  
  /* For KVM_EXIT_INTERNAL_ERROR */
  /* Emulate instruction failed. */
@@ -283,6 +284,10 @@ struct kvm_run {
  				__u64 flags;
  			};
  		} hypercall;
+		/* KVM_EXIT_LINT0_ACCESS */
+		struct {
+			__u32 value;
+		} lint0_access;
  		/* KVM_EXIT_TPR_ACCESS */
  		struct {
  			__u64 rip;


For LINT1, it should be less performance critical; if it's possible
to just go through all vCPUs, and do KVM_GET_LAPIC to check who you
should send a KVM_NMI to, then I'd do that.  I'd also accept a patch
that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor
if it's useful for you.

And since I've been proven wrong already, what do you need INIT/SIPI for?

Paolo