linux-kernel - Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240416014931.GW3039520@ls.amr.corp.intel.com>
Date: Mon, 15 Apr 2024 18:49:31 -0700
From: Isaku Yamahata <isaku.yamahata@...el.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Rick P Edgecombe <rick.p.edgecombe@...el.com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>,
	Isaku Yamahata <isaku.yamahata@...el.com>,
	Kai Huang <kai.huang@...el.com>,
	"federico.parola@...ito.it" <federico.parola@...ito.it>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>,
	"dmatlack@...gle.com" <dmatlack@...gle.com>,
	"michael.roth@....com" <michael.roth@....com>,
	"pbonzini@...hat.com" <pbonzini@...hat.com>,
	isaku.yamahata@...ux.intel.com
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for
 KVM_MAP_MEMORY

On Mon, Apr 15, 2024 at 02:17:02PM -0700,
Sean Christopherson <seanjc@...gle.com> wrote:

> > > - Return error on guest mode or SMM mode:  Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> > 
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> 
> And the fact there are so variables to control (MAXPHADDR, SMM, and guest_mode)
> basically invalidates the argument that returning an error makes the ioctl() hard
> to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.
> 
> Literally the only thing userspace needs to do is set CPUID to implicitly select
> between 4-level and 5-level paging.  If userspace wants to pre-map memory during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state.  In and of itself, that doesn't seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).
> 
> I would describe the overall cons for this patch versus returning an error
> differently.  Switching MMU state puts the complexity in the kernel.  Returning
> an error punts any complexity to userspace.  Specifically, anything that KVM can
> do regarding vCPU state to get the right MMU, userspace can do too.
>  
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Ok, here is a experimental patch on top of the 7/10 to return error.  Is this
a direction? or do we want to invoke KVM page fault handler without any check?

I can see the following options.

- Error if vCPU is in SMM mode or guest mode: This patch
  Defer the decision until the use cases come up.  We can utilize
  KVM_CAP_MAP_MEMORY and struct kvm_map_memory.flags for future
  enhancement.
  Pro: Keep room for future enhancement for unclear use cases to defer
       the decision.
  Con: The use space VMM has to check/switch the vCPU mode.

- No check of vCPU mode and go on
  Pro: It works.
  Con: Unclear how the uAPI should be without concrete use cases.

- Always populate with L1 GPA:
  This is a bad idea.

---
 arch/x86/kvm/x86.c | 32 +++++++++-----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ba9c1720ac9..2f3ceda5c225 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,10 +5871,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 			     struct kvm_memory_mapping *mapping)
 {
-	struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
 	u64 end, error_code = 0;
 	u8 level = PG_LEVEL_4K;
-	bool is_smm;
 	int r;
 
 	/*
@@ -5884,25 +5882,21 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	if (!tdp_enabled)
 		return -EOPNOTSUPP;
 
-	/* Force to use L1 GPA despite of vcpu MMU mode. */
-	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
-	if (is_smm ||
-	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
-	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
-		vcpu->arch.hflags &= ~HF_SMM_MASK;
-		mmu = vcpu->arch.mmu;
-		walk_mmu = vcpu->arch.walk_mmu;
-		vcpu->arch.mmu = &vcpu->arch.root_mmu;
-		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
-		kvm_mmu_reset_context(vcpu);
-	}
+	/*
+	 * SMM mode results in populating SMM memory space with memslots id = 1.
+	 * guest mode results in populating with L2 GPA.
+	 * Don't support those cases for now and punt them for the future
+	 * discussion.
+	 */
+	if (is_smm(vcpu) || is_guest_mode(vcpu))
+		return -EOPNOTSUPP;
 
 	/* reload is optimized for repeated call. */
 	kvm_mmu_reload(vcpu);
 
 	r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
 	if (r)
-		goto out;
+		return r;
 
 	/* mapping->base_address is not necessarily aligned to level-hugepage. */
 	end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
@@ -5910,14 +5904,6 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	mapping->size -= end - mapping->base_address;
 	mapping->base_address = end;
 
-out:
-	/* Restore MMU state. */
-	if (is_smm || mmu) {
-		vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
-		vcpu->arch.mmu = mmu;
-		vcpu->arch.walk_mmu = walk_mmu;
-		kvm_mmu_reset_context(vcpu);
-	}
 	return r;
 }
 
-- 
2.43.2

-- 
Isaku Yamahata <isaku.yamahata@...el.com>