Message-ID: <Zh2ZTt4tXXg0f0d9@google.com>
Date: Mon, 15 Apr 2024 14:17:02 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Rick P Edgecombe <rick.p.edgecombe@...el.com>
Cc: "kvm@...r.kernel.org" <kvm@...r.kernel.org>, Isaku Yamahata <isaku.yamahata@...el.com>,
Kai Huang <kai.huang@...el.com>,
"federico.parola@...ito.it" <federico.parola@...ito.it>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"isaku.yamahata@...il.com" <isaku.yamahata@...il.com>, "dmatlack@...gle.com" <dmatlack@...gle.com>,
"michael.roth@....com" <michael.roth@....com>, "pbonzini@...hat.com" <pbonzini@...hat.com>
Subject: Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY
On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> I wouldn't call myself much of an expert on nested, but...
>
> On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@...el.com wrote:
> > There are several options to populate L1 GPA regardless of vCPU mode.
> > - Switch vCPU MMU only: This patch.
> > Pros: Concise implementation.
> > Cons: Heavily dependent on the KVM MMU implementation.
Con: Makes it impossible to support other MMUs/modes without extending the uAPI.
> Is switching just the MMU enough here? Won't the MTRRs and other vcpu bits be
> wrong?
>
> > - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
> > Use __get/set_sregs2() to switch to/from SMM mode.
> > Pros: Straightforward.
> > Cons: This may cause unintended side effects.
>
> Cons make sense.
>
> > - Refactor KVM page fault handler not to pass vCPU. Pass around necessary
> > parameters and struct kvm.
> > Pros: The end result will have clearly no side effects.
> > Cons: This will require big refactoring.
>
> But doesn't the fault handler need the vCPU state?
Ignoring guest MTRRs, which will hopefully soon be a non-issue, no. There are
only six possible roots if TDP is enabled:
1. 4-level !SMM !guest_mode
2. 4-level SMM !guest_mode
3. 5-level !SMM !guest_mode
4. 5-level SMM !guest_mode
5. 4-level !SMM guest_mode
6. 5-level !SMM guest_mode
4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU eliminates
the SMM and guest_mode issues. If there is per-vCPU state that makes its way into
the TDP page tables, then we have problems, because it means that there is per-vCPU
state in per-VM structures that isn't accounted for.
There are a few edge cases where KVM treads carefully, e.g. if the fault is to
the vCPU's APIC-access page, but KVM manually handles those to avoid consuming
per-vCPU state.
That said, I think this option is effectively 1b, because dropping the SMM vs.
guest_mode state has the same uAPI problems as forcibly swapping the MMU, it's
just a different way of doing so.
The first question to answer is: do we want to return an error, or "silently"
install mappings for !SMM, !guest_mode? This option becomes relevant only
_if_ we want to unconditionally install mappings for the "base" mode.
> > - Return error on guest mode or SMM mode: Without this patch.
> > Pros: No additional patch.
> > Cons: Difficult to use.
>
> Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> there shouldn't be an issue. If so, maybe this last one is not so horrible.
And the fact that there are so many variables to control (MAXPHYADDR, SMM, and
guest_mode)
basically invalidates the argument that returning an error makes the ioctl() hard
to use. I can imagine it might be hard to squeeze this ioctl() into QEMU's
existing code, but I don't buy that the ioctl() itself is hard to use.
Literally the only thing userspace needs to do is set CPUID to implicitly select
between 4-level and 5-level paging. If userspace wants to pre-map memory during
live migration, or when jump-starting the guest with pre-defined state, simply
pre-map memory before stuffing guest state. In and of itself, that doesn't seem
difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).
I would describe the overall cons for this patch versus returning an error
differently. Switching MMU state puts the complexity in the kernel. Returning
an error punts any complexity to userspace. Specifically, anything that KVM can
do regarding vCPU state to get the right MMU, userspace can do too.
Add on that silently doing things that effectively ignore guest state usually
ends badly, and I don't see a good argument for this patch (or any variant
thereof).