[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
<adb07a02b3923eeb49f425d38509b340f4837e17.camel@cyberus-technology.de>
Date: Wed, 17 Apr 2024 13:05:02 +0000
From: Thomas Prescher <thomas.prescher@...erus-technology.de>
To: "seanjc@...gle.com" <seanjc@...gle.com>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "x86@...nel.org"
<x86@...nel.org>, "dave.hansen@...ux.intel.com"
<dave.hansen@...ux.intel.com>, "hpa@...or.com" <hpa@...or.com>, Julian
Stecklina <julian.stecklina@...erus-technology.de>, "tglx@...utronix.de"
<tglx@...utronix.de>, "bp@...en8.de" <bp@...en8.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "mingo@...hat.com"
<mingo@...hat.com>
Subject: Re: [PATCH 1/2] KVM: nVMX: fix CR4_READ_SHADOW when L0 updates CR4
during a signal
On Tue, 2024-04-16 at 11:07 -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > On Tue, 2024-04-16 at 08:17 -0700, Sean Christopherson wrote:
> > > On Tue, Apr 16, 2024, Thomas Prescher wrote:
> > > > Hi Sean,
> > > >
> > > > On Tue, 2024-04-16 at 07:35 -0700, Sean Christopherson wrote:
> > > > > On Tue, Apr 16, 2024, Julian Stecklina wrote:
> > > > > > From: Thomas Prescher
> > > > > > <thomas.prescher@...erus-technology.de>
> > > > > >
> > > > > > This issue occurs when the kernel is interrupted by a
> > > > > > signal while
> > > > > > running a L2 guest. If the signal is meant to be delivered
> > > > > > to the L0
> > > > > > VMM, and L0 updates CR4 for L1, i.e. when the VMM sets
> > > > > > KVM_SYNC_X86_SREGS in kvm_run->kvm_dirty_regs, the kernel
> > > > > > programs an
> > > > > > incorrect read shadow value for L2's CR4.
> > > > > >
> > > > > > The result is that the guest can read a value for CR4 where
> > > > > > bits from
> > > > > > L1 have leaked into L2.
> > > > >
> > > > > No, this is a userspace bug. If L2 is active when userspace
> > > > > stuffs
> > > > > register state, then from KVM's perspective the incoming
> > > > > value is L2's
> > > > > value. E.g. if userspace *wants* to update L2 CR4 for
> > > > > whatever
> > > > > reason, this patch would result in L2 getting a stale value,
> > > > > i.e. the
> > > > > value of CR4 at the time of VM-Enter.
> > > > >
> > > > > And even if userspace wants to change L1, this patch is
> > > > > wrong, as KVM
> > > > > is writing vmcs02.GUEST_CR4, i.e. is clobbering the L2 CR4
> > > > > that was
> > > > > programmed by L1, *and* is dropping the CR4 value that
> > > > > userspace wanted
> > > > > to stuff for L1.
> > > > >
> > > > > To fix this, your userspace needs to either wait until L2
> > > > > isn't active,
> > > > > or force the vCPU out of L2 (which isn't easy, but it's
> > > > > doable if
> > > > > absolutely necessary).
> > > >
> > > > What you say makes sense. Is there any way for
> > > > userspace to detect whether L2 is currently active after
> > > > returning from KVM_RUN? I couldn't find anything in the
> > > > official
> > > > documentation https://docs.kernel.org/virt/kvm/api.html
> > > >
> > > > Can you point me into the right direction?
> > >
> > > Hmm, the only way to query that information is via
> > > KVM_GET_NESTED_STATE,
> > > which is a bit unfortunate as that is a fairly "heavy" ioctl().
>
> Hur dur, I forgot that KVM provides a "guest_mode" stat. Userspace
> can do
> KVM_GET_STATS_FD on the vCPU FD to get a file handle to the binary
> stats, and
> then you wouldn't need to call back into KVM just to query
> guest_mode.
>
> Ah, and I also forgot that we have kvm_run.flags, so adding
> KVM_RUN_X86_GUEST_MODE
> would also be trivial (I almost suggested it earlier, but didn't want
> to add a
> new field to kvm_run without a very good reason).
Thanks for the pointers. This is really helpful.
I tried the "guest_mode" stat as you suggested and it solves the
immediate issue we have with VirtualBox/KVM.
What I don't understand is that we do not get the effective CR4 value
of the L2 guest in kvm_run.s.regs.sregs.cr4. Instead, userland sees the
contents of Vmcs::GUEST_CR4. Shouldn't this be the combination of
GUEST_CR4, GUEST_CR4_MASK and CR4_READ_SHADOW, i.e. what L2 actually
sees as CR4 value?
If this is expected, can you please explain the reasoning behind this
interface decision? For me, it does not make sense that writing back
the same value we receive at exit time causes a change in what L2 sees
for CR4.
Another question is: when we want to save the VM state during a
savevm/loadvm cycle, we kick all vCPUs via a singal and save their
state. If any vCPU runs in L2 at the time of the exit, we somehow need
to let it continue to run until we get an exit with the L1 state. Is
there a mechanism to help us here?
>
> > Indeed. What still does not make sense to me is that userspace
> > should be able
> > to modify the L2 state. From what I can see, even in this scenario,
> > L0 exits
> > with the L1 state.
>
> No, KVM exits with L2. Or at least, KVM is supposed to exit with L2
> state. Exiting
> with L1 state would actually be quite difficult, as KVM would need to
> manually
> switch to vmcs01, pull out state, and then switch back to vmcs02.
> And for GPRs,
> KVM doesn't have L1 state because most GPRs are "volatile", i.e. are
> clobbered by
> VM-Enter and thus need to be manually managed by the hypervisor.
>
> I assume you're using kvm_run.kvm_valid_regs to instruct KVM to save
> vCPU state
> when exiting to userspace? That path grabs state straight from the
> vCPU, without
> regard to is_guest_mode(). If you're seeing L1 state, then there's
> likely a bug
> somewhere, likely in userspace again. While valid_regs
> kvm_run.kvm_valid_regs
> might not be heavily used, the underlying code is very heavily used
> for doing
> save/restore while L2 is active, e.g. for live migration.
>
> > So what you are saying is that userspace should be allowed to
> > change L2 even
> > if it receives the architectural state from L1?
>
> No, see above.
>
> > What would be the use-case for this scenario?
> >
> > If the above is true, I think we should document this explicitly
> > because it's not obvious, at least not for me ;)
> >
> > How would you feel if we extend struct kvm_run with a
> > nested_guest_active flag? This should be fairly cheap and would
> > make
> > the integration into VirtualBox/KVM much easier. We could also only
> > sync this flag explicitly, e.g. with a KVM_SYNC_X86_NESTED_GUEST?
>
> Heh, see above regarding KVM_RUN_X86_GUEST_MODE.
Powered by blists - more mailing lists