linux-kernel - Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after VMRUN of L2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mxn6y6og34ejncnsvdapcoep4ewcnwnheszhwkp2undkqcu5zv@bpmseexuug5z>
Date: Wed, 11 Feb 2026 00:27:42 +0000
From: Yosry Ahmed <yosry.ahmed@...ux.dev>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] KVM: nSVM: Sync next_rip to cached vmcb12 after
 VMRUN of L2

 
> > > > > In other words, AFAICT, nested.ctl.int_ctl is special in that KVM needs it to be
> > > > > up-to-date at all times, *and* it needs to copied back to vmcb12 (or userspace).
> > > > 
> > > > Hmm actually looking at nested.ctl.int_ctl, I don't think it's that special.
> > > > Most KVM usages are checking "in" bits, i.e. whether some features (e.g.
> > > > vGIF) are enabled or not.
> > > >
> > > > The "out" bits seem to only be consumed by svm_clear_vintr(), and I
> > > > think this can be worked around.
> > > 
> > > OMG that code makes my head hurt.  Isn't that code just this?
> > > 
> > > 	/*
> > > 	 * Drop int_ctl fields related to VINTR injection.  If L2 is active,
> > > 	 * restore the virtual IRQ flag and its vector from vmcb12 now that KVM
> > > 	 * is done usurping virtual IRQs for its own purposes.
> > > 	 */
> > > 	svm->vmcb01.ptr->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;
> > > 
> > > 	if (is_guest_mode(&svm->vcpu)) {
> > > 		svm->vmcb->control.int_ctl = (svm->vmcb->control.int_ctl & ~V_IRQ_MASK) |
> > > 					     (svm->nested.ctl.int_ctl & V_IRQ_MASK);
> > 
> > Also V_INTR_PRIO_MASK, I think?
> 
> Ugh, yes.  And I think V_IGN_TPR as well?  I can't tell if that's a bug or not.
> It looks like a bug.  AFAICT, svm_set_vintr() uses whatever V_IGN_TPR_MASK value
> happens to be in vmcb02.  I don't see how that can be desirable.

Yeah I assumed that's not a bug (for some reason), in which case
restoring V_IGN_TPR_MASK is not needed.

> 
> > But otherwise yeah I think that's what the function is doing
> > more-or-less.
> > 
> > > 		svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
> > > 	} else {
> > > 		WARN_ON_ONCE(svm->vmcb != svm->vmcb01.ptr);
> > > 	}
> > > 
> > > 	svm_clr_intercept(svm, INTERCEPT_VINTR);
> > > 	vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
> > > 
> > > > So maybe we don't really need to keep it up-to-date in the cache at all
> > > > times.
> > > 
> > > Yeah, IMO that approach is unnecessarily convoluted.  The actual logic isn't all
> > > that complex, all of the complexity comes from juggling state between the cache
> > > and vmcb02 just so that the cache can be authoritative.  Given that we failed
> > > miserably in actually making the cache authoritative, e.g. see the nested #VMEXIT
> > > flow, I think we should kill the entire concept and instead maintain an *exact*
> > 
> > When you say *exact* snapshot, do you mean move all the sanitizing logic
> > recently introduced in __nested_copy_vmcb_control_to_cache() (by Kevin
> > and myself) to sanitize vmcb02 instead?
> 
> Oh, no, not _that_ exact.  More "unchanged after emulated VMRUN"

Whew.

> > > diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> > > index ebd7b36b1ceb..2de6305be9ce 100644
> > > --- a/arch/x86/kvm/svm/svm.h
> > > +++ b/arch/x86/kvm/svm/svm.h
> > > @@ -199,14 +199,13 @@ struct svm_nested_state {
> > >          * we cannot inject a nested vmexit yet.  */
> > >         bool nested_run_pending;
> > >  
> > > -       /* cache for control fields of the guest */
> > > -       struct vmcb_ctrl_area_cached ctl;
> > > -
> > >         /*
> > > -        * Note: this struct is not kept up-to-date while L2 runs; it is only
> > > -        * valid within nested_svm_vmrun.
> > > +        * An opaque, read-only cache of vmcb12 controls, used to query L1's
> > > +        * controls while running L2, e.g. to route intercepts appropriately.
> > > +        * All reads are routed through accessors to make it all but impossible
> > > +        * for KVM to clobber its snapshot of vmcb12.
> > >          */
> > > -       struct vmcb_save_area_cached save;
> > 
> > Is dropping the cached save area intentional?
> 
> Yes, I think we should try to drop it.  Assuming the comment is correct and it
> really only 
> 
> > We can drop it and make it a local vaiable in nested_svm_vmrun(), and
> > plumb it all the way down. But it could be too big for the stack.
> 
> It's 48 bytes, there's no way that's too big.

That's before my hardening series shoved everything in there. It's now
256 bytes, which is not huge, but makes me nervous. Especially that it
may grow more in the future.

> > Allocating it every time isn't nice either.
> 
> > Do you mean to also make it opaque?
> 
> I'd prefer to drop it.

Me too, but I am nervous about putting it on the stack.

> 
> > > +       u8 __vmcb12_ctrl[sizeof(struct vmcb_ctrl_area_cached)];
> > 
> > We have a lot of accesses to svm->nested.ctl, so we'll need a lot of
> > clutter to cast the field in all of these places.
> > 
> > Maybe we add a read-only accessor that returns a pointer to a constant
> > struct?
> 
> That's what I said :-D
> 
> 	* All reads are routed through accessors to make it all but impossible
> 	* for KVM to clobber its snapshot of vmcb12.
> 
> There might be a lot of helpers, but I bet it's less than nVMX has for vmcs12.

Oh I meant instead of having a lot of helpers, have a single helper that
returns it as a pointer to const struct vmcb_ctrl_area_cached? Then all
current users just switch to the helper instead of directly using
svm->nested.ctl.

We can even name it sth more intuitive like svm_cached_vmcb12_control().

> 
> > >         bool initialized;
> > > 
> > > > > Part of me wants to remove these two fields entirely:
> > > > > 
> > > > > 	/* cache for control fields of the guest */
> > > > > 	struct vmcb_ctrl_area_cached ctl;
> > > > > 
> > > > > 	/*
> > > > > 	 * Note: this struct is not kept up-to-date while L2 runs; it is only
> > > > > 	 * valid within nested_svm_vmrun.
> > > > > 	 */
> > > > > 	struct vmcb_save_area_cached save;
> > > > > 
> > > > > and instead use "full" caches only for the duration of nested_svm_vmrun().  Or
> > > > > hell, just copy the entire vmcb12 and throw the cached structures in the garbage.
> > > > > But that'll probably end in a game of whack-a-mole as things get moved back in.
> > > > 
> > > > Yeah, KVM needs to keep some of the fields around :/
> > > 
> > > For me, that's totally fine.  As above, the problem I see is that there is no
> > > single source of truth, i.e. that the authoritative state is spread across vmcb02
> > > and the cache.
> > > 
> > > > > So rather than do something totally drastic, I think we should kill
> > > > > nested_copy_vmcb_cache_to_control() and replace it with a "save control" flow.
> > > > > And then have it share code as much code as possible with nested_svm_vmexit(),
> > > > > and fixup nested_svm_vmexit() to not pull from svm->nested.ctl unnecessarily.
> > > > > Which, again AFICT, is pretty much limited to int_ctl: either vmcb02 is
> > > > > authoritative, or KVM shouldn't be updating vmcb12, and so only the "save control"
> > > > > for KVM_GET_NESTED_STATE needs to copy from the cache to the migrated vmcb12.
> > > > 
> > > > I think this works if we draw a clear extinction between "in","out", and
> > > > "in/out" fields, which is not great because some fields (like int_ctl)
> > > > have different directions for different bits :/
> > > > 
> > > > But if we do draw that distinction, and have helpers that copy fields
> > > > based on direction, things become more intuitive:
> > > > 
> > > > During nested VMRUN, we use the "in" and "in/out" fields from cached
> > > > vmcb12 to construct vmcb02 through nested_vmcb02_prepare_control().
> > > > 
> > > > During save, we save "in" fields from the cached vmcb12, "out" and
> > > > "in/out" fields from vmcb02.
> > > > 
> > > > During restore, we use the "in" and "in/out" fields from the restored
> > > > payload to construct vmcb02 through nested_vmcb02_prepare_control(), AND
> > > > update the "out" fields as well from the payload.
> > > 
> > > Why the last part?  If L2 is active, then the pure "out" fields are guaranteed
> > > to be written on nested #VMEXIT.  Anything else simply can't work.
> > 
> > I think it just so happens that all pure "out" fields are consumed by
> > KVM before save+restore is possible, but it is possible for an "out"
> > field to be used by KVM at a later point, or copied from vmcb02 to
> > vmcb12 during nested #VMEXIT (e.g. if KVM exits to L1 directly after
> > save+restore, before running L2).
> 
> Ya.
> 
> > next_rip and int_state fall in this bucket, it just happens to also be
> > an "in" field.
> > 
> > For example, if support for decode assists is added, there will be cases
> > where KVM just copies insn_bytes from vmcb02 to vmcb12 on nested
> > #VMEXIT. If insn_bytes is lost on save+restore, and KVM immediately
> > exits to L1 after restore, insn_bytes is lost.
> > 
> > So we need to also save+restore pure "out" fields, which we do not do
> > today.
> 
> Hmm, strictly speaking, no.  We'd be fixing a bug that doesn't exist, yet.  But
> the word yet...
> 
> > > > During synthesized #VMEXIT, we save the "out" and "in/out" fields from
> > > > vmcb02 (shared part with save/restore).
> > > 
> > > Yeah, that all works.  We could also treat save() as an extension of #VMEXIT, but
> > > that could make KVM_GET_NESTED_STATE non-idempotent (which might already be the
> > > case for VMX?).  I.e. we could sync vmcb02 to vmcb12 (cache), and then copy that
> > > to userspace.
> > 
> > I think this will require adding more fields to the cache, but wait, we
> > already have a lot of "out" fields there but I don't think they are
> > being used at all..
> > 
> > Anyway, this may make things simpler. Instead of pulling different
> > fields from either cached vmcb12 and vmcb02, we always combine them
> > first. I will keep that in mind.
> > 
> > > 
> > > > The save/restore changes would need a flag to avoid restoring garbage
> > > > from an older KVM.
> > > 
> > > I don't follow.  I was thinking we'd only change how KVM maintains authoritative
> > > state while runnign L2, i.e. not make any changes (other than fixes) to the
> > > serialized state for save/restore.
> > 
> > I thought we're not currently saving "out" fields at all, but
> > apparently we are, we just do not use them in svm_set_nested_state(). So
> > we probably do not need a flag. Even if some fields are not currently
> > copied, I assume KVM restoring garbage from an older KVM is no worse
> > than having uninitialized garbage :)
> 
> Heh, yep.
> 
> > I think this will be annoying when new fields are added, like
> > insn_bytes. Perhaps at some point we move to just serializing the entire
> > combined vmcb02/vmcb12 control area and add a flag for that.
> 
> If we do it now, can we avoid the flag?

I don't think so. Fields like insn_bytes are not currently serialized at
all. The moment we need them, we'll probably need to add a flag, at
which point serializing everything under the flag would probably be the
sane thing to do.

That being said, I don't really know how a KVM that uses insn_bytes
should handle restoring from an older KVM that doesn't serialize it :/

Problem for the future, I guess :)