linux-kernel - Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7timm7vdq4vjwn6jo5bwgtbn3f7pdtdch7l5dws76pjg7syqwb@al5mifdmboog>
Date: Wed, 18 Jun 2025 20:03:06 +0530
From: Naveen N Rao <naveen.rao@....com>
To: Sean Christopherson <seanjc@...gle.com>
CC: Marc Zyngier <maz@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>,
	Paolo Bonzini <pbonzini@...hat.com>, Joerg Roedel <joro@...tes.org>, "David
 Woodhouse" <dwmw2@...radead.org>, Lu Baolu <baolu.lu@...ux.intel.com>,
	<linux-arm-kernel@...ts.infradead.org>, <kvmarm@...ts.linux.dev>,
	<kvm@...r.kernel.org>, <iommu@...ts.linux.dev>,
	<linux-kernel@...r.kernel.org>, Sairaj Kodilkar <sarunkod@....com>, "Vasant
 Hegde" <vasant.hegde@....com>, Maxim Levitsky <mlevitsk@...hat.com>, "Joao
 Martins" <joao.m.martins@...cle.com>, Francesco Lavra
	<francescolavra.fl@...il.com>, David Matlack <dmatlack@...gle.com>
Subject: Re: [PATCH v3 12/62] KVM: SVM: Inhibit AVIC if ID is too big instead
 of rejecting vCPU creation

On Tue, Jun 17, 2025 at 09:10:10AM -0700, Sean Christopherson wrote:
> On Tue, Jun 17, 2025, Naveen N Rao wrote:
> > On Wed, Jun 11, 2025 at 03:45:15PM -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > > index ab228872a19b..f0a74b102c57 100644
> > > --- a/arch/x86/kvm/svm/avic.c
> > > +++ b/arch/x86/kvm/svm/avic.c
> > > @@ -277,9 +277,19 @@ static int avic_init_backing_page(struct kvm_vcpu *vcpu)
> > >  	int id = vcpu->vcpu_id;
> > >  	struct vcpu_svm *svm = to_svm(vcpu);
> > >  
> > > +	/*
> > > +	 * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
> > > +	 * hardware.  Immediately clear apicv_active, i.e. don't wait until the
> > > +	 * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
> > > +	 * avic_vcpu_load() expects to be called if and only if the vCPU has
> > > +	 * fully initialized AVIC.
> > > +	 */
> > >  	if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
> > > -	    (id > X2AVIC_MAX_PHYSICAL_ID))
> > > -		return -EINVAL;
> > > +	    (id > X2AVIC_MAX_PHYSICAL_ID)) {
> > > +		kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
> > > +		vcpu->arch.apic->apicv_active = false;
> > 
> > This bothers me a bit. kvm_create_lapic() does this:
> >           if (enable_apicv) {
> >                   apic->apicv_active = true;
> >                   kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> > 	  }
> > 
> > But, setting apic->apicv_active to false here means KVM_REQ_APICV_UPDATE 
> > is going to be a no-op.
> 
> That's fine, KVM_REQ_APICV_UPDATE is a nop in other scenarios, too.  I agree it's
> not ideal, but this is a rather extreme edge case and a one-time slow path, so I
> don't think it's worth doing anything special just to avoid KVM_REQ_APICV_UPDATE.
> 
> > This does not look to be a big deal given that kvm_create_lapic() 
> > is called just a bit before svm_vcpu_create() (which calls the above 
> > function through avic_init_vcpu()) in kvm_arch_vcpu_create(), so there 
> > isn't that much done before apicv_active is toggled.
> > 
> > But, this made me wonder if introducing a kvm_x86_op to check and 
> > enable/disable apic->apicv_active in kvm_create_lapic() might be cleaner 
> > overall
> 
> Hmm, yes and no.  I completely agree that clearing apicv_active in avic.c is all
> kinds of gross, but clearing apic->apicv_active in lapic.c to handle this particular
> scenario is just as problematic, because then avic_init_backing_page() would need
> to check kvm_vcpu_apicv_active() to determine whether or not to allocate the backing
> page.  In a way, that's even worse, because setting apic->apicv_active by default
> is purely an optimization, i.e. leaving it %false _should_ work as well, it would
> just be suboptimal.  But if AVIC were to key off apic->apicv_active, 
> that could
> lead to KVM incorrectly skipping allocation of the AVIC backing page.

I understand your concern about key'ing off apic->apicv_active - that 
would definitely require a thorough audit and does add complexity to 
this.

However, as far as I can see, after your current series, we no longer 
maintain a pointer to the AVIC backing page, but instead rely on the 
lapic-allocated page.

Were you referring to the APIC access page though? That is behind 
kvm_apicv_activated() today, which looks to be problematic if there are 
inhibits set during vcpu_create() and if those can be unset later?  
Shouldn't we be allocating the apic access page unconditionally here?

> 
> > Maybe even have it be the initialization point for APICv: 
> > apicv_init(), so we can invoke avic_init_vcpu() right away?
> 
> I mostly like this idea (actually, I take that back; see below), but VMX throws
> a big wrench in things.  Unlike SVM, VMX doesn't have a singular "enable APICv"
> flag.  Rather, KVM considers "APICv" to be the combination of APIC-register
> virtualization, virtual interrupt delivery, and posted interrupts.
> 
> Which is totally fine.  The big wrench is that the are other features that interact
> with "APICv" and require allocations.  E.g.  the APIC access page isn't actually
> tied to enable_apicv, it's tied to yet another VMX feature, "virtualize APIC
> accesses" (not to be confused with APIC-register virtualization; don't blame me,
> I didn't create the control knobs/names).
> 
> As a result, KVM may need to allocate the APIC access page (not to be confused
> with the APIC *backing* page; again, don't blame me :-D) when enable_apicv=false,
> and even more confusingly, NOT allocate the APIC access page when enable_apicv=true.
> 
> 	if (cpu_need_virtualize_apic_accesses(vcpu)) {  <=== not an enable_apic check, *sigh*
> 		err = kvm_alloc_apic_access_page(vcpu->kvm);
> 		if (err)
> 			goto free_vmcs;
> 	}

Thanks for the background and the details there -- I am going to need 
some time to go unpack all of that :)

> 
> And for both SVM and VMX, IPI virtualization adds another wrinkle, in that the
> per-vCPU setup needs to fill an entry in a per-VM table.  And filling that entry
> needs to happen at the very end of vCPU creation, e.g. so that the vCPU can't be
> targeted until KVM is ready to run the vCPU.
> 
> Ouch.  And I'm pretty sure there's a use-after-free in the AVIC code.  If
> svm_vcpu_alloc_msrpm() fails, the avic_physical_id_table[] will still have a pointer
> to freed vAPIC page.

Oops.

> 
> Thus, invoking avic_init_vcpu() "right away" is unfortunately flat out unsafe
> (took me a while to realize that).
> 
> So while I 100% agree with your dislike of this patch, I think it's the least
> awful band-aid, at least in the short term.
> 
> Longer term, I'd definitely be in favor of cleaning up the related flows, but I
> think that's best done on top of this series, because I suspect it'll be somewhat
> invasive.

Sure, that makes sense. For this patch:
Acked-by: Naveen N Rao (AMD) <naveen@...nel.org>


Thanks,
Naveen