linux-kernel - Re: [PATCH 03/19] Revert "KVM: SVM: Introduce hybrid-AVIC mode"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Yw+yjo4TMDYnyAt+@google.com>
Date:   Wed, 31 Aug 2022 19:12:14 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Maxim Levitsky <mlevitsk@...hat.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
        Li RongQing <lirongqing@...du.com>
Subject: Re: [PATCH 03/19] Revert "KVM: SVM: Introduce hybrid-AVIC mode"

On Wed, Aug 31, 2022, Maxim Levitsky wrote:
> On Wed, 2022-08-31 at 16:29 +0000, Sean Christopherson wrote:
> > On Wed, Aug 31, 2022, Maxim Levitsky wrote:
> > > Leaving AVIC on, when vCPU is in x2apic mode cannot trigger extra MMIO emulation,
> > > in fact the opposite - because AVIC is on, writes to 0xFEE00xxx might *not* trigger
> > > MMIO emulation and instead be emulated by AVIC.
> > 
> > That's even worse, because KVM is allowing the guest to exercise hardware logic
> > that I highly doubt AMD has thoroughly tested.
> 
> Harware logic is exactly the same regarless of if KVM uses x2apic mode or not,
> and it is better to be prepared for all kind of garbage coming from the guest.

Right, I got twisted around and was thinking x2APIC enabling of the guest was
reflected in hardware, but it's purely emulated in this mode.

> Software logic, I can understand you, there could be registers that trap differently
> in avic and x2avic mode, but it should be *very* easy to deal with it, the list
> of registers that trap is very short.
> 
> > 
> > > Yes, some of these writes can trigger AVIC specific emulation vm exits, but they
> > > are literaly the same as those used by x2avic, and it is really hard to see
> > > why this would be dangerous (assuming that x2avic code works, and avic code
> > > is aware of this 'hybrid' mode).
> > 
> > The APIC_RRR thing triggered the KVM_BUG_ON() in kvm_apic_write_nodecode()
> > precisely because of the AVIC trap.  At best, this gives a way for the guest to
> > trigger a WARN_ON_ONCE() and thus panic the host if panic_on_warn=1.  I fixed
> > the APIC_RRR case because that will be problematic for x2AVIC, but there are
> > other APIC registers that are unsupported in x2APIC that can trigger the KVM_BUG_ON().

> > > From the guest point of view, unless the guest pokes at random MMIO area,
> > > the only case when this matters is if the guest maps RAM over the 0xFEE00xxx
> > > (which it of course can, the spec explictly state as you say that when x2apic
> > > is enabled, the mmio is disabled), and then instead of functioning as RAM,
> > > the range will still function as APIC.
> > 
> > There is no wiggle room here though, KVM is blatantly breaking the architectural
> > specification.  When x2APIC is enabled, the xAPIC MMIO does not exist.
> 
> In this case I say that there is no wiggle room for KVM to not allow
> different APIC bases on each CPU - the spec 100% allows it, but in KVM it is
> broken.

The difference is that KVM is consistent with respect to itself in that case.
KVM doesn't support APIC base relocation, and never has supported APIC base
relocation.

This hybrid AVIC mode means that the resulting KVM behavior will vary depending
on whether or not AVIC is supported and enabled, whether or not x2AVIC is supported,
and also on internal KVM state.  And we even have tests for this!  E.g. run the APIC
unit test with AVIC disabled and it passes, run it with AVIC enabled and it fails.

> If you are really hell bent on not having that MMIO exposed, then I say we
> can just disable the AVIC memslot, and keep AVIC enabled in this case - this
> should make us both happy.

I don't think that will work though, as I don't think it's possible to tell hardware
not to accelerate AVIC accesses.  I.e. KVM can squash the unaccelerated traps, but
anything that is handled by hardware will still go through.

> This discussion really makes me feel that you want just to force your opinion
> on others, and its not the first time this happens. It is really frustrating
> to work like that.  It might sound harsh but this is how I feel. Hopefully I
> am wrong.

Yes and no.  I am most definitely trying to steer KVM in a direction that I feel
will allow KVM to scale in terms of developers, users, and workloads.  Part of
that direction is to change people's mindset from "good enough" to "implement to
the spec".

KVM's historic "good enough" approach largely worked when KVM had a relatively
small and stable development community, and when KVM was being used to run a
relatively limited set of workloads.  But KVM is being used to run an ever increasing
variety of workloads, and the development community is larger (and I hope that we
can grow it even further).  And there is inevitably attrition, which means that
unless we are extremely diligent in documenting KVM's quirks/errata, knowledge of
KVM's "good enough" shortcuts will eventually be lost.

E.g. it took me literally days to understand KVM's hack for forcing #PF=>#PF=>VM-Exit
instead of #PF=>#PF=>#DF.  That mess would have been avoided if KVM had implemented
to the spec and actually done the right thing back when the bug was first encountered.
Ditto for the x2APIC ID hotplug hack; I stared at that code on at least three
different occassions before I finally understood the full impact of the hack.

And getting KVM to implement to the spec means not deviating from that path when
it's inconvenient to follow, thus the hard line I am drawing.  I am sure we'll
encounter scenarios/features where it's simply impossible for KVM to do the right
thing, e.g. I believe virtualizing VMX's posted interrupts falls into this category.
But IMO those should be very, very rare exceptions.

So, "yes" in the sense that I am openly trying to drag people into alignment with
my "implement to the spec" vision.  But "no" in the sense that I don't think it's
fair to say I am forcing my opinion on others.  I am not saying "NAK" and ending
the discussion.