Message-ID: <YzRvMZDoukMbeaxR@google.com>
Date:   Wed, 28 Sep 2022 15:58:41 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Paolo Bonzini <pbonzini@...hat.com>
Cc:     Emanuele Giuseppe Esposito <eesposit@...hat.com>,
        David Hildenbrand <david@...hat.com>,
        Maxim Levitsky <mlevitsk@...hat.com>, kvm@...r.kernel.org,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        "H. Peter Anvin" <hpa@...or.com>, linux-kernel@...r.kernel.org,
        Like Xu <like.xu.linux@...il.com>
Subject: Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates

On Wed, Sep 28, 2022, Paolo Bonzini wrote:
> On 9/27/22 17:58, Sean Christopherson wrote:
> > I'm pretty sure this patch will Just Work for QEMU, because QEMU simply resumes
> > the vCPU if mmio.len==0.  It's a bit of a hack, but I don't think it violates KVM's
> > ABI in any way, and it can even become "official" behavior since KVM x86 doesn't
> > otherwise exit with mmio.len==0.
> 
> I think this patch is not a good idea for two reasons:
> 
> 1) we don't know how userspace behaves if mmio.len is zero.  It is of course
> reasonable to do nothing, but an assertion failure is also a valid behavior.

Except that KVM currently does neither.  If the fetch happens at CPL>0 and/or in
L2, KVM injects #UD.  That's flat out architecturally invalid.  If it's a sticking
point, the mmio.len==0 hack can be avoided by defining a new exit reason.
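
FWIW, the userspace side of the mmio.len==0 approach is all of a few lines,
e.g. (untested sketch, not QEMU's actual code; handle_mmio() is a made-up
stand-in for real emulation):

  /* Minimal sketch of a userspace vCPU run loop; error handling elided. */
  static void run_vcpu(int vcpu_fd, struct kvm_run *run)
  {
          for (;;) {
                  if (ioctl(vcpu_fd, KVM_RUN, 0) < 0 && errno != EINTR)
                          return;

                  if (run->exit_reason == KVM_EXIT_MMIO) {
                          /* Zero-length MMIO: nothing to emulate, resume. */
                          if (!run->mmio.len)
                                  continue;
                          handle_mmio(run);
                  }
          }
  }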

> 2) more important, there is no way to distinguish a failure due to the guest
> going in the weeds (and then KVM_EXIT_INTERNAL_ERROR is fine) from one due
> to the KVM_SET_USER_MEMORY_REGION race condition.  So this will cause a
> guest that correctly caused an internal error to loop forever.

Userspace has the GPA and absolutely should be able to detect whether the MMIO
exit was due to its own memslot manipulation or to the guest jumping into the weeds.
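
Concretely, userspace could do something like the below on a zero-length MMIO
exit (sketch only; in_flight_memslot_update() is a stand-in for whatever
bookkeeping the VMM already does for in-progress updates):

  /*
   * Retry if the GPA lies in a region that is currently being remapped,
   * otherwise assume the guest jumped into the weeds and report it.
   */
  static bool should_retry_zero_len_mmio(struct kvm_run *run)
  {
          if (run->mmio.len)
                  return false;

          return in_flight_memslot_update(run->mmio.phys_addr);
  }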

> While the former could be handled in a "wait and see" manner, the latter in
> particular is part of the KVM_RUN contract.  Of course it is possible for a
> guest to just loop forever, but in general all of KVM, QEMU and upper
> userspace layers want a crashed guest to be detected and stopped forever.
> 
> Yes, QEMU could loop only if memslot updates are in progress, but honestly
> all the alternatives I have seen to atomic memslot updates are really
> *awful*.  David's patches even invent a new kind of mutex for which I have
> absolutely no idea what kind of deadlocks one should worry about and why
> they should not exist; QEMU's locking is already pretty crappy, it's
> certainly not on my wishlist to make it worse!
> 
> This is clearly a deficiency in the KVM kernel API, and (thanks to SRCU) the
> kernel is the only place where you can have a *good* fix.  It should have
> been fixed years ago.

I don't disagree that the memslots API is lacking, but IMO that is somewhat
orthogonal to fixing KVM x86's "code fetch to MMIO" mess.  Such a massive new API
should be viewed and prioritized as a new feature, not as a bug fix, e.g. I'd
like to have the luxury of being able to explore ideas beyond "let userspace
batch memslot updates", and I really don't want to feel pressured to get this
code reviewed and merged.

E.g. why do a batch update and not provide KVM_SET_ALL_USER_MEMORY_REGIONS to
do wholesale replacement?  That seems like it would be vastly simpler to handle
on KVM's end.  Or maybe there's a solution in the opposite direction, e.g. an
API that allows 1->N or N->1 conversions but not arbitrary batching.
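
Strawman, purely to illustrate the shape of such an API (none of this exists
today, and the ioctl number is made up):

  /* Hypothetical ioctl: atomically replace the entire memslot layout. */
  struct kvm_all_user_memory_regions {
          __u32 nregions;
          __u32 pad;
          struct kvm_userspace_memory_region regions[];
  };

  #define KVM_SET_ALL_USER_MEMORY_REGIONS \
          _IOW(KVMIO, 0xf0, struct kvm_all_user_memory_regions)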

And just because QEMU's locking is "already pretty crappy", that's not a good
reason to drag KVM down into the mud.  E.g. taking a lock and conditionally
releasing it...  I get that this is an RFC, but IMO anything that requires such
shenanigans simply isn't acceptable.

  /*
   * Takes kvm->slots_arch_lock, and releases it only if invalid_slot
   * allocation fails, kvm_prepare_memory_region() fails, or
   * batch->is_move_delete is true.
   */
  static int kvm_prepare_memslot(struct kvm *kvm,
			         struct kvm_internal_memory_region_list *batch)
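
i.e. the expectation is that lock and unlock pair up in the same scope,
along the lines of (sketch; kvm_commit_memslot() is a made-up placeholder):

  mutex_lock(&kvm->slots_arch_lock);
  r = kvm_prepare_memslot(kvm, batch);
  if (!r)
          r = kvm_commit_memslot(kvm, batch);
  mutex_unlock(&kvm->slots_arch_lock);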
