linux-kernel - [RFC PATCH v3 0/2] Deprecate BUG() in pte_list

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20221128002043.1555543-1-mizhang@google.com>
Date:   Mon, 28 Nov 2022 00:20:41 +0000
From:   Mingwei Zhang <mizhang@...gle.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Cc:     "H. Peter Anvin" <hpa@...or.com>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Mingwei Zhang <mizhang@...gle.com>,
        Nagareddy Reddy <nspreddy@...gle.com>,
        Jim Mattson <jmattson@...gle.com>,
        David Matlack <dmatlack@...gle.com>
Subject: [RFC PATCH v3 0/2] Deprecate BUG() in pte_list_remove() in shadow mmu

Deprecate BUG() in pte_list_remove() in shadow mmu to avoid crashing a
physical machine. There are several reasons and motivations to do so:

MMU bug is difficult to discover due to various racing conditions and
corner cases and thus it extremely hard to debug. The situation gets much
worse when it triggers the shutdown of a host. Host machine crash might
eliminates everything including the potential clues for debugging.

>From cloud computing service perspective, BUG() or BUG_ON() is probably no
longer appropriate as the host reliability is top priority. Crashing the
physical machine is almost never a good option as it eliminates innocent
VMs and cause service outage in a larger scope. Even worse, if attacker can
reliably triggers this code by diverting the control flow or corrupting the
memory, then this becomes vm-of-death attack. This is a huge attack vector
to cloud providers, as the death of one single host machine is not the end
of the story. Without manual interferences, a failed cloud job may be
dispatched to other hosts and continue host crashes until all of them are
dead.

For the above reason, we propose the replacement of BUG() in
pte_list_remove() with KVM_BUG() to crash just the VM itself.


v2 -> v3:
 - plumb @kvm all the way to pte_list_remove() [seanjc, pbonzini]

v1 -> v2:
 - compile test the code.
 - fill KVM_BUG() with kvm_get_running_vcpu()->kvm
rfc v2:
 - https://lore.kernel.org/all/20221124003505.424617-1-mizhang@google.com/

rfc v1:
 - https://lore.kernel.org/all/20221123231206.274392-1-mizhang@google.com/


Mingwei Zhang (2):
  KVM: x86/mmu: plumb struct kvm all the way to pte_list_remove()
  KVM: x86/mmu: replace BUG() with KVM_BUG() in shadow mmu

 arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

-- 
2.38.1.584.g0f3c55d4c2-goog