linux-kernel - Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACw3F51FbzkkX_DcCVCieZ=408oP_Fy3sXYk5AjWRX3RJO2Fzg@mail.gmail.com>
Date: Sun, 3 Nov 2024 21:02:03 -0800
From: Jiaqi Yan <jiaqiyan@...gle.com>
To: Marc Zyngier <maz@...nel.org>
Cc: oliver.upton@...ux.dev, joey.gouly@....com, suzuki.poulose@....com, 
	yuzenghui@...wei.com, catalin.marinas@....com, will@...nel.org, 
	pbonzini@...hat.com, linux-arm-kernel@...ts.infradead.org, 
	kvmarm@...ts.linux.dev, kvm@...r.kernel.org, duenwen@...gle.com, 
	rananta@...gle.com, James Houghton <jthoughton@...gle.com>, 
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA

Hi Marc, thanks for your quick response!

On Fri, Nov 1, 2024 at 6:54 AM Marc Zyngier <maz@...nel.org> wrote:
>
> On Thu, 31 Oct 2024 21:21:04 +0000,
> Jiaqi Yan <jiaqiyan@...gle.com> wrote:
> >
> > Currently KVM handles SEA in guest by injecting async SError into
> > guest directly, bypassing VMM, usually results in guest kernel panic.
> >
> > One major situation of guest SEA is when vCPU consumes uncorrectable
> > memory error on the physical memory. Although SError and guest kernel
> > panic effectively stops the propagation of corrupted memory, it is not
> > easy for VMM and guest to recover from memory error in a more graceful
> > manner.
> >
> > Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> > how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> > thread.
>
> Can you elaborate on why the delivery of a signal is preferable to
> simply exiting back to userspace with a description of the error?
> Signals are usually not generated by KVM, and are a pretty twisted way
> to generate an exit.

A couple of reasons. First, my intuition is that KVM and core kernel
(do_sea) should have aligned behavior when APEI failed to claim the
SEA. Second, if we only talk about SEA caused by memory poison
consumption, both arm64 and x86 KVM already send SIGBUS to VMM/vCPU
thread (kvm_send_hwpoison_signal) to signal hardware memory failure,
although the situation is slightly different here, where we have a
hardware event, versus a HWPoison flag check or VM_FAULT_HWPOISON
returned. But from VMM/vCPU's perspective, hardware event or software
level VM_FAULT_HWPOISON, it would be nice if it can react to just the
same event, the SIGBUS signal.

And there is another reason around your comment on arm64_notify_die.

By "exiting back to userspace with a description of the error", are
you suggesting KVM_EXIT_MEMORY_FAULT? If so, we may need to add a new
flag to tell VMM the error is hardware memory poison, which could be
KVM_MEMORY_EXIT_FLAG_USERFAULT[1] if we don't want a specific one (but
I think a specific flag for hwpoison is probably clearer).

>
> > In addition to the benifit that KVM's handling for SEA becomes aligned
> > with core kernel behavior
> > - The blast radius in VM can be limited to only the consuming thread
> >   in guest, instead of entire guest kernel, unless the consumption is
> >   from guest kernel.
> > - VMM now has the chance to do its duties to stop the VM from repeatedly
> >   consuming corrupted data. For example, VMM can unmap the guest page
> >   from stage-2 table to intercept forseen memory poison consumption,
>
> Not quite. The VMM doesn't manage stage-2. It can remove the page from
> the VMA if it has it mapped, but that's it. The kernel deals with S2.

I should probably not mention the implementation, "unmap from S2".
What is needed for preventing repeated consuming memory poison is
simply preventing guest access to certain memory pages. There is a
work in progress KVM API [1] by my colleague James.

[1] https://lpc.events/event/18/contributions/1757/attachments/1442/3073/LPC_%20KVM%20Userfault.pdf

>
> Which brings me to the next subject: when the kernel unmaps the page
> at S2, it is likely to use CMOs. Can these CMOs create RAS error
> themselves?

I assume CMO here means writing dirty cache lines to memory. Writing
something new to a poisoned cacheline usually won't cause RAS error.
Notifying memory poison usually is delayed to a memory load
transaction.

>
> >   and for every consumption injects SEA to EL1 with synthetic memory
> >   error CPER.
>
> Why do we need to involve ACPI here? I would expect the production of
> an architected error record instead. Or at least be given the option.

Sorry, I was just mentioning a specific VMM's implementation. There
are definitely multiple options (Machine Check MSRs vs CPER for error
description data, SEA vs SDEI vs SPI for notification mechanisms) for
how VMM involves the guest to handle memory error. My preference is:
VMM populates CPER in guest HEST when VMM instructs KVM to inject
i/dabt to the guest.

And a word about "be given the option": I think when VMM receives
SIGBUS with si_addr=faulted/poisoned HVA, it's got all these options,
like using the si_addr to construct CPER with poisoned guest physical
address, or mci_address MSR.

>
> > Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> > can opt in this new capability if it prefers SIGBUS than SError
> > injection during VM init. Now SEA handling in KVM works as follows:
> > 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> > 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> >    enabled for the VM, and the SEA is NOT about translation table,
> >    send SIGBUS BUS_OBJERR signal with host virtual address.
>
> And what if it is? S1 PTs are backed by userspace memory, like
> anything else. I don't think we should have a different treatment of
> those, because the HW wouldn't treat them differently either.

You are talking about ESR_ELx_FSC_SEA_TTW(1), or
ESR_ELx_FSC_SEA_TTW(0), right? I think you are right, S1 is no
difference.

But I think we want to make an exception for SEA about S2 PTs.

>
> > 3. Otherwise directly inject async SError to guest.
> >
> > Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> > that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> > in VM allocated some memory buffer. The test used EINJ to inject an
> > uncorrectable recoverable memory error at a page in the allocated memory
> > buffer. The dummy application then consumed the memory error. Some hack
> > was done to make core kernel's memory_failure triggered by poison
> > generation to fail, so KVM had to deal with the SEA guest abort due to
> > poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> > valid host virtual address of the poisoned page. VMM then injected a SEA
> > into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> > the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> > guest survived and continued to run.
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@...gle.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h |  2 +
> >  arch/arm64/include/asm/kvm_ras.h  | 20 ++++----
> >  arch/arm64/kvm/Makefile           |  2 +-
> >  arch/arm64/kvm/arm.c              |  5 ++
> >  arch/arm64/kvm/kvm_ras.c          | 77 +++++++++++++++++++++++++++++++
> >  arch/arm64/kvm/mmu.c              |  8 +---
> >  include/uapi/linux/kvm.h          |  1 +
> >  7 files changed, 98 insertions(+), 17 deletions(-)
> >  create mode 100644 arch/arm64/kvm/kvm_ras.c
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index bf64fed9820ea..eb37a2489411a 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -334,6 +334,8 @@ struct kvm_arch {
> >       /* Fine-Grained UNDEF initialised */
> >  #define KVM_ARCH_FLAG_FGU_INITIALIZED                        8
> >       unsigned long flags;
> > +     /* Instead of injecting SError into guest, SIGBUS VMM */
> > +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA                  9
>
> nit: why do you put this definition out of sequence (below 'flags')?

Ah, I will move it on top of flags.

>
> >
> >       /* VM-wide vCPU feature set */
> >       DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> > diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> > index 87e10d9a635b5..4bb7a424e3f6c 100644
> > --- a/arch/arm64/include/asm/kvm_ras.h
> > +++ b/arch/arm64/include/asm/kvm_ras.h
> > @@ -11,15 +11,17 @@
> >  #include <asm/acpi.h>
> >
> >  /*
> > - * Was this synchronous external abort a RAS notification?
> > - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > + * Handle synchronous external abort (SEA) in the following order:
> > + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> > + *    are all done.
> > + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> > + *    about translation table, send SIGBUS
> > + *    - si_code is BUS_OBJERR.
> > + *    - si_addr will be 0 when accurate HVA is unavailable.
> > + * 3. Otherwise, directly inject an async SError to guest.
> > + *
> > + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> >   */
> > -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> > -{
> > -     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > -     lockdep_assert_irqs_enabled();
> > -
> > -     return apei_claim_sea(NULL);
> > -}
> > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
> >
> >  #endif /* __ARM64_KVM_RAS_H__ */
> > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > index 3cf7adb2b5038..c4a3a6d4870e6 100644
> > --- a/arch/arm64/kvm/Makefile
> > +++ b/arch/arm64/kvm/Makefile
> > @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> >        vgic/vgic-v3.o vgic/vgic-v4.o \
> >        vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> >        vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> > -      vgic/vgic-its.o vgic/vgic-debug.o
> > +      vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
> >
> >  kvm-$(CONFIG_HW_PERF_EVENTS)  += pmu-emul.o pmu.o
> >  kvm-$(CONFIG_ARM64_PTR_AUTH)  += pauth.o
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 48cafb65d6acf..bb97ad678dbec 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> >               }
> >               mutex_unlock(&kvm->slots_lock);
> >               break;
> > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> > +             r = 0;
> > +             set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
>
> Shouldn't this be somehow gated on the VM being RAS aware?

Do you mean a CAP that VMM can tell KVM the VM guest has RAS ability?
I don't know if there is one for arm64. On x86 there is
KVM_X86_SETUP_MCE. KVM_CAP_ARM_INJECT_EXT_DABT maybe a revelant one
but I don't think it is exactly the one for "RAS ability".

>
> > +             break;
> >       default:
> >               break;
> >       }
> > @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >       case KVM_CAP_ARM_SYSTEM_SUSPEND:
> >       case KVM_CAP_IRQFD_RESAMPLE:
> >       case KVM_CAP_COUNTER_OFFSET:
> > +     case KVM_CAP_ARM_SIGBUS_ON_SEA:
> >               r = 1;
> >               break;
> >       case KVM_CAP_SET_GUEST_DEBUG2:
> > diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> > new file mode 100644
> > index 0000000000000..3225462bcbcda
> > --- /dev/null
> > +++ b/arch/arm64/kvm/kvm_ras.c
> > @@ -0,0 +1,77 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/bitops.h>
> > +#include <linux/kvm_host.h>
> > +
> > +#include <asm/kvm_emulate.h>
> > +#include <asm/kvm_ras.h>
> > +#include <asm/system_misc.h>
> > +
> > +/*
> > + * For synchrnous external instruction or data abort, not on translation
> > + * table walk or hardware update of translation table, is FAR_EL2 valid?
> > + */
> > +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> > +{
> > +     return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> > +}
> > +
> > +/*
> > + * Was this synchronous external abort a RAS notification?
> > + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> > + */
> > +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> > +{
> > +     /* apei_claim_sea(NULL) expects to mask interrupts itself */
> > +     lockdep_assert_irqs_enabled();
> > +     return apei_claim_sea(NULL);
> > +}
> > +
> > +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> > +{
> > +     bool sigbus_on_sea;
> > +     int idx;
> > +     u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> > +     u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> > +     phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +     /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> > +     unsigned long hva = 0UL;
> > +
> > +     /*
> > +      * For RAS the host kernel may handle this abort.
> > +      * There is no need to SIGBUS VMM, or pass the error into the guest.
> > +      */
> > +     if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> > +             return;
> > +
> > +     sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> > +                              &(vcpu->kvm->arch.flags));
> > +
> > +     /*
> > +      * In addition to userspace opt-in, SIGBUS only makes sense if the
> > +      * abort is NOT about translation table walk and NOT about hardware
> > +      * update of translation table.
> > +      */
> > +     sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> > +
> > +     /* Pass the error directly into the guest. */
> > +     if (!sigbus_on_sea) {
> > +             kvm_inject_vabt(vcpu);
> > +             return;
> > +     }
> > +
> > +     if (kvm_vcpu_sea_far_valid(vcpu)) {
> > +             idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +             hva = gfn_to_hva(vcpu->kvm, gfn);
> > +             srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +     }
> > +
> > +     /*
> > +      * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> > +      * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> > +      * does if apei_claim_sea() fails.
> > +      */
> > +     arm64_notify_die("synchronous external abort",
> > +                      current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
>
> This is the point where I really think we should simply trigger an
> exit with all that syndrome information stashed in kvm_run, like any
> other event requiring userspace help.

Ah, there is another reason SIGBUS is better than kvm exit: "It is a
programming error to set ext_dabt_pending after an exit which was not
either KVM_EXIT_MMIO or KVM_EXIT_ARM_NISV", from
Documentation/virt/kvm/api.rst. So if VMM is allowed to inject data
abort to guest, at least current documentation doesn't suggest kvm
exit is feasible.

>
> Also: where is the documentation?

Once I get more positive feedback and send out PATCH instead of RFC, I
can add to Documentation/virt/kvm/api.rst.

>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.