Date:   Sat, 15 Dec 2018 00:06:44 +0000
From:   gengdongjiu <gengdongjiu@...wei.com>
To:     Peter Maydell <peter.maydell@...aro.org>,
        James Morse <james.morse@....com>
CC:     Radim Krčmář <rkrcmar@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        Christoffer Dall <christoffer.dall@....com>,
        Marc Zyngier <marc.zyngier@....com>,
        Catalin Marinas <catalin.marinas@....com>,
        "Will Deacon" <will.deacon@....com>,
        kvm-devel <kvm@...r.kernel.org>,
        "open list:DOCUMENTATION" <linux-doc@...r.kernel.org>,
        lkml - Kernel Mailing List <linux-kernel@...r.kernel.org>,
        arm-mail-list <linux-arm-kernel@...ts.infradead.org>
Subject: Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery
 capability to user space

> 
> On Fri, 14 Dec 2018 at 13:56, James Morse <james.morse@....com> wrote:
> >
> > Hi Dongjiu Geng,
> >
> > On 14/12/2018 10:15, Dongjiu Geng wrote:
> > > When user space does memory recovery, it checks whether KVM and
> > > the guest support error recovery; only when both of them do will
> > > user space perform the recovery. This patch exports this
> > > capability of KVM to user space.
> >
> > I can understand user-space only wanting to do the work if host and
> > guest support the feature. But 'error recovery' isn't a KVM feature,
> > it's a Linux kernel feature.
> >
> > KVM will send its user-space a SIGBUS with an MCEERR code whenever it's
> > trying to map a page at stage2 and the kernel-mm code refuses because it's poisoned.
> > (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
> >
> > This is exactly the same as happens to a normal user-space process.
> >
> > I think you really want to know if the host kernel was built with
> > CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel wasn't built with that support then it will simply never deliver any
> memory failure events to QEMU, which is fine.
> 
> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory failure notifications", and so the code path which handles the SIGBUS must do
> some kind of check for whether the guest CPU is a type which expects them and that the board code set up the ACPI tables that it wants to
> fill in.

Thanks for Peter's explanation. Frankly speaking, I agree with Peter's suggestion.

To James: let me explain a bit more. As Peter said, QEMU needs to check, through the guest ACPI tables, whether the guest CPU is a type that can handle the error. Let us look at the x86 QEMU logic:
1. Before the vCPU is created, QEMU sets a default env->mcg_cap value with the MCE_CAP_DEF flags; MCG_SER_P means it expects the guest CPU model to support RAS error recovery. [1]
2. When the vCPU is initialized, QEMU checks whether the host kernel supports this feature [2]. Only when both the host kernel and the default env->mcg_cap value advertise the feature does QEMU set up the vCPU for RAS error recovery. [3]
So I added the capability KVM_CAP_ARM_MEMORY_ERROR_RECOVERY to let QEMU check whether the host/KVM supports RAS error detection and recovery; only when it does will QEMU do the error recovery for the guest memory. A sketch of the intended user-space check follows the snippets below.

[1]
#define MCE_CAP_DEF     (MCG_CTL_P|MCG_SER_P)
cenv->mcg_cap = MCE_CAP_DEF | MCE_BANKS_DEF |
                        (cpu->enable_lmce ? MCG_LMCE_P : 0);

[2] ret = kvm_get_mce_cap_supported(cs->kvm_state, &mcg_cap, &banks);

[3]
env->mcg_cap &= mcg_cap | MCG_CAP_BANKS_MASK;
ret = kvm_vcpu_ioctl(cs, KVM_X86_SETUP_MCE, &env->mcg_cap);
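
For comparison, the user-space side of the ARM check could look roughly like the sketch below. This is only an illustration: kvm_arm_ras_recovery_supported() is a hypothetical helper name, kvm_check_extension() is QEMU's existing wrapper for KVM_CHECK_EXTENSION, and KVM_CAP_ARM_MEMORY_ERROR_RECOVERY is the capability this patch proposes (it is not in mainline).

/* Hypothetical QEMU helper (sketch only): query the proposed
 * capability before doing error recovery for guest memory. */
static bool kvm_arm_ras_recovery_supported(KVMState *s)
{
    return kvm_check_extension(s, KVM_CAP_ARM_MEMORY_ERROR_RECOVERY) > 0;
}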

-------------------------------------For James's comments---------------------------------------------------------------------
> KVM doesn't detect these errors.
> The hardware detects them and notifies the OS via one of a number of mechanisms.
> This gets plumbed into memory_failure(), which sets a flag that the mm code uses
> to prevent the page being used again.

> KVM is only involved when it tries to map a page at stage2 and the mm code
> rejects it with -EHWPOISON. This is the same as the architectures
> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
------------------------------------------------------------------------------------------------------------------------------
James, I completely understand your above comments, but KVM also delivers the SIGBUS, which means KVM supports guest memory RAS error recovery, so maybe we need to tell user space about this capability. A sketch of how user space sees that SIGBUS follows.
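
To illustrate what user space receives (my sketch, not QEMU's actual handler): the poison arrives as a SIGBUS whose siginfo carries BUS_MCEERR_AR or BUS_MCEERR_AO, the same codes an ordinary process would see from memory_failure().

#include <signal.h>

/* Sketch of a user-space SIGBUS handler, installed with
 * sigaction(SA_SIGINFO), distinguishing the two memory-failure
 * codes the kernel can deliver. For illustration only. */
static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
{
    switch (si->si_code) {
    case BUS_MCEERR_AR: /* action required: the poisoned page was consumed */
        /* e.g. inject a synchronous external abort into the guest */
        break;
    case BUS_MCEERR_AO: /* action optional: poison found asynchronously */
        /* e.g. record it and notify the guest later via ACPI/GHES */
        break;
    default:
        /* an ordinary SIGBUS, unrelated to memory failure */
        break;
    }
    /* si->si_addr holds the faulting host virtual address */
}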

---------------------------------------------- For James's comments ---------------------------------------------------
> The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
> support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
> systems would report not-supported here, but the kernel does support this stuff.
> Just because the CPU supports this, doesn't mean the kernel was built with
> CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.
--------------------------------------------------------------------------------------------------------------------------------------
James, regarding your above comments: if you think we should not check cpus_have_const_cap(ARM64_HAS_RAS_EXTN), what do you prefer that we check instead?
In the x86 KVM code, a hard-coded value is used to tell user space that the host/KVM supports RAS error software recovery [4]. If KVM does not check cpus_have_const_cap(ARM64_HAS_RAS_EXTN), we would have to hard-code the capability the same way x86 does; a sketch of such a hard-coded check follows the excerpt below.

[4]:
u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;

long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
{
	void __user *argp = (void __user *)arg;
	long r;

	switch (ioctl) {
	/* ... other cases elided ... */
	case KVM_X86_GET_MCE_CAP_SUPPORTED: {
		r = -EFAULT;
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
			goto out;
		r = 0;
		break;
	}
	}
out:
	return r;
}
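
If the ARM side took the same hard-coded route, the KVM check might look roughly like the fragment below, slotted into kvm_vm_ioctl_check_extension(). This is a sketch only; whether to gate it on CONFIG_MEMORY_FAILURE (as James suggests) or on cpus_have_const_cap(ARM64_HAS_RAS_EXTN) (as the RFC does) is exactly the open question.

	/* Sketch: report the proposed capability to user space.
	 * Gating on CONFIG_MEMORY_FAILURE follows James's suggestion;
	 * the RFC instead checks cpus_have_const_cap(ARM64_HAS_RAS_EXTN). */
	case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
		r = IS_ENABLED(CONFIG_MEMORY_FAILURE);
		break;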

> 
> thanks
> -- PMM
