[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5A2E8DAC.1010400@arm.com>
Date: Mon, 11 Dec 2017 13:52:44 +0000
From: James Morse <james.morse@....com>
To: gengdongjiu <gengdongjiu@...wei.com>
CC: Will Deacon <will.deacon@....com>, mark.rutland@....com,
tbaicar@...eaurora.org, kristina.martsenko@....com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
Huangshaoyu <huangshaoyu@...wei.com>,
Wuquanming <wuquanming@...wei.com>,
Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH RESEND] arm64: fault: avoid send SIGBUS two times
Hi gengdongjiu,
On 08/12/17 04:43, gengdongjiu wrote:
> by the way, I think also change the info.si_code to "BUS_MCEERR_AR" is better, as shown [1].
> BUS_MCEERR_AR can tell user space "Hardware memory error consumed on a error; action required".
Today its also used as the last-resort. This signal tells user-space the page
can't be re-read from disk/swap, and its been unmapped from all affected processes.
I think using it like this (tempting as it is) changes the meaning.
> so it is better than "0". In the X86 platform, it also use the "BUS_MCEERR_AR" for si_code[2] in "arch/x86/mm/fault.c".
> what do you think about it?
This is heading into kernel-first territory, I'd prefer we do that all at once
so we know everything is covered.
> [2]:
> arch/x86/mm/fault.c:
>
> static void
> do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> u32 *pkey, unsigned int fault)
> {
> ......
> #ifdef CONFIG_MEMORY_FAILURE
> if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
These VM_FAULT flags indicate memory_failure() has run, tried to re-read the
memory from disk/swap, failed, and unmapped the page from all affected processes.
> printk(KERN_ERR
> "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> tsk->comm, tsk->pid, address);
> code = BUS_MCEERR_AR;
> }
> #endif
> force_sig_info_fault(SIGBUS, code, address, tsk, pkey, fault);
> }
This is x86's page fault handler, not its Machine-Check-Exception handler.
arm64's page fault handler does this too, from do_page_fault():
> } else if (fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) {
> sig = SIGBUS;
> code = BUS_MCEERR_AR;
If you're seeing this, its likely due to the race Xie XiuQi spotted where the
recovery action has been queued, then we return to user-space before its done.
I had a go at tackling this, adding helpers to kick the assorted queues, which
we can do if we took the exception from user-space. Where I got stuck is whether
we should still force a signal, and how signals get merged. I'll try and spend
some more time on that this week.
Thanks,
James
Powered by blists - more mailing lists