Message-ID: <CALCETrUpqSpcKFRM6a1zJWebEVZxNd-5pyBW4fU19+HgcBv+2Q@mail.gmail.com>
Date: Wed, 12 Nov 2014 19:03:21 -0800
From: Andy Lutomirski <luto@...capital.net>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Oleg Nesterov <oleg@...hat.com>, Borislav Petkov <bp@...en8.de>,
X86 ML <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Peter Zijlstra <peterz@...radead.org>,
Andi Kleen <andi@...stfloor.org>
Subject: Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

On Wed, Nov 12, 2014 at 4:31 PM, Luck, Tony <tony.luck@...el.com> wrote:
>> v2's not going to make a difference unless you're using uprobes at the
>> same time.
>
> Not (knowingly) using uprobes. System is installed with a RHEL7 userspace ... but is essentially
> idle except for my test program.
>
>> In the interest of my sanity, can you add something like
>> BUG_ON(!user_mode_vm(regs)) or the mce_panic equivalent before calling
>> memory_failure?
>
> I don't think that can possibly trip - we can only end up with a recoverable error from
> a user mode access. But I'll see about adding it anyway.
>
>> What happens if there's a shared bank but the actual offender has a
>> higher order than the cpu that finds the error?
>
> This test case injects a memory error which is logged in bank1. This bank is shared by the
> two hyperthreads that are on the same core. The mce_severity() function distinguishes
> which is the active thread and which the innocent bystander by looking at MCG_STATUS.
> In the active thread MCG_STATUS.EIPV is 1, in the bystander it is 0. The returned severity
> is MCE_AR_SEVERITY for the thread that hit the error, MCE_KEEP_SEVERITY for the bystander.
> So it doesn't matter which thread has the lower order and sees it first.
>
>> Is this something I can try under KVM?
>
> I don't know if KVM has a way to simulate a machine check event.
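
The EIPV-based distinction described in the quoted text can be sketched as follows. This is a simplified user-space illustration, not the kernel's mce_severity(), which consults a larger rule table. The helper name grade_shared_bank_error and the enum values are assumptions made for the example; only the EIPV bit position (bit 1 of MCG_STATUS) and the two severity names come from the architecture and the thread.

```c
#include <stdint.h>

/* MCG_STATUS.EIPV is bit 1 of the IA32_MCG_STATUS MSR. */
#define MCG_STATUS_EIPV (1ULL << 1)

/* Only the two outcomes named in the thread; the numeric values are illustrative. */
enum severity_sketch {
	MCE_KEEP_SEVERITY,
	MCE_AR_SEVERITY,
};

/*
 * Hypothetical helper: grade an error logged in a bank shared by two
 * hyperthreads, using this CPU's own MCG_STATUS value.
 */
static enum severity_sketch grade_shared_bank_error(uint64_t mcgstatus)
{
	/* EIPV set: this thread's instruction pointer is tied to the error. */
	if (mcgstatus & MCG_STATUS_EIPV)
		return MCE_AR_SEVERITY;

	/* EIPV clear: innocent bystander that merely shares the bank. */
	return MCE_KEEP_SEVERITY;
}
```

Because each thread grades the error from its own MCG_STATUS, the result does not depend on which thread reads the shared bank first.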
printk seems to work just fine in do_machine_check. Any chance you
can instrument, for each cpu, all entries to do_machine_check, all
calls to do_machine_check, all returns, and everything that tries to
do memory_failure?
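
A rough sketch of the kind of per-cpu instrumentation meant here, written as a user-space mock so it can be compiled on its own. Everything in it is a stand-in: current_cpu plays the role of smp_processor_id(), trace_buf plays the role of the kernel log, and mce_trace, fake_memory_failure and do_machine_check_sketch are hypothetical names. In the real handler each mce_trace() call would be a printk at the corresponding point.

```c
#include <stdio.h>
#include <string.h>

/* User-space stand-ins (assumptions) so the sketch builds outside the kernel. */
static int current_cpu;        /* stands in for smp_processor_id() */
static char trace_buf[512];    /* stands in for the kernel log */

/* Append one "cpu N: event" line to the trace buffer. */
static void mce_trace(const char *event)
{
	size_t used = strlen(trace_buf);

	snprintf(trace_buf + used, sizeof(trace_buf) - used,
		 "cpu %d: %s\n", current_cpu, event);
}

/* Placeholder for memory_failure(); always reports success. */
static int fake_memory_failure(unsigned long pfn)
{
	(void)pfn;
	return 0;
}

/* Mock handler: log entry, the memory_failure attempt, and exit. */
static void do_machine_check_sketch(int recoverable, unsigned long pfn)
{
	mce_trace("enter do_machine_check");
	if (recoverable) {
		mce_trace("call memory_failure");
		fake_memory_failure(pfn);
		mce_trace("return from memory_failure");
	}
	mce_trace("exit do_machine_check");
}
```

With a trace like this from every cpu, an "enter" line that never gets a matching "exit", or a "call memory_failure" with no "return", would show where things stop.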
Also, shouldn't there be a local_irq_enable before memory_failure and
a local_irq_disable after it? It wouldn't surprise me if you've
deadlocked somewhere. Lockdep could also have something interesting
to say.
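
The bracketing being asked about would look roughly like this. It is a user-space mock under stated assumptions: local_irq_enable() and local_irq_disable() are real kernel primitives, but here they are replaced by stubs that flip a flag, and memory_failure_stub and recover_with_irqs_on are hypothetical names with a simplified signature.

```c
#include <stdbool.h>

/* Mock interrupt state; the #MC handler is entered with interrupts off. */
static bool irqs_enabled;
static bool irqs_on_during_call;

/* Stubs standing in for the kernel primitives of the same name. */
static void local_irq_enable(void)  { irqs_enabled = true; }
static void local_irq_disable(void) { irqs_enabled = false; }

/* Placeholder for memory_failure(); records the irq state it was called in. */
static int memory_failure_stub(unsigned long pfn)
{
	(void)pfn;
	irqs_on_during_call = irqs_enabled;
	return 0;
}

/* Enable interrupts only for the duration of the recovery call. */
static int recover_with_irqs_on(unsigned long pfn)
{
	int ret;

	local_irq_enable();
	ret = memory_failure_stub(pfn);
	local_irq_disable();

	return ret;
}
```

The point of the pattern is that anything memory_failure needs that depends on interrupts being delivered can make progress, while the caller still gets control back with interrupts off, as it had on entry.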
(Although I'm a bit confused. A deadlock in memory_failure shouldn't
cause the particular failure mode you're seeing, since a new #MC
should still be deliverable. Is it possible that we really need an
IRET to unmask NMIs? This seems unlikely.)

--Andy