linux-kernel - Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user recovery

Message-ID: <20210126223605.GA14355@agluck-desk2.amr.corp.intel.com>
Date:   Tue, 26 Jan 2021 14:36:05 -0800
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     x86@...nel.org, Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Darren Hart <dvhart@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v5] x86/mce: Avoid infinite loop for copy from user
 recovery

On Tue, Jan 26, 2021 at 12:03:14PM +0100, Borislav Petkov wrote:
> On Mon, Jan 25, 2021 at 02:55:09PM -0800, Luck, Tony wrote:
> > And now I've changed it back to non-atomic (but keeping the
> > slightly cleaner looking code style that I used for the atomic
> > version).  This one also works for thousands of injections and
> > recoveries.  Maybe take it now before it stops working again :-)
> 
> Hmm, so the only differences I see between your v4 and this are:
> 
> -@@ -1238,6 +1238,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
> +@@ -1238,6 +1238,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
>   
>   static void kill_me_now(struct callback_head *ch)
>   {
> ++	struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
> ++
>  +	p->mce_count = 0;
>   	force_sig(SIGBUS);
>   }
> 
> Could the container_of() macro have changed something?

That change was to fix my brown paper bag moment (the code does not
compile without a variable named "p" in scope for use on the
next line).
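
For readers following the quoted hunk, here is a minimal userspace sketch of what the added container_of() line does: it recovers a pointer to the enclosing structure from a pointer to one of its members. The macro below is a simplified stand-in (the kernel's version adds type checking), and struct fake_task and fake_kill_me_now() are hypothetical, cut-down names used only for illustration.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down stand-in for the kernel's struct callback_head. */
struct callback_head {
	void (*func)(struct callback_head *head);
};

/* Hypothetical stand-in for the task_struct fields touched by the patch. */
struct fake_task {
	int mce_count;
	struct callback_head mce_kill_me;
};

/*
 * Same shape as the patched kill_me_now(): the callback only receives
 * the embedded callback_head, so it has to recover the owning task
 * before it can reset the counter.
 */
static void fake_kill_me_now(struct callback_head *ch)
{
	struct fake_task *p = container_of(ch, struct fake_task, mce_kill_me);

	p->mce_count = 0;
}
```

Under these assumptions the macro is pure pointer arithmetic on the address it is handed, so on its own it would not be expected to change when or how mce_count is read.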

> Because we don't know yet (right?) why would it fail? Would it read
> stale ->mce_count data? If so, then a barrier is missing somewhere.

I don't see how a barrier would make a difference. In the common case
all this code is executed on the same logical CPU. The return from
do_machine_check() tries to go back to user mode and finds that there
is some "task_work" to execute first.

In some cases Linux might context switch to something else. Perhaps
this task even gets picked up by another CPU to run the task work
queued functions.  But I imagine that the context switch should act
as a barrier ... shouldn't it?

> Or what is the failure exactly?

After a few cycles of the test injection to user mode, I saw an
overflow in the machine check bank. As if it hadn't been cleared
from the previous iteration ... but all the banks are cleared as
soon as we find that the machine check is recoverable, well before
execution reaches the code I changed.

When the tests were failing, the code was on top of v5.11-rc3. The latest
experiments moved to -rc5.  There's just a tracing fix from
PeterZ between rc3 and rc5 to mce/core.c:

737495361d44 ("x86/mce: Remove explicit/superfluous tracing")

which doesn't appear to be a candidate for the problems I saw.

> Because if I take it now without us knowing what the issue is, it will
> start failing somewhere - Murphy's our friend - and then we'll have to
> deal with breaking people's boxes. Not fun.

Fair point.

> The other difference is:
> 
> @@ -76,8 +71,10 @@ index 13d3f1cbda17..5460c146edb5 100644
>  -	current->mce_kflags = m->kflags;
>  -	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
>  -	current->mce_whole_page = whole_page(m);
> ++	int count = ++current->mce_count;
> ++
>  +	/* First call, save all the details */
> -+	if (current->mce_count++ == 0) {
> ++	if (count == 1) {
>  +		current->mce_addr = m->addr;
>  +		current->mce_kflags = m->kflags;
>  +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> 
> Hmm, a local variable and a pre-increment. Can that have an effect somehow?

This is the bit that changed during my detour using atomic_t mce_count.
I added the local variable to capture value from atomic_inc_return(), then
used it later, instead of a bunch of atomic_read() calls.
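
As a rough illustration of that atomic detour, the sketch below uses C11 <stdatomic.h> in userspace as a stand-in for the kernel's atomic_t. fake_mce_count, fake_atomic_inc_return() and note_fault() are hypothetical names; the helper only approximates atomic_inc_return() by returning the value after the increment.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for an atomic_t mce_count. */
static atomic_int fake_mce_count;

/*
 * Approximation of the kernel's atomic_inc_return(): increment the
 * counter and hand back the new value. atomic_fetch_add() returns the
 * old value, hence the "+ 1".
 */
static int fake_atomic_inc_return(atomic_int *v)
{
	return atomic_fetch_add(v, 1) + 1;
}

/*
 * Capture the incremented value once in a local and test that copy,
 * rather than re-reading the shared counter for each later check.
 * Returns nonzero only for the first fault.
 */
static int note_fault(void)
{
	int count = fake_atomic_inc_return(&fake_mce_count);

	return count == 1;
}
```

The local copy means every later comparison sees the value produced by this call's own increment, even if the shared counter has moved on in the meantime.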

I kept it this way because "if (count == 1)" is marginally easier to read
than "if (current->mce_count++ == 0)".
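
To make the comparison concrete, here is a small userspace sketch of the two spellings of the "first call" test from the quoted hunk. struct mce_state, record_v4() and record_v5() are hypothetical names, and the structure is a cut-down stand-in for the per-task fields. In this single-threaded model both forms save the address on the first call only and leave the counter at the same value.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task MCE bookkeeping fields. */
struct mce_state {
	int mce_count;
	unsigned long long mce_addr;
};

/* v4 spelling: post-increment, compare the old value against 0. */
static bool record_v4(struct mce_state *s, unsigned long long addr)
{
	if (s->mce_count++ == 0) {
		s->mce_addr = addr;	/* first call, save the details */
		return true;
	}
	return false;
}

/* v5 spelling: pre-increment into a local, compare the new value against 1. */
static bool record_v5(struct mce_state *s, unsigned long long addr)
{
	int count = ++s->mce_count;

	if (count == 1) {
		s->mce_addr = addr;	/* first call, save the details */
		return true;
	}
	return false;
}
```

Calling either function repeatedly on a zeroed struct mce_state returns true once and false afterwards, which is consistent with the choice between them being a readability matter rather than a behavioral one.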

> > +	/* Ten is likley overkill. Don't expect more than two faults before task_work() */
> 
> Typo: likely.

Oops. Fixed.

-Tony