lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 19 Jan 2021 15:57:59 -0800
From:   "Luck, Tony" <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     x86@...nel.org, Andrew Morton <akpm@...ux-foundation.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Darren Hart <dvhart@...radead.org>,
        Andy Lutomirski <luto@...nel.org>,
        linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [PATCH v4] x86/mce: Avoid infinite loop for copy from user
 recovery

On Tue, Jan 19, 2021 at 11:56:32AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:23:46PM -0800, Luck, Tony wrote:
> > On Fri, Jan 15, 2021 at 12:51:03PM -0800, Luck, Tony wrote:
> > >  static void kill_me_now(struct callback_head *ch)
> > >  {
> > > +	p->mce_count = 0;
> > >  	force_sig(SIGBUS);
> > >  }
> > 
> > Brown paper bag time ... I just pasted that line from kill_me_maybe()
> > and I thought I did a re-compile ... but obviously not since it gives
> > 
> > error: ā€˜pā€™ undeclared (first use in this function)
> > 
> > Option a) (just like kill_me_maybe)
> > 
> > struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> > 
> > Option b) (simpler ... not sure why PeterZ did the container_of thing
> > 
> > 	current->mce_count = 0;
> 
> Right, he says it is the canonical way to get it out of callback_head.
> I don't think current will change while the #MC handler runs but we can
> adhere to the design pattern here and do container_of() ...

Ok ... I'll use the canonical way.

But now I've run into a weird issue. I'd run some basic tests with a
dozen machine checks in each of:
1) user access
2) kernel copyin
3) futex (multiple accesses from kernel before task_work())

and it passed my tests before I posted.

But the real validation folks took my patch and found that it has
destabilized cases 1 & 2 (and case 3 also chokes if you repeat a few
more times). System either hangs or panics. Generally before 100
injection/conumption cycles.

Their tests are still just doing one at a time (i.e. complete recovery
of one machine cehck before injecting the next error). So there aren't
any complicated race conditions.

So if you see anything obviously broken, let me know. Otherwise I'll
be poking around at the patch to figure out what is wrong.

-Tony

Powered by blists - more mailing lists