linux-kernel - RE: [V5 PATCH 2/4] panic/x86: Allow cpus to save registers even if they are looping in NMI context

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <04EAB7311EE43145B2D3536183D1A84454A2B929@GSjpTKYDCembx31.service.hitachi.net>
Date:	Wed, 25 Nov 2015 09:46:37 +0000
From:	河合英宏 / KAWAI，HIDEHIRO 
	<hidehiro.kawai.ez@...achi.com>
To:	"'Borislav Petkov'" <bp@...en8.de>
CC:	Jonathan Corbet <corbet@....net>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Vivek Goyal <vgoyal@...hat.com>, Baoquan He <bhe@...hat.com>,
	"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
	"x86@...nel.org" <x86@...nel.org>,
	"kexec@...ts.infradead.org" <kexec@...ts.infradead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Michal Hocko <mhocko@...nel.org>,
	Ingo Molnar <mingo@...hat.com>,
	平松雅巳 / HIRAMATU，MASAMI 
	<masami.hiramatsu.pt@...achi.com>
Subject: RE: [V5 PATCH 2/4] panic/x86: Allow cpus to save registers even if
 they are looping in NMI context

> On Wed, Nov 25, 2015 at 05:51:59AM +0000, 河合英宏 / KAWAI，HIDEHIRO wrote:
> > > > `Infinite loop in NMI context' can happen:
> > > >
> > > >   a. when a cpu panics on NMI while another cpu is processing panic
> > > >   b. when a cpu received an external or unknown NMI while another
> > > >      cpu is processing panic on NMI
> > > >
> > > > In the case of a, it loops in panic_smp_self_stop().  In the case
> > > > of b, it loops in raw_spin_lock() of nmi_reason_lock.
> > >
> > > Please describe those two cases more verbosely - it takes slow people
> > > like me a while to figure out what exactly can happen.
> >
> >   a. when a cpu panics on NMI while another cpu is processing panic
> >      Ex.
> >      CPU 0                     CPU 1
> >      =================         =================
> >      panic()
> >        panic_cpu <-- 0
> >        check panic_cpu
> >        crash_kexec()
> >                                receive an unknown NMI
> >                                unknown_nmi_error()
> >                                  nmi_panic()
> >                                    panic()
> >                                      check panic_cpu
> >                                      panic_smp_self_stop()
> >                                        infinite loop in NMI context
> >
> >   b. when a cpu received an external or unknown NMI while another
> >      cpu is processing panic on NMI
> >      Ex.
> >      CPU 0                     CPU 1
> >      ======================    ==================
> >      receive an unknown NMI
> >      unknown_nmi_error()
> >        nmi_panic()             receive an unknown NMI
> >          panic_cpu <-- 0       unknown_nmi_error()
> >          panic()                 nmi_panic()
> >            check panic_cpu         panic
> >            crash_kexec()             check panic_cpu
> >                                      panic_smp_self_stop()
> >                                        infinite loop in NMI context
> 
> Ok, that's what I saw too, thanks for confirming.
> 
> But please write those examples with the *old* code in the commit
> message, i.e. without panic_cpu and nmi_panic() which you're adding.

Does *old* code mean the code without this patch *series* ?
panic_cpu and nmi_panic() are introduced by PATCH 1/4, not this patch.

> Basically, you want to structure your commit message this way:
> 
> This is the problem the current code has: ...
> 
> But we need to do this: ...
> 
> We fix it by doing that: ...

Good suggestion!  I'll revise a bit with following your comment.

> > > > + * directly.  This function is used when we have already been in NMI handler.
> > > > + */
> > > > +void poll_crash_ipi_and_callback(struct pt_regs *regs)
> > >
> > > Why "poll"? We won't return from crash_nmi_callback() if we're not the
> > > crashing CPU.
> >
> > This function polls that crash IPI has been issued by checking
> > crash_ipi_done, then invokes the callback.  This is different
> > from so-called "poll" functions.  Do you have some good name?
> 
> Maybe something as simple as "run_crash_callback"?

I prefer this, but we might want to add some more prefix or suffix.
For example, "conditionally_run_crash_nmi_callback".

> Or since we're calling it from other places, maybe add the "crash"
> prefix:
> 
> 	while (!raw_spin_trylock(&nmi_reason_lock))
> 		crash_run_callback(regs);
> 
> Looks even better to me in code context. :)

Thanks for your deep review!

--
Hidehiro Kawai
Hitachi, Ltd. Research & Development Group