Message-ID: <20190729205059.GA1127@roeck-us.net>
Date:   Mon, 29 Jul 2019 13:50:59 -0700
From:   Guenter Roeck <linux@...ck-us.net>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
        Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
        Borislav Petkov <bp@...en8.de>
Subject: Re: sched: Unexpected reschedule of offline CPU#2!

On Mon, Jul 29, 2019 at 12:47:45PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 12:38:30PM +0200, Thomas Gleixner wrote:
> > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > On Mon, Jul 29, 2019 at 11:58:24AM +0200, Thomas Gleixner wrote:
> > > > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > > > On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > > > > > [   61.348866] Call Trace:
> > > > > > [   61.349392]  kick_ilb+0x90/0xa0
> > > > > > [   61.349629]  trigger_load_balance+0xf0/0x5c0
> > > > > > [   61.349859]  ? check_preempt_wakeup+0x1b0/0x1b0
> > > > > > [   61.350057]  scheduler_tick+0xa7/0xd0
> > > > > 
> > > > > kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
> > > > > 
> > > > > idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> > > > > nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> > > > > __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> > > > > cpu_is_offline() clause of the idle loop.
> > > > > 
> > > > > However, when offline, cpu_active() should also be false, and this
> > > > > function should no-op.
> > > > 
> > > > Ha. That reboot mess is not clearing cpu active as it's not going through
> > > > the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
> > > > dead in their tracks after clearing cpu online....
> > > 
> > > $string-of-cock-compliant-curses
> > > 
> > > What a trainwreck...
> > > 
> > > So if it doesn't play by the normal rules; how does it expect to work?
> > > 
> > > So what do we do? 'Fix' reboot or extend the rules?
> > 
> > Reboot has two modes:
> > 
> >  - Regular reboot initiated from user space
> > 
> >  - Panic reboot
> > 
> > For the regular reboot we can make it go through proper hotplug, 
> 
> That seems sensible.
> 
> > for the panic case not so much.
> 
> It's panic, shit has already hit fan, one or two more pieces shouldn't
> be something anybody cares about.
> 
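
For reference, the failure mode described in the quoted analysis can be
modelled in a few lines of user-space C. This is only a rough sketch, not
kernel code: the struct, the helper names and the 4-CPU bitmask layout are
all made up for illustration, and it assumes that the regular hotplug path
takes the CPU out of nohz idle before clearing active and online, while the
reboot IPI path clears online and nothing else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4
#define NO_CPU  (-1)

/* Toy stand-ins for the kernel cpumasks, one bit per CPU (hypothetical). */
struct cpu_state {
	uint32_t online;    /* models cpu_online_mask */
	uint32_t active;    /* models cpu_active_mask */
	uint32_t nohz_idle; /* models nohz.idle_cpus_mask */
};

static bool test_cpu(uint32_t mask, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (mask >> cpu) & 1u;
}

/* Models nohz_balance_enter_idle(): a no-op unless the CPU is active. */
void nohz_enter_idle(struct cpu_state *s, int cpu)
{
	if (!test_cpu(s->active, cpu))
		return;
	s->nohz_idle |= 1u << cpu;
}

/* Models nohz_balance_exit_idle(): drop the CPU from the idle mask. */
void nohz_exit_idle(struct cpu_state *s, int cpu)
{
	s->nohz_idle &= ~(1u << cpu);
}

/*
 * Assumed shape of the regular hotplug path: the CPU leaves nohz idle,
 * active and online are cleared, and the final idle entry is a no-op
 * because the CPU is no longer active.
 */
void hotplug_offline(struct cpu_state *s, int cpu)
{
	nohz_exit_idle(s, cpu);
	s->active &= ~(1u << cpu);
	s->online &= ~(1u << cpu);
	nohz_enter_idle(s, cpu);
}

/* Assumed shape of the reboot IPI path: only online is cleared. */
void reboot_ipi_stop(struct cpu_state *s, int cpu)
{
	s->online &= ~(1u << cpu);
}

/* Models the kick_ilb() search: first CPU in the idle mask that is not us. */
int pick_ilb(const struct cpu_state *s, int self)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != self && test_cpu(s->nohz_idle, cpu))
			return cpu;
	return NO_CPU;
}

/* True if the kick would send a reschedule to a CPU that is not online. */
bool kick_hits_offline_cpu(const struct cpu_state *s, int self)
{
	int target = pick_ilb(s, self);

	return target != NO_CPU && !test_cpu(s->online, target);
}
```

In this model, a CPU that was nohz idle and then went through
hotplug_offline() is never picked, whereas one stopped by
reboot_ipi_stop() stays in the idle mask and is picked while offline,
which corresponds to the "Unexpected reschedule of offline CPU" case.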

Some more digging shows that this happens a lot with Google GCE instances,
typically after a panic. The problem with that, if I understand correctly,
is that it may prevent coredumps from being written. So, while of course
the panic is what needs to be fixed, it is still quite annoying, and it
would help if this can be fixed for panic handling as well.

How about the patch suggested by Hillf Danton ? Would that help for the
panic case ?

Thanks,
Guenter
