Message-ID: <20190729205059.GA1127@roeck-us.net>
Date: Mon, 29 Jul 2019 13:50:59 -0700
From: Guenter Roeck <linux@...ck-us.net>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Thomas Gleixner <tglx@...utronix.de>, x86@...nel.org,
Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
Borislav Petkov <bp@...en8.de>
Subject: Re: sched: Unexpected reschedule of offline CPU#2!

On Mon, Jul 29, 2019 at 12:47:45PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 12:38:30PM +0200, Thomas Gleixner wrote:
> > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > On Mon, Jul 29, 2019 at 11:58:24AM +0200, Thomas Gleixner wrote:
> > > > On Mon, 29 Jul 2019, Peter Zijlstra wrote:
> > > > > On Sat, Jul 27, 2019 at 09:44:50AM -0700, Guenter Roeck wrote:
> > > > > > [ 61.348866] Call Trace:
> > > > > > [ 61.349392] kick_ilb+0x90/0xa0
> > > > > > [ 61.349629] trigger_load_balance+0xf0/0x5c0
> > > > > > [ 61.349859] ? check_preempt_wakeup+0x1b0/0x1b0
> > > > > > [ 61.350057] scheduler_tick+0xa7/0xd0
> > > > >
> > > > > kick_ilb() iterates nohz.idle_cpus_mask to find itself an idle_cpu().
> > > > >
> > > > > idle_cpus_mask() is set from nohz_balance_enter_idle() and cleared from
> > > > > nohz_balance_exit_idle(). nohz_balance_enter_idle() is called from
> > > > > __tick_nohz_idle_stop_tick() when entering nohz idle, this includes the
> > > > > cpu_is_offline() clause of the idle loop.
> > > > >
> > > > > However, when offline, cpu_active() should also be false, and this
> > > > > function should no-op.
> > > >
> > > > Ha. That reboot mess is not clearing cpu active as it's not going through
> > > > the regular cpu hotplug path. It's using reboot IPI which 'stops' the cpus
> > > > dead in their tracks after clearing cpu online....
> > >
> > > $string-of-cock-compliant-curses
> > >
> > > What a trainwreck...
> > >
> > > So if it doesn't play by the normal rules; how does it expect to work?
> > >
> > > So what do we do? 'Fix' reboot or extend the rules?
> >
> > Reboot has two modes:
> >
> > - Regular reboot initiated from user space
> >
> > - Panic reboot
> >
> > For the regular reboot we can make it go through proper hotplug,
>
> That seems sensible.
>
> > for the panic case not so much.
>
> It's panic, shit has already hit fan, one or two more pieces shouldn't
> be something anybody cares about.
>
Some more digging shows that this happens a lot with Google GCE instances,
typically after a panic. The problem with that, if I understand correctly,
is that it may prevent coredumps from being written. So, while of course
the panic is what needs to be fixed, it is still quite annoying, and it
would help if this can be fixed for panic handling as well.

How about the patch suggested by Hillf Danton? Would that help for the
panic case?

Thanks,
Guenter
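
For illustration only, the sequence discussed in the quoted text can be
modelled in userspace. This is a toy sketch, not kernel code and not the
patch referred to above: the three bitmasks merely stand in for
cpu_online_mask, cpu_active_mask and nohz.idle_cpus_mask, and every
function name below is made up for the example. It assumes, as described
in the thread, that entering nohz idle is a no-op for an inactive CPU,
that regular hotplug takes the CPU out of all three masks, and that the
reboot IPI clears only the online bit. The real kick_ilb() path applies
further checks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4
#define NO_CPU  (-1)

/* Toy stand-ins for the kernel masks, one bit per CPU (hypothetical). */
struct cpu_masks {
    unsigned int online;    /* models cpu_online_mask */
    unsigned int active;    /* models cpu_active_mask */
    unsigned int nohz_idle; /* models nohz.idle_cpus_mask */
};

static unsigned int cpu_bit(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return 1u << cpu;
}

/* All CPUs online and active, none in nohz idle. */
static void boot_all(struct cpu_masks *m)
{
    m->online = m->active = (1u << NR_CPUS) - 1;
    m->nohz_idle = 0;
}

/* Entering nohz idle is a no-op for a CPU that is not active. */
static void enter_nohz_idle(struct cpu_masks *m, int cpu)
{
    if (!(m->active & cpu_bit(cpu)))
        return;
    m->nohz_idle |= cpu_bit(cpu);
}

/* Regular hotplug (assumed): the CPU leaves all three masks. */
static void hotplug_offline(struct cpu_masks *m, int cpu)
{
    m->active &= ~cpu_bit(cpu);
    m->nohz_idle &= ~cpu_bit(cpu);
    m->online &= ~cpu_bit(cpu);
}

/* Reboot IPI as described in the thread: only the online bit is cleared. */
static void reboot_ipi_stop(struct cpu_masks *m, int cpu)
{
    m->online &= ~cpu_bit(cpu);
}

/* First CPU in the nohz idle mask other than 'self', or NO_CPU. */
static int pick_kick_target(const struct cpu_masks *m, int self)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu != self && (m->nohz_idle & cpu_bit(cpu)))
            return cpu;
    }
    return NO_CPU;
}

/* True if the chosen target is offline, i.e. the case that warns. */
static bool kick_hits_offline_cpu(const struct cpu_masks *m, int self)
{
    int target = pick_kick_target(m, self);

    return target != NO_CPU && !(m->online & cpu_bit(target));
}
```

In this model, putting CPU 2 into nohz idle and then calling
reboot_ipi_stop() on it leaves a stale bit in the idle mask, so a tick on
CPU 0 still selects CPU 2 as the kick target. The same sequence with
hotplug_offline() leaves no target at all.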