linux-kernel - Re: WARNING: CPU: 1 PID: 0 at kernel/time/tick-broadcast.c:668 tick_broadcast_oneshot

Message-ID: <alpine.DEB.2.02.1402111702090.21991@ionos.tec.linutronix.de>
Date:	Tue, 11 Feb 2014 17:07:53 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Stanislaw Gruszka <sgruszka@...hat.com>
cc:	poma <pomidorabelisima@...il.com>,
	Linux Kernel list <linux-kernel@...r.kernel.org>,
	linux-pm@...r.kernel.org, Olaf Hering <olaf@...fle.de>,
	Dave Jones <davej@...hat.com>,
	"Justin M. Forbes" <jforbes@...hat.com>,
	Josh Boyer <jwboyer@...hat.com>,
	Mailing-List fedora-kernel <kernel@...ts.fedoraproject.org>
Subject: Re: WARNING: CPU: 1 PID: 0 at kernel/time/tick-broadcast.c:668
 tick_broadcast_oneshot_control+0x17d/0x190()

On Tue, 11 Feb 2014, Stanislaw Gruszka wrote:

> On Mon, Feb 10, 2014 at 07:59:39PM +0100, poma wrote:
> > On 10.02.2014 11:06, Thomas Gleixner wrote:
> > > On Mon, 10 Feb 2014, poma wrote:
> > > 
> > >> [   83.558551]  [<ffffffff81025b17>] amd_e400_idle+0x87/0x130
> > > 
> > > So this seems to happen only on AMD machines which use that e400 idle
> > > mode. I have no idea at the moment what's wrong there. I'll find one of
> > > those machines and try to reproduce.
> 
> I tried to debug that warning as well. Even though I found a machine with
> the proper family and model number, the HW C1E bug does not happen there,
> hence I just hacked the kernel to always use amd_e400_idle (and removed the
> AMD-specific rdmsr instructions so that it does not crash). That makes the
> issue 100% reproducible on suspend/resume.

It's also reproducible on cpu online/offline.
 
> It happens when a CPU becomes idle and calls
> CLOCK_EVT_NOTIFY_BROADCAST_ENTER, but an interrupt triggers on that CPU
> before CLOCK_EVT_NOTIFY_BROADCAST_EXIT. The IRQ is handled by the hrtimer
> code, which wants to switch to hres and calls:
> 
> tick_switch_to_oneshot() -> ... -> tick_broadcast_setup_oneshot()
> 
> Since we already have the proper handler there, the last procedure clears
> tick_broadcast_oneshot_mask, but tick_broadcast_pending_mask stays
> set. The next time amd_e400_idle calls CLOCK_EVT_NOTIFY_BROADCAST_ENTER,
> the warning will happen.
> 
> I came up with the patch below, which also clears the pending mask, but perhaps

Fun. I came up with the exact same solution independently of you and I
tested it on real C1E-contaminated hardware.

> oneshot_mask should not be cleared on tick_broadcast_setup_oneshot(),
> or should be cleared only conditionally, or some other solution is

We can do it unconditionally. It creates consistent state in all
corner cases.
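
For readers following along, here is a minimal userspace sketch of the
sequence described above. It is a toy model, not the kernel code and not
the patch itself: the two masks are plain bitmasks, all function names
are hypothetical, and the way the pending bit gets set (by the broadcast
handler, for a CPU that is in the oneshot mask) is a simplifying
assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one bit per CPU, standing in for the kernel cpumasks. */
static unsigned long oneshot_mask;  /* models tick_broadcast_oneshot_mask */
static unsigned long pending_mask;  /* models tick_broadcast_pending_mask */
static bool warned;                 /* set where the kernel would warn */

static unsigned long cpu_bit(int cpu)
{
	assert(cpu >= 0 && cpu < 32);
	return 1UL << cpu;
}

/* Models CLOCK_EVT_NOTIFY_BROADCAST_ENTER for one CPU. */
static void broadcast_enter(int cpu)
{
	unsigned long bit = cpu_bit(cpu);

	if (!(oneshot_mask & bit)) {
		oneshot_mask |= bit;
		if (pending_mask & bit)
			warned = true;	/* stale pending bit: the reported warning */
	}
}

/* Models CLOCK_EVT_NOTIFY_BROADCAST_EXIT: the pending bit is only
 * cleaned up if the CPU is still marked in the oneshot mask. */
static void broadcast_exit(int cpu)
{
	unsigned long bit = cpu_bit(cpu);

	if (oneshot_mask & bit) {
		oneshot_mask &= ~bit;
		pending_mask &= ~bit;
	}
}

/* Assumption: the broadcast handler marks an idle CPU as pending. */
static void broadcast_handler_marks_pending(int cpu)
{
	unsigned long bit = cpu_bit(cpu);

	if (oneshot_mask & bit)
		pending_mask |= bit;
}

/* Models the switch to oneshot mode; clear_pending selects the fix. */
static void setup_oneshot(bool clear_pending)
{
	oneshot_mask = 0;
	if (clear_pending)
		pending_mask = 0;
}

/* Replays the sequence on CPU 1 and reports whether the warning fired. */
bool run_scenario(bool clear_pending)
{
	const int cpu = 1;

	oneshot_mask = 0;
	pending_mask = 0;
	warned = false;

	broadcast_enter(cpu);			/* CPU goes idle */
	broadcast_handler_marks_pending(cpu);	/* pending bit gets set */
	setup_oneshot(clear_pending);		/* IRQ path switches to oneshot */
	broadcast_exit(cpu);			/* oneshot bit already gone */
	broadcast_enter(cpu);			/* next idle entry */
	return warned;
}
```

In this model run_scenario(false) ends with the warning flag set, while
run_scenario(true) does not, because the setup step leaves both masks
in a consistent state.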

There are other solutions to the problem, but they need a major
rework of the broadcast code. I so wish that this mess had
never been necessary at all ...

Thanks,

	tglx