linux-kernel - Re: [PATCH] a patch to fix the cpu-offline-online problem caused by pm


Message-ID: <1296469006.15234.359.camel@laptop>
Date:	Mon, 31 Jan 2011 11:16:46 +0100
From:	Peter Zijlstra <peterz@...radead.org>
To:	Luming Yu <luming.yu@...il.com>
Cc:	LKML <linux-kernel@...r.kernel.org>, Len Brown <lenb@...nel.org>,
	"H. Peter Anvin" <hpa@...or.com>, tglx <tglx@...utronix.de>
Subject: Re: [PATCH] a patch to fix the cpu-offline-online problem caused
 by pm_idle

On Sun, 2011-01-30 at 22:26 -0500, Luming Yu wrote:

> > Guessing is totally the wrong thing when you're sending stuff upstream,
> > esp ugly patches such as this. .32 is more than a year old, anything
> > could have happened.
> 
> Ok. the default upstream kernel seems to have NMI watchdog disabled?

Then enable it already, it's a single CONFIG option away.

> It's not working because of NMI watchdog. If you ignore NMI watchdog,
> then I guess it works but just slow..

Don't guess, test it, dammit. And then figure out why it triggers; I
haven't seen _anything_ that would cause it to trigger, nor a sane
explanation for your patch.

> > Ok, so one IPI costs 50-100 us, even with 64 cpu, that's at most 6.4ms
> > nowhere near enough to trigger the NMI watchdog. So what does go wrong?
> 
> Good question!
> But we also can't forget there were large latency from C3.

Not 60+ seconds large, I hope. I know NHM-EX has some suckage, but surely
not that bad?
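
For reference, the arithmetic behind the 6.4 ms bound quoted above can be
sketched as follows. This is only an illustration under stated assumptions:
the IPIs are taken to be delivered strictly one after another, each at the
upper 100 us cost, and the 60 second figure from the remark above is used
as a stand-in for the watchdog threshold (the real threshold depends on the
kernel version and configuration). The function and constant names are
hypothetical.

```python
# Back-of-the-envelope check of the IPI cost estimate from the thread.
# All values are assumptions taken from the discussion, not measurements.

IPI_COST_US = 100        # upper end of the quoted 50-100 us per IPI
NUM_CPUS = 64            # CPU count used in the quoted estimate
THRESHOLD_US = 60 * 1_000_000  # "60+ seconds", assumed watchdog threshold


def worst_case_ipi_us(num_cpus: int, ipi_cost_us: int) -> int:
    """Total cost in microseconds if one IPI per CPU is sent serially."""
    return num_cpus * ipi_cost_us


def headroom_factor(threshold_us: int, total_us: int) -> int:
    """How many times the total IPI cost fits into the threshold."""
    return threshold_us // total_us


total_us = worst_case_ipi_us(NUM_CPUS, IPI_COST_US)
factor = headroom_factor(THRESHOLD_US, total_us)

print(f"worst-case IPI cost: {total_us / 1000} ms")
print(f"threshold is about {factor}x larger")
```

Under these assumptions the IPI cost is roughly four orders of magnitude
below the threshold, which is why IPI latency alone does not explain the
watchdog firing.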

> And I guess some reschedule ticks get lost to kick some CPUs out of
> idle due to the side effects of the CPU PM feature. if use nohz=off,
> everything seems to just work.
> Yes, I agree we need to dig it out either.
> But it's kind of combination problem between the special stop_machine
> context and CPU power management...

Yeah, so? Also, incidentally, stop_machine got a rewrite around .35 and
saw significant changes again in .37, so please do test mainline and not
your dinosaur.

> > Yeah, what are you smoking? Why do you wreck perfectly fine code for one
> > backward ass piece of hardware.
> 
> Just make things less complex...

But it's wrong: it very clearly works around a real problem. Don't ever
do that; fix the problem!
