linux-kernel - Re: Regression in 2.6.27 caused by commit bfc0f59

Message-ID: <alpine.LFD.1.10.0809020759480.3210@nehalem.linux-foundation.org>
Date:	Tue, 2 Sep 2008 08:09:23 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Thomas Gleixner <tglx@...utronix.de>
cc:	Larry Finger <Larry.Finger@...inger.net>,
	LKML <linux-kernel@...r.kernel.org>,
	"Rafael J. Wysocki" <rjw@...k.pl>,
	Alok Kataria <akataria@...are.com>,
	Michael Buesch <mb@...sch.de>
Subject: Re: Regression in 2.6.27 caused by commit bfc0f59

On Tue, 2 Sep 2008, Thomas Gleixner wrote:
> 
> Analysing the tsc_deltas gave interesting insight. On the affected
> laptop I had several entries where the delta between two reads was
> from 1msec up to 120msec maximum. 

Ok, that sounds like a good approach to find out whether it's done by some 
kind of emulation or not. Of course, any machine with SMM (even if it 
doesn't emulate the PIT per se - maybe it just gets some event related to 
overheating or other 'maintenance' stuff) can have occasional hiccups, but 
the '120msec' thing is, I think, the real clincher. 

Why? Because we only try to wait for 50ms in the first place! Even if 
emulation is 100% exact (or even none at all, and the PIT accesses are in 
hardware), if we have a 120ms hiccup while waiting for 50ms, then the end 
result will obviously be total crap, and yes, that sure explains how you 
can get >100% wrong values.
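
To put rough numbers on that, here is a minimal sketch of the arithmetic.
It assumes the TSC keeps counting during the stall and that the calibration
code attributes every counted cycle to the nominal wait. The helper names
and the 2 GHz figure are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the TSC frequency (in kHz, i.e. cycles per ms)
 * that a calibration would report if it believes only window_ms passed
 * while window_ms + stall_ms actually elapsed.
 */
static uint64_t tsc_khz_estimate(uint64_t true_khz, uint64_t window_ms,
                                 uint64_t stall_ms)
{
    assert(window_ms != 0);

    /* Cycles really counted across the nominal window plus the stall. */
    uint64_t cycles = true_khz * (window_ms + stall_ms);

    /* The calibration divides by the window it thinks it waited for. */
    return cycles / window_ms;
}

/* Hypothetical helper: overestimate as a percentage of the true value. */
static uint64_t tsc_error_percent(uint64_t true_khz, uint64_t est_khz)
{
    assert(true_khz != 0 && est_khz >= true_khz);
    return (est_khz - true_khz) * 100 / true_khz;
}
```

With a 2,000,000 kHz TSC, a 50ms window and a 120ms stall, the estimate
comes out at 6,800,000 kHz, which is 240% too high. The same stall inside
a 500ms window would give an error of 24%.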

> So what I'm working on is an algorithm, which is similar to the checks
> in the tsc_read_refs() function. That should allow us to detect
> whether one of the reads is way off by doing a min/max detection. In
> such a case we can either repeat the calibration or try to figure out
> whether the pmtimer / hpet can provide us with some useful reference.

I think the most trivial approach would be to

 - just keep track of the max TSC difference for each loop iteration.

 - if the max TSC difference is bigger than 1% of the total TSC delta, 
   then something is already seriously wrong (either we had very few 
   loops indeed, or some of them were very expensive)

 - perhaps loop over the calibration, and make the TSC calibration loop 
   increase the delay. Because even if there is a 120ms hiccup, if we had 
   used a longer calibration delay, we'd probably not have noticed (well, 
   ok, 120ms is pretty damning and is probably just unfixable, but smaller 
   hiccups are probably harmless)
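
The first two points above could be sketched roughly like this. It is a
userspace illustration over an array of recorded TSC readings, not kernel
code; the struct and function names are hypothetical, and only the 1%
threshold comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one calibration run. */
struct tsc_loop_stats {
    uint64_t total;      /* last reading minus first reading */
    uint64_t max_delta;  /* largest gap between consecutive readings */
    size_t loops;        /* number of loop iterations observed */
};

/* Walk the readings once, tracking the largest per-iteration delta. */
static struct tsc_loop_stats tsc_scan(const uint64_t *tsc, size_t n)
{
    struct tsc_loop_stats s = { 0, 0, 0 };

    assert(tsc != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        uint64_t delta = tsc[i] - tsc[i - 1];

        if (delta > s.max_delta)
            s.max_delta = delta;
        s.loops++;
    }
    if (n > 1)
        s.total = tsc[n - 1] - tsc[0];
    return s;
}

/*
 * A run is suspect if a single iteration took more than 1% of the total.
 * That also flags runs with fewer than 100 iterations, since the largest
 * delta is then necessarily above 1% of the total.
 */
static bool tsc_run_suspect(const struct tsc_loop_stats *s)
{
    return s->loops == 0 || s->max_delta > s->total / 100;
}
```

A caller could rerun the calibration with a longer delay whenever
tsc_run_suspect() returns true, along the lines of the third point.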

Additionally doing a min/max comparison to see that the loop is very 
_stable_ is of course also a way to validate things, but expecting _too_ 
much stability may be wrong too. As mentioned, SMM events can happen for 
other reasons than emulation.
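
A min/max check along those lines might look like the sketch below. The
function name and the max_ratio tolerance are assumptions made for
illustration; a tight tolerance would reject runs that merely took an
unrelated SMM event, which is the caveat above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stability check: returns true if the slowest loop
 * iteration took at most max_ratio times as long as the fastest one.
 * Assumes min * max_ratio does not overflow for realistic inputs.
 */
static bool tsc_deltas_stable(const uint64_t *tsc, size_t n,
                              uint64_t max_ratio)
{
    uint64_t min = UINT64_MAX;
    uint64_t max = 0;

    assert(tsc != NULL || n == 0);

    /* Fewer than two readings give no delta to judge. */
    if (n < 2)
        return false;

    for (size_t i = 1; i < n; i++) {
        uint64_t delta = tsc[i] - tsc[i - 1];

        if (delta < min)
            min = delta;
        if (delta > max)
            max = delta;
    }
    return max <= min * max_ratio;
}
```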

		Linus