linux-kernel - Re: Discussion: quick_pit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150610073009.GA14879@gmail.com>
Date:	Wed, 10 Jun 2015 09:30:09 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	George Spelvin <linux@...izon.com>
Cc:	adrian.hunter@...el.com, tglx@...utronix.de, ak@...ux.intel.com,
	hpa@...or.com, linux-kernel@...r.kernel.org, luto@...capital.net,
	torvalds@...ux-foundation.org,
	Arjan van de Ven <arjan@...radead.org>,
	Pekka Enberg <penberg@....fi>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Borislav Petkov <bp@...en8.de>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: Discussion: quick_pit_calibrate is slow


* George Spelvin <linux@...izon.com> wrote:

> Adrian Hunter wrote:
>
> > A bigger issue for my case is that "slow" calibration is not that slow, taking 
> > only 10ms anyway which is much better than the 50ms max for so-called "quick" 
> > calibration.
> 
> I read the code, and after figuring out which comments are wrong, this is 
> absolutely right.  The "quick" code takes 20 ms if it works, or 50 ms if it has 
> problems.  The fallback code can finish in 10 ms.
> 
> When bit rot has set it to this degree, something needs to be done.
> 
> Given that people would like to boot VMs in less than 1 second, 20 ms is 2% of 
> that budget, and something that is worth trying to improve.

As a side note: so VMs often want to skip the whole calibration business, because 
they are running on a well-calibrated host.

1,000 msecs is also an eternity: consider for example the KVM + tools/kvm based 
"Clear Containers" feature from Arjan:

  https://lwn.net/Articles/644675/

... which boots up a generic Linux kernel to generic Linux user-space in 32 
milliseconds, i.e. it boots in 0.03 seconds (!).

> ** quick_pit_calibrate()
> ** native_pit_calibrate()

Nice analysis.

> * The problem with te above
> 
> The regular calibration is full of ad-hoc constants and limits,
> and doesn't know how accurate a result it's producing.  The
> quick_pit_calibrate at least has a well-defined goal (500 ppm) and
> a systematic way to achieve it.
> 
> The big problem is that polling a port looking for a transition
> fundamentally leaves +/- one port read (i.e. +/-1.3 us) of uncertainty
> as to when the transition actually happened.
> 
> 
> * How to do better
> 
> If, instead, we did
> 	tsc1 = get_cycles();
> 	lsb = inb(0x42);
> 	tsc2 = get_cycles();
> 	msb = inb(0x42);
> 	delta = tsc2 - tsc1;
> 	min_delta = min(min_delta, delta);
> 	uncertainty = delta - min_delta;
> 
> and actually use the captured msb/lsb pair, we could do a lot better.
> 
> First of all, it's only one read, not two.  The 8254 timer latches the
> msbyte when the lsbyte is read and returns the latched value on the
> next read, so there's no need to include the second inb() in the
> uncertainty window.
> 
> That gets you down to about +/-0.65 us.  But we can do better!
> 
> The idea comes from NTP.  The point is that, as we're only comparing
> timer *rates*, we don't care about when in the tsc1..tsc2 window the
> PIT result was latched as long as it's consistent.
> 
> The way to do that is to keep track, over all of the hundreds of
> iterations, the *minimum* TSC delta across the inb().  This can
> be assumed to be the "speed of light", with a perfectly consistent
> time offset.
> 
> If a read takes longer than that, there is a delay.  And we don't
> know whether the delay happened before the PIT latched its result
> or after.  So we have to add the delay to the uncertainty estimate
> of the read.
> 
> But the bulk of the 1.3 us of delay can be simply ignored.  It doesn't
> matter if it's a slow emulated access via a VM or SMI trap; as long as
> the time taken is *consistent*, we can do an accurate calibration.
> 
> This reduces the read uncertainty to +/-0.5 of a PIT tick, or +/-0.42 us.
> Plus the extra uncertainty due to the read time, but we can repeat the
> read until we get a fast one with low uncertainty.
> 
> With a total of 0.84 us of read uncertaity (1/12 of quick_pit_calibrate
> currently), we can get within 500 ppm within 1.75 us.  Or do better
> within 5 or 10.

(msec you mean I suspect?)

Fully agreed otherwise: if you look at the sawtooth pattern visible in the TSC 
deltas there's a very strong, constant underlying frequency that can be recovered 
statistically with just a few dozen iterations.

> And the same technique can be applied to reading the HPET or PM timer.
> 
> The PM timer's PIIX hardware bugs make things trickier, but with some
> careful coding, it's possible to capture the TSC around just the critical
> read, and not include the surrounding validation reads.  So we can get
> the full benefit of the 280 ns clock period.

Cool.

> ** How I'd like to replace the existing code
> 
> The loop I'd write would start the PIC (and the RTC, if we want to)
> and then go round-robin reading all the time sources and associated
> TSC values.

I'd just start with the PIT to have as few balls in flight as possible.

> This gives us lots of reads to measure the minimum access time.
> 
> After a mimimum number of iterations (so the minimum access time is
> stable), when the ratio of elapsed / (start uncertainty+end uncertainty)
> is satisfactory (e.g. 2048 for current 500 ppm spec), then various
> measures are sanity-checked against each other and the best one returned.
> 
> 
> Rather than program the PIT for 10 us, I'd program it for 55 us (0xffff),
> and run until I got something good enough, or it ran out.
> 
> inb(0x61) & 0x20 is set when the PIT runs out and never implciitly cleared,
> so it's an excellent way to detect even extreme delays.  As long as
> that bit is clear, the PIT value is a reliable absolute time in the
> calibration process.

Agreed.

> * Request for comments
> 
> So, as I go about turning these ideas into code... any comments?

Could you please structure it the following way:

 - first a patch that fixes bogus comments about the current code. It has 
   bitrotten and if we change it significantly we better have a well
   documented starting point that is easier to compare against.

 - then a patch that introduces your more accurate calibration method and
   uses it as the first method to calibrate. If it fails (and it should have a 
   notion of failing) then it should fall back to the other two methods.

 - possibly add a boot option to skip your new calibration method - i.e. to make 
   the kernel behave in the old way. This would be useful for tracking down any 
   regressions in this.

 - then maybe add a patch for the RTC method, but as a .config driven opt-in 
   initially.

Please also add calibration tracing code (.config driven and default-off), so that 
the statistical properties of calibration can be debugged and validated without 
patching the kernel.

> I realize this is a far bigger overhaul than Adrian proposed, but do other 
> people agree that some decruftification is warranted?

Absolutely!

> I'd also like to cut the time required to do the calibration.  Being able to 
> compute the quality of the approximation so far is a huge help.

Agreed. Btw., I'd suggest to lower the ppm target in your new code, to make sure 
we use high quality results and fall back to the current (bitrotten but working) 
code.

> Any suggestions for a reasonable time/quality tradeoff?  500 ppm ASAP?  Best I 
> can do in 10 ms?  Wait until the PIT is 500 ppm and then use the better result 
> from a higher-resolution timer if available?

So I'd suggest a minimum polling interval (at least 1 msecs?) plus a ppm target. 
Would 100ppm be too aggressive?

> It's some pretty low-level mucking about, but is a nice simple single-threaded 
> single-processor piece of code.

:-)

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/