Message-Id: <20170421145756.305735607@infradead.org>
Date:   Fri, 21 Apr 2017 16:57:56 +0200
From:   Peter Zijlstra <peterz@...radead.org>
To:     tglx@...utronix.de, mingo@...nel.org
Cc:     linux-kernel@...r.kernel.org, ville.syrjala@...ux.intel.com,
        daniel.lezcano@...aro.org, rafael.j.wysocki@...el.com,
        marta.lofstedt@...el.com, martin.peres@...ux.intel.com,
        pasha.tatashin@...cle.com, peterz@...radead.org,
        daniel.vetter@...ll.ch
Subject: [PATCH 0/9] sched_clock fixes

Hi,

These patches were inspired by (and hopefully fix) two independent bug reports on
Core2 machines.

I never could quite reproduce one, but my Core2 machine no longer switches to
stable sched_clock and therefore no longer tickles the problematic stable ->
unstable transition either.

Before I dug up my Core2 machine, I tried emulating TSC wreckage by poking
random values into the TSC MSR from userspace. Behaviour in that case is
improved as well.

People have to realize that if we manage to boot with TSC 'stable' (both
sched_clock and clocksource) and we later find out we were mistaken (we observe
a TSC wobble) the clocks that are derived from it _will_ have had an observable
hiccup. This is fundamentally unfixable.

If you own a machine where the BIOS tries to hide SMI latencies by rewinding
TSC (yes, this is a thing), the very best we can do is mark TSC unstable with a
boot parameter.

For example, this is me writing a stupid value into the TSC:

[   46.745082] random: crng init done
[18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[18443029775.023141] clocksource:                       'hpet' wd_now: 3ebec538 wd_last: 3e486ec9 mask: ffffffff
[18443029775.034214] clocksource:                       'tsc' cs_now: 5025acce9 cs_last: 24dc3bd21c88ee mask: ffffffffffffffff
[18443029775.046651] tsc: Marking TSC unstable due to clocksource watchdog
[18443029775.054211] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[18443029775.064434] sched_clock: Marking unstable (70569005835, -17833788)<-(-3714295689546517, -2965802361)
[   70.573700] clocksource: Switched to clocksource hpet

With some trace_printk()s (not included) I could tell that the wobble
occurred at 69.965474. The clock now resumes where it 'should' have been.

But an unfortunate scheduling event could have resulted in one task
having seen a runtime of ~584 years with 'obvious' effects. Similar
jumps can also be observed from userspace GTOD usage.

