Date:   Wed, 14 Dec 2016 21:59:37 +0100 (CET)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Roland Scheidegger <rscheidegger_lists@...peed.ch>
cc:     LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Borislav Petkov <bp@...en8.de>,
        Bruce Schlobohm <bruce.schlobohm@...el.com>,
        Kevin Stanton <kevin.b.stanton@...el.com>,
        Allen Hung <allen_hung@...l.com>
Subject: Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent
 TSC deadline timer irq storm

On Wed, 14 Dec 2016, Thomas Gleixner wrote:

> On Wed, 14 Dec 2016, Roland Scheidegger wrote:
> > Am 13.12.2016 um 17:46 schrieb Thomas Gleixner:
> > > What are the adjust values after a warm boot?
> >
> > So, after cold boot with a kernel which doesn't adjust TSCs, then warm
> > boot I got:
> > [    0.000000] TSC ADJUST: CPU0: -602358264300 176072418728
> > [    0.000000] TSC ADJUST: Boot CPU0: -602358264300
> > [    0.172245] TSC ADJUST: CPU1: -602360207584 176587932558
> > [    0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1:
> > -602360207584
> > [    0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU1: -602360207584
> > [    0.252663] TSC ADJUST: CPU2: -602359000822 176828627154
> > [    0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2:
> > -602359000822
> > [    0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU2: -602359000822
> > [    0.337014] TSC ADJUST: CPU3: -602360177680 177081093132
> > [    0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3:
> > -602360177680
> > [    0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300
> > CPU3: -602360177680
> > 
> > and so on.
> > 
> > Albeit after another reboot (some minutes later), it actually straight
> > locked up again:
> > 
> > TSC ADJUST: CPU1: -8257481427958 165112676430
> > TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958
> > TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958
> > TSC target sync skip
> > ...
> > smpboot: Target CPU is online
> > 
> > So, actually I thought the TSC would get reset too on warm boot, but
> > clearly looks like that isn't the case...
> > But I don't know what's the difference between first and second reboot -
> > the adjust values have just more magnitude, but otherwise even the
> > direction of the adjustments and everything looks all the same (just
> > like cold boot, which also looks all the same to me).
> 
> I haven't found a pattern for the lockups yet and we have to wait for Intel
> to provide useful information about that issue. All we know so far is that
> negative adjust values are dangerous.

Did some further investigation. The values which cause the interrupt storms
sit at clearly identifiable points which reproduce reliably:

Positive space, results in timer not firing anymore - at least not in a
time frame you are willing to wait for.

     0x0000 0000 8000 0000

Negative space, results in an interrupt storm.

     0xffff ffff 0000 0000
     0xffff fffe 0000 0000
     0xffff fffd 0000 0000
     0xffff fffc 0000 0000
     0xffff fffb 0000 0000
     ....

These points are independent of the underlying counter value (cold boot,
warm boot) and reproduce reliably even after hours of power on.
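
As a rough illustration of the points listed above, here is a minimal C
sketch that classifies a TSC_ADJUST value. The function names are
hypothetical and this is not kernel code. It also assumes that the "...."
in the list continues the same pattern, i.e. that every negative value
whose low 32 bits are zero belongs to the storm set; only the five listed
values are actually reported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the low 32 bits of v are all zero. */
static bool low32_is_zero(uint64_t v)
{
	return (v & 0xffffffffULL) == 0;
}

/* Reported positive point 0x0000 0000 8000 0000: timer stops firing. */
bool adjust_is_reported_stall_point(int64_t adjust)
{
	return adjust == INT64_C(0x80000000);
}

/*
 * Reported negative points 0xffff ffff 0000 0000, 0xffff fffe 0000 0000, ...
 * ASSUMPTION: the pattern holds for every negative multiple of 2^32.
 */
bool adjust_is_reported_storm_point(int64_t adjust)
{
	return adjust < 0 && low32_is_zero((uint64_t)adjust);
}

/* Small self-check using values from this thread. */
void demo_reported_points(void)
{
	assert(adjust_is_reported_stall_point(INT64_C(0x80000000)));
	assert(adjust_is_reported_storm_point(-(INT64_C(1) << 32)));
	/* Warm boot value from the quoted log: not on a reported point. */
	assert(!adjust_is_reported_storm_point(INT64_C(-602358264300)));
}
```

Note that the quoted warm boot values do not land on any of these points,
so this check alone would not explain the lockup reported there.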

And looking at the values makes me wonder about 32bit vs. 64bit wreckage
combined with sign extension done wrong. I'm really impressed!
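
To make the suspicion concrete, the following C sketch shows the general
class of bug: a value that is harmless as an unsigned 32bit quantity turns
into a negative 64bit number when it is widened with sign extension. The
function names are hypothetical, and nothing here is known about what the
hardware actually does; it only shows why 0x8000 0000 and the 2^32
boundaries are the natural places for such a bug to show up.

```c
#include <assert.h>
#include <stdint.h>

/* Widen a 32bit value by copying bit 31 into the upper 32 bits. */
uint64_t widen_sign_extend(uint32_t low)
{
	if (low & 0x80000000u)
		return 0xffffffff00000000ULL | low;
	return low;
}

/* Widen a 32bit value by filling the upper 32 bits with zeros. */
uint64_t widen_zero_extend(uint32_t low)
{
	return (uint64_t)low;
}

/* Small self-check: the two widenings first disagree at 0x80000000. */
void demo_widening(void)
{
	assert(widen_zero_extend(0x7fffffffu) == widen_sign_extend(0x7fffffffu));
	assert(widen_zero_extend(0x80000000u) == 0x0000000080000000ULL);
	assert(widen_sign_extend(0x80000000u) == 0xffffffff80000000ULL);
}
```

If one side of a comparison uses one widening and the other side uses the
other, the results differ by almost 2^64 as soon as bit 31 is set, which
would fit a timer that either never fires or fires all the time.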

In the negative space there is something else going on which is dependent
on the counter value. Right after cold boot the space is closer to zero
than after hours of power on.

So the approach of forbidding negative values is definitely not wrong.

Thanks,

	tglx
