linux-kernel - Re: [BUG REPORT] ktime_get


Message-ID: <alpine.DEB.2.11.1601201350230.3575@nanos>
Date:	Wed, 20 Jan 2016 15:26:58 +0100 (CET)
From:	Thomas Gleixner <tglx@...utronix.de>
To:	Jeff Merkey <linux.mdb@...il.com>
cc:	LKML <linux-kernel@...r.kernel.org>,
	John Stultz <john.stultz@...aro.org>
Subject: Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

Jeff,

On Wed, 20 Jan 2016, Thomas Gleixner wrote:
> On Tue, 19 Jan 2016, Jeff Merkey wrote:
> > Nasty bug but trivial fix for this.  What happens here is RAX (nsecs)
> > gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
> 
> And how exactly does that happen?
> 
> 0x17AE7F57C671EA7D = 1.70644e+18  nsec
> 		   = 1.70644e+09  sec
> 		   = 2.84407e+07  min
> 		   = 474011	  hrs
> 		   = 19750.5	  days
> 		   = 54.1109	  years
> 
> That's the real issue, not what you are trying to 'fix' in timespec_add_ns()

And that's caused by stopping the whole machine for 20 minutes. It violates
the assumption of the timekeeping core that the maximum time between two
updates of the core is < 5-10min. So that insanely large number is caused by a
multiplication overflow when converting the time delta to nanoseconds.
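
To illustrate the overflow described above, here is a minimal sketch of a
mult/shift style cycles-to-nanoseconds conversion with the product truncated
to 64 bits. MULT and SHIFT are hypothetical values picked for illustration,
not the ones from the affected machine, and the helper names are made up; this
is not the kernel code and it does not try to reproduce the exact RAX value.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit arithmetic

# Hypothetical scaling factors (assumption, not taken from the thread).
MULT = 7_300_000
SHIFT = 24


def cyc2ns_wrapping(cycles, mult=MULT, shift=SHIFT):
    """Convert cycles to ns, truncating the product to 64 bits first."""
    return ((cycles * mult) & MASK64) >> shift


def cyc2ns_exact(cycles, mult=MULT, shift=SHIFT):
    """Same conversion with arbitrary precision, for comparison."""
    return (cycles * mult) >> shift


# Largest cycle delta whose product with MULT still fits in 64 bits.
max_safe_cycles = MASK64 // MULT

short_delta = max_safe_cycles // 2  # update arrived in time
long_delta = max_safe_cycles * 3    # machine was stopped far too long

for name, delta in (("short", short_delta), ("long", long_delta)):
    print(name, cyc2ns_wrapping(delta), cyc2ns_exact(delta))
```

For the short delta both functions agree. For the long delta the truncated
product yields a nanosecond value unrelated to the real elapsed time.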

You can find that limit via:

# dmesg | grep tsc | grep max_idle_ns
[    5.242683] clocksource tsc: mask: 0xffffffffffffffff max_cycles: 0x21139a22526, max_idle_ns: 440795252169 ns

So on that machine the limit is:

   440795252169 nsec
   440.795	sec
   7.34659	min
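
The conversion above can be checked with a few lines of Python. The
max_idle_ns value is the one from the dmesg line, the 20 minute stop is the
one from this report, and the helper name ns_to_units is made up for this
sketch.

```python
MAX_IDLE_NS = 440_795_252_169  # max_idle_ns from the dmesg line
STOP_SECONDS = 20 * 60         # the machine was stopped for 20 minutes


def ns_to_units(ns):
    """Return the duration in seconds and minutes."""
    sec = ns / 1e9
    return {"sec": sec, "min": sec / 60}


limit = ns_to_units(MAX_IDLE_NS)
exceeded = STOP_SECONDS > limit["sec"]

print(f"limit: {limit['sec']:.3f} sec = {limit['min']:.5f} min")
print(f"20 min stop exceeds the limit: {exceeded}")
```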

And before you ask or come up with patches: No, we are not going to add
anything to the core timekeeping code to work around this limitation, simply
because it's going to add overhead to a performance sensitive code path for
very limited value.

Keeping a machine stopped for 20 minutes will make a lot of other things
unhappy, so introducing a 'fix' for that particular issue is just silly.

Thanks,

	tglx