linux-kernel - Re: [BUG REPORT] ktime_get

Message-ID: <CAO6TR8UyQwGrYe1KhER7z-=ALTQJXUg4ZiJDAUX-NyhQDOYHOQ@mail.gmail.com>
Date:	Wed, 20 Jan 2016 09:40:07 -0700
From:	Jeff Merkey <linux.mdb@...il.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	John Stultz <john.stultz@...aro.org>
Subject: Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

On 1/20/16, Thomas Gleixner <tglx@...utronix.de> wrote:
> Jeff,
>
> On Wed, 20 Jan 2016, Thomas Gleixner wrote:
>> On Tue, 19 Jan 2016, Jeff Merkey wrote:
>> > Nasty bug but trivial fix for this.  What happens here is RAX (nsecs)
>> > gets set to a huge value (RAX = 0x17AE7F57C671EA7D) and passed through
>>
>> And how exactly does that happen?
>>
>> 0x17AE7F57C671EA7D = 1.70644e+18  nsec
>> 		   = 1.70644e+09  sec
>> 		   = 2.84407e+07  min
>> 		   = 474011	  hrs
>> 		   = 19750.5	  days
>> 		   = 54.1109	  years
>>
>> That's the real issue, not what you are trying to 'fix' in
>> timespec_add_ns()
>
> And that's caused by stopping the whole machine for 20 minutes. It violates
> the assumption of the timekeeping core, that the maximum time which is
> between two updates of the core is < 5-10min. So that insane large number
> is caused by a mult overrun when converting the time delta to nanoseconds.
>
> You can find that limit via:
>
> # dmesg | grep tsc | grep max_idle_ns
> [    5.242683] clocksource tsc: mask: 0xffffffffffffffff max_cycles: 0x21139a22526, max_idle_ns: 440795252169 ns
>
> So on that machine the limit is:
>
>    440795252169 nsec
>    440.795	sec
>    7.34659	min
>
> And before you ask or come up with patches: No, we are not going to add
> anything to the core timekeeping code to work around this limitation simply
> because its going to add overhead to a performance sensitive code path for
> a very limited value.

Given how fragile that code appears to be, this is reasonable.
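
To make the "mult overrun" concrete, here is a small userspace sketch of a
cycles-to-nanoseconds conversion done with a plain 64-bit multiply and shift.
It is not the kernel's code. DEMO_MULT, DEMO_SHIFT and the 2.3 GHz counter
rate used with it are made-up values chosen only for illustration, and all
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameters for a ~2.3 GHz counter; not taken from the thread. */
#define DEMO_MULT  7294442u  /* ns per cycle, scaled by 2^DEMO_SHIFT */
#define DEMO_SHIFT 24u

/* Plain 64-bit multiply and shift: wraps silently once cycles * mult >= 2^64. */
static uint64_t cyc2ns_wrapping(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    assert(shift < 64);
    return (cycles * mult) >> shift;
}

/*
 * Reference conversion: split cycles at the shift boundary so that the low
 * partial product cannot overflow. Exact as long as the true result fits
 * in 64 bits.
 */
static uint64_t cyc2ns_split(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    assert(shift > 0 && shift <= 32);
    uint64_t hi = cycles >> shift;
    uint64_t lo = cycles & ((UINT64_C(1) << shift) - 1);
    return hi * mult + ((lo * mult) >> shift);
}

/* True if cycles * mult does not fit in 64 bits. */
static bool mult_overflows(uint64_t cycles, uint32_t mult)
{
    return mult != 0 && cycles > UINT64_MAX / mult;
}

/* Number of counter cycles elapsed in the given number of seconds at hz. */
static uint64_t cycles_for_seconds(uint64_t seconds, uint64_t hz)
{
    return seconds * hz;
}
```

With these assumed values, a 5 minute delta converts identically through both
functions, while a 20 minute delta overflows the multiply and
cyc2ns_wrapping() returns a value unrelated to the real elapsed time.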

>
> Keeping a machine stopped for 20 minutes will make a lot of other things
> unhappy, so introducing a 'fix' for that particular issue is just silly.
>

You know, what's needed here is some form of touch function to keep this
system updated while spinning in the debugger.  That would solve it.
I can maintain a fix for that locally.  I debugged the soft hang in systemd
last night and discovered that it's all related to this function returning
bogus time (systemd was doing a system call that eventually made its way to
ktime_get_ts64 and got garbage back).  When this wraps, it causes all sorts
of bad stuff.

Do you have any suggestions on how a touch function could be coded to keep
this subsystem updated while the debugger is active?  There are already a
few of them that I have to call, as do kgdb and kdb, to get around some of
this.

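/* Reset the watchdogs that would otherwise fire while the debugger holds the CPUs. */
void mdb_watchdogs(void)
{
    /* soft lockup detector */
    touch_softlockup_watchdog_sync();
    /* clocksource watchdog */
    clocksource_touch_watchdog();

#if defined(CONFIG_TREE_RCU)
    /* RCU CPU stall warnings */
    rcu_cpu_stall_reset();
#endif

    /* NMI-based hard lockup detector */
    touch_nmi_watchdog();
#ifdef CONFIG_HARDLOCKUP_DETECTOR
    touch_hardlockup_watchdog();
#endif
    return;
}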

As you can see, there are already quite a few subsystems that manage this
problem of debuggers holding the system in stasis.
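
To illustrate what I mean by a touch function, here is a toy userspace model,
not kernel code. struct toy_clock and both functions are hypothetical names,
and the mult/shift values used with it are assumptions. The idea is only that
folding the elapsed cycles into the accumulated time often enough keeps each
delta small, so the multiply never overflows.

```c
#include <assert.h>
#include <stdint.h>

/* Toy userspace model only; none of these names exist in the kernel. */
struct toy_clock {
    uint64_t cycle_last;  /* counter value at the last update */
    uint64_t base_ns;     /* nanoseconds accumulated up to cycle_last */
    uint32_t mult;        /* ns per cycle, scaled by 2^shift */
    uint32_t shift;
};

/*
 * Fold the cycles elapsed since the last update into base_ns.
 * Correct only while the delta stays below UINT64_MAX / mult.
 */
static void toy_clock_touch(struct toy_clock *tc, uint64_t now_cycles)
{
    assert(tc && tc->shift < 64);
    uint64_t delta = now_cycles - tc->cycle_last;

    tc->base_ns += (delta * tc->mult) >> tc->shift;
    tc->cycle_last = now_cycles;
}

/* Read the clock at now_cycles without changing any state. */
static uint64_t toy_clock_read(const struct toy_clock *tc, uint64_t now_cycles)
{
    assert(tc && tc->shift < 64);
    uint64_t delta = now_cycles - tc->cycle_last;

    return tc->base_ns + ((delta * tc->mult) >> tc->shift);
}
```

In this model, calling toy_clock_touch() once a minute during a 20 minute
stop keeps toy_clock_read() close to the real elapsed time, while a clock
that is never touched over the same interval returns a wrapped value.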

Jeff

> Thanks,
>
> 	tglx
>

Well, that explains it.