lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAO6TR8Wz9RhHOZ8doZWG4Qgfj_ZfW8X9Ldjtfg_3yGw_kbJ=6A@mail.gmail.com>
Date:	Thu, 21 Jan 2016 00:09:52 -0700
From:	Jeff Merkey <linux.mdb@...il.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	John Stultz <john.stultz@...aro.org>
Subject: Re: [BUG REPORT] ktime_get_ts64 causes Hard Lockup

Ok, here's what I found after several hours of debugging and reviewing
this subsystem:

This subsystem plays is pretty loose in doing its math on 64 bit
registers.  I traced through ktime_get_ts64 hundreds of times and
sampled data running through it and from what I saw, just normal
operations comes dangerously close to causing the RAX register to
wrap.   If the delta gets too big it does wrap and I observed it
happening with the debugger tracing through the code.  It wraps
because of a sar instruction generated from the inline macros.

The wrap happens in this inline function.

static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
{
	cycle_t delta;
	s64 nsec;

	delta = timekeeping_get_delta(tkr);

	nsec = delta * tkr->mult + tkr->xtime_nsec;
	nsec >>= tkr->shift;    << wrap caused here

	/* If arch requires, add in get_arch_timeoffset() */
	return nsec + arch_gettimeoffset();
}

You only have 64 bits of register and the numbers being calculated
here are big.   By way of example, I observed the following during
normal operations:

delta  (RAX)   |     tkr->mult (RDX)

0x157876            0x65ee27
0xf1855               0x65f158
0x16cf05             0x65f408
303bc3                0x65f154

When this bug occurs different story.

delta  (RAX)    |     tkr->mult (RDX)

0x243283994b8     0x65233

So it goes like this:

	nsec = delta * tkr->mult + tkr->xtime_nsec;
        0x243283994b8 * 0x65233
        imul   rax,rdx = 0xE6A2Ce1f1ea690a8

	nsec >>= tkr->shift;    << wrap caused here
        sar    rax,cl  =  0xFFFFFFE6BFB3B7C3

the sar instruction doesn't just shift, it backfills the signedness of
the value, so this instruction is not doing what the C code is asking
it to do.  I am guessing that somewhere in this mass of macros,
something may have gotten declared wrong or incomplete (declared
signed ?).

The assembler output for this section that calls the macro to
calculate nsecs shows the sar instruction:

	delta = timekeeping_get_delta(tkr);

	nsec = delta * tkr->mult + tkr->xtime_nsec;
     29b:	48 0f af c2          	imul   %rdx,%rax
     29f:	48 03 05 00 00 00 00 	add    0x0(%rip),%rax        # 2a6
<ktime_get_ts64+0xc6>
	nsec >>= tkr->shift;
     2a6:	48 d3 f8             	sar    %cl,%rax


There is another problem with the tkr->read returning an unchanging,
unclearable number when this bug occurs for the delta value.  I
appears for whatever reason the clock has gone to sleep or gone away
and is no longer updating its counters.

static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
{
	cycle_t cycle_now, delta;

	/* read clocksource */
	cycle_now = tkr->read(tkr->clock); << returns the same value after
this bug happens

	/* calculate the delta since the last update_wall_time */
	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask); <<
cycle last is also the same value.

	return delta;
}

This problem appears to have several things happening at once.
Probably the most concerning is that the assembler output is making
some assumptions about the SIGNEDNESS of the values being shifted and
using sar instead of shl instructions.

I am also concerned about the thr->read function returning an
unchanging value when this problem shows up.

This subsystem plays it fast and loose with its math, and if the clock
gets delayed or out of sync, it will wrap in the above function and it
will trigger the Hard Lockup detector if the value is large enough in
RAX.  The sanity check for CONFIG_DEBUG_TIMEKEEPER does not catch the
code path where this delta value gets set because the function to
update the delta is called in more then just in that function that
checks for an overflow and the wrap case happens underneath it.

I would check how these structs are defined and the vars in them to
see if somewhere they are declared as signed values to the compiler,
because that's what it thinks it was given to compile.

I am still debugging the thr->read issue.  I have determined the cause
of the wrap in the assembler.  As to why the gcc compiler is outputing
this instruction here is something to be determined.

Jeff

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ