Message-ID: <20141227203343.GA16318@linux.vnet.ibm.com>
Date:	Sat, 27 Dec 2014 12:33:43 -0800
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	John Stultz <john.stultz@...aro.org>,
	Dave Jones <davej@...emonkey.org.uk>,
	Thomas Gleixner <tglx@...utronix.de>, Chris Mason <clm@...com>,
	Mike Galbraith <umgwanakikbuti@...il.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Dâniel Fraga <fragabr@...il.com>,
	Sasha Levin <sasha.levin@...cle.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Suresh Siddha <sbsiddha@...il.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Peter Anvin <hpa@...ux.intel.com>
Subject: Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote:
> On Mon, Dec 22, 2014 at 3:59 PM, John Stultz <john.stultz@...aro.org> wrote:
> >
> > * So 1/8th of the interval seems way too short, as there are
> > clocksources like the ACPI PM, which wrap every 2.5 seconds or so.
> 
> Ugh. At the same time, 1/8th of a range is actually bigger than I'd
> like, since if there is some timer corruption, it means that we only
> catch it when it's really big.
> 
> But as I said, I'd actually prefer it to be time-based, because it
> would be good if this approach worked on things like the TSC which is
> a 64-bit counter..
> 
> So yes, that capping was very much arbitrary, and was mostly a case of
> "this works with the one timer source that I can easily trigger"
> 
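For scale, the wrap-time arithmetic behind those two positions, as plain
user-space C (assuming the ACPI PM timer's 24-bit counter at 3.579545 MHz;
the numbers are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Wrap time in seconds of a counter 'bits' wide ticking at 'freq' Hz. */
static double wrap_seconds(unsigned int bits, double freq)
{
        return (double)(1ULL << bits) / freq;
}

int main(void)
{
        double wrap = wrap_seconds(24, 3579545.0);  /* ACPI PM timer */

        /* prints roughly: full wrap 4.69s, 1/8th 0.59s */
        printf("full wrap %.2fs, 1/8th %.2fs\n", wrap, wrap / 8.0);
        return 0;
}
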
> > * I suspect something closer to the clocksource_max_deferment() value
> > (which I think is the max interval before multiplication overflows could
> > happen - ~12%) which we use in the scheduler would make more sense.
> > Especially since the timer scheduler uses that to calculate how long
> > we can idle for.
> 
> I'd rather not be anywhere *close* to any overflow problems. Even for
> the scheduler all-idle case, I'd argue that there is rather quickly
> diminishing returns. Yes, a thousand timer interrupts per second are
> expensive and a noticeable power draw. The difference between "one
> timer interrupt every two seconds" and "every 20 seconds" is rather
> less noticeable.
> 
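For reference, the overflow bound in question has roughly this shape (a
simplified sketch of what clocksource_max_deferment() computes, ignoring
the shift; the real function is more careful):

#include <stdint.h>

/*
 * Simplified sketch: delta_ns is computed as (cycles * mult) >> shift
 * in 64 bits, so keep cycles below 2^63 / mult (one bit of headroom),
 * never past the counter mask, and knock off 12.5% as a safety margin.
 */
static uint64_t max_safe_cycles(uint32_t mult, uint64_t mask)
{
        uint64_t max = ((uint64_t)1 << 63) / mult;

        if (max > mask)
                max = mask;
        return max - (max >> 3);        /* the ~12% mentioned above */
}
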
> Of course, reasonable clock sources have *much* longer periods than a
> second (yeah, the acpi pm timer really isn't a good one), so there are
> probably good middle grounds. The 1/8th was a hack, and one that was
> aware of the 300s cycle of the HPET at that..

I of course very much like the idea of the timekeeping system doing a
bit of self-checking.  ;-)

Can we simplify things by just not doing self-checking on clocks that
overflow in less than a few minutes?  In other words, if someone
reports oddball RCU CPU stall warnings when using the ACPI PM
timer, am I within my rights to tell them to reproduce the stall
using a better timer?

If so, we could possibly get away with the assumption that preemptions
don't last longer than the soft-lockup interval, which is currently
a bit over 20 seconds (hence "a few minutes" above).  And yes, you
might avoid a soft lockup by having a series of 15-second preemptions
between each pair of instructions, but my response would be to increase
my "a few minutes" a bit and to invoke probabilities.

I don't claim to understand the timer code for all the reasons that
Linus calls out below, but I believe that this simplifying
assumption would in turn simplify the self-check code.

						Thanx, Paul

> > * Nulling out delta in timekeeping_get_ns() seems like it could cause
> > problems since time would then possibly go backwards compared to
> > previous reads (as you mentioned, resulting in smaller time jumps).
> > Instead it would probably make more sense to cap the delta at the
> > maximum value (though this assumes the clock doesn't jump back in the
> > interval before the next call to update_wall_time).
> 
> So part of the nulling was that it was simpler, and part of it was
> that I expected to get backwards jumps (see the other email to Dave
> about the inherent races). And with the whole timer mask modulo
> arithmetic, those backwards jumps just look like biggish positive
> numbers, not even negative. So it ends up being things like "is it an
> unsigned number larger than half the mask? Consider it negative" etc.
> 
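Written out, that test looks about like this (a standalone sketch, not
the kernel code):

#include <stdint.h>

/*
 * With mask-modulo arithmetic a small backwards step shows up as a
 * huge positive delta, so anything above half the mask is treated as
 * a (small) negative excursion instead.
 */
static int64_t masked_delta(uint64_t now, uint64_t last, uint64_t mask)
{
        uint64_t delta = (now - last) & mask;

        if (delta > (mask >> 1))        /* "negative" wrap */
                return -(int64_t)((mask + 1 - delta) & mask);
        return (int64_t)delta;
}
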
> The "zero it out" was simple, and it worked for my test-case, which
> was "ok, my machine no longer locks up when I mess with the timer".
> 
> And I didn't post the earlier versions of that patch that didn't even *boot*.
> 
> I started out trying to do it at a higher level (not on a clock read
> level, but outside the whole 'convert-to-ns and do the sequence value
> check'), but during bootup we play a lot of games with initializing
> the timer sources etc.
> 
> So that explains the approach of doing it at that
> 
>    cycle_now = tkr->read(tkr->clock);
> 
> level, and keeping it very low-level.
> 
> But as I already explained in the email that crossed, that low-level
> thing also results in some fundamental races.
> 
> > * Also, as you note, this would just cause the big time jump to only
> > happen at the next update, since there's no logic in
> > update_wall_time() to limit the jump. I'm not sure if "believing" the
> > large jump at write time makes that much more sense, though.
> 
> So I considered just capping it there (to a single interval or
> something). Again, just ignoring - like the read side does - it would
> have been easier, but at the same time I *really* wanted to make time
> go forward, so just taking the big value seemed safest.
> 
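The capping variant would be about this much code (a sketch;
'max_cycles' stands in for a clocksource_max_deferment()-style bound):

#include <stdint.h>

/* Believe forward motion, but bound how far a single update can jump:
 * clamp the accumulated offset rather than discarding it. */
static uint64_t cap_offset(uint64_t offset, uint64_t max_cycles)
{
        return offset > max_cycles ? max_cycles : offset;
}
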
> But yes, this was very much an RFC patch. It's not even ready for real
> use, as DaveJ found out (although it might be good enough in practice,
> despite its flaws).
> 
> > * Finally, you're writing to error while only holding a read lock, but
> > that's sort of a minor thing.
> 
> It's not a minor thing, but the alternatives looked worse.
> 
> I really wanted to make it per-cpu, and do this with interrupts
> disabled or something. But that then pushes a big problem to the write
> time to go over all CPUs and see if there are errors.
> 
> So it's not right. But .. It's a hacky patch to get discussion
> started, and it's actually hard to do "right" when this code has to be
> basically lockless.
> 
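The per-cpu shape would be roughly the following, modeled here with a
plain array instead of the kernel's percpu machinery, and glossing over
the memory ordering the real thing would need:

#include <stdint.h>

#define NR_CPUS 64

static uint32_t cs_error[NR_CPUS];      /* one error word per CPU */

/* Read side: record locally; no shared cacheline, no lock. */
static void note_clock_error(int cpu, uint32_t err)
{
        cs_error[cpu] |= err;
}

/* Write side: the cost moves here -- sweep every CPU's word. */
static uint32_t collect_clock_errors(int nr_cpus)
{
        uint32_t sum = 0;

        for (int cpu = 0; cpu < nr_cpus; cpu++)
                sum |= cs_error[cpu];
        return sum;
}
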
> > * Checking the accumulation interval isn't beyond the
> > clocksource_max_deferment() value seems like a very good check to have
> > in update_wall_time().
> 
> Sounds like a good idea. Also, quite frankly, reading all the code I
> wasn't ever really able to figure out that things don't overflow. The
> overflow protection is a bit ad-hoc (that maxshift thing in
> update_wall_time() really makes baby Jesus cry, despite the season,
> and it wasn't at all obvious that ntp_tick_length() is fundamentally
> bigger than xtime_interval, for example).
> 
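The check itself is a single comparison (a sketch; 'offset' and
'max_cycles' stand in for the real values in update_wall_time()):

#include <stdint.h>

/* Warn if update_wall_time() is asked to accumulate more than the
 * clocksource can safely defer; the caller would WARN on false. */
static int accumulation_interval_sane(uint64_t offset, uint64_t max_cycles)
{
        return offset <= max_cycles;
}
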
> It's also not clear that the complicated and frankly not-very-obvious
> shift-loop is any faster than just using a divide - possibly with the
> "single interval" case being a special case to avoid dividing then.
> 
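For comparison, the two approaches side by side (a sketch; 'interval'
is cycles per tick, and overflow care is mostly omitted):

#include <stdint.h>

/* The logarithmic shift-loop: subtract the largest power-of-two
 * multiple of 'interval' that fits, then halve the chunk size. */
static uint64_t ticks_by_shift(uint64_t offset, uint64_t interval)
{
        uint64_t ticks = 0;
        int shift = 0;

        while (shift < 63 && offset >= interval << shift)
                shift++;
        for (shift--; shift >= 0; shift--) {
                if (offset >= interval << shift) {
                        offset -= interval << shift;
                        ticks += (uint64_t)1 << shift;
                }
        }
        return ticks;
}

/* The divide, with the common zero-or-one-interval case special-cased. */
static uint64_t ticks_by_divide(uint64_t offset, uint64_t interval)
{
        if (offset < 2 * interval)
                return offset >= interval;
        return offset / interval;
}
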
> I was a bit nervous that the whole update of tkr.cycle_last in there
> could just overrun the actual *read* value of 'tk->tkr.clock'. With
> the whole offset logic split between update_wall_time() and
> logarithmic_accumulation(), the code isn't exactly self-explanatory.
> 
> Heh.
> 
> > * Maybe when we schedule the next timekeeping update, the tick
> > scheduler could store the expected time for that to fire, and then we
> > could validate that we're relatively close after that value when we do
> > accumulate time (warning if we're running too early or far too late -
> > though with virtualization, defining a "reasonable" late value is
> > difficult).
> 
> In general, it would be really nice to know what the expected limits
> are. It was hard to impossible to figure out the interaction between
> the timer subsystem and the scheduler tick. It's pretty incestuous,
> and if there's an explanation for it, I missed it.
> 
> > * This "expected next tick" time could be used to try to cap read-time
> > intervals in a similar fashion as done here. (Of course, again, we'd
> > have to be careful, since if that expected next tick ends up somehow
> > being before the actual hrtimer expiration value, we could end up
> > stopping time - and the system).
> 
> I don't think you can cap them to exactly the expected value anyway,
> since the wall time update *will* get delayed by locking and just
> interrupts being off etc. And virtual environments will obviously make
> it much worse. So the capping needs to be somewhat loose anyway.
> 
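A loose version of that check might look like the following (a sketch;
every name is made up):

#include <stdint.h>

/* Compare when the update actually ran against when the tick was
 * expected, with generous slack for locking, irqs-off windows, and
 * virtualization.  Returns <0 early, >0 far too late, 0 otherwise. */
static int check_update_timing(uint64_t now_ns, uint64_t expected_ns,
                               uint64_t slack_ns)
{
        if (now_ns + slack_ns < expected_ns)
                return -1;      /* fired before it plausibly should */
        if (now_ns > expected_ns + slack_ns)
                return 1;       /* lost ticks, long preemption, VM pause */
        return 0;
}
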
> The patch I posted was actually sloppy by design, exactly because I
> had so much trouble with trying to be strict. My first patch was a
> percpu thing that just limited ktime_get() from ever going backwards
> on that particular cpu (really simple, real;ly stupid), and it got
> *nowhere*.
> 
>                         Linus
> 

