linux-kernel - Re: perfevents: irq loop stuck!

Message-ID: <alpine.DEB.2.10.1405160017120.1351@vincent-weaver-1.umelst.maine.edu>
Date:	Fri, 16 May 2014 00:25:28 -0400 (EDT)
From:	Vince Weaver <vincent.weaver@...ne.edu>
To:	Peter Zijlstra <peterz@...radead.org>
cc:	Vince Weaver <vincent.weaver@...ne.edu>,
	linux-kernel@...r.kernel.org, Paul Mackerras <paulus@...ba.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: perfevents: irq loop stuck!

On Thu, 15 May 2014, Peter Zijlstra wrote:

> > So, not sure how to fix this without a total re-write, unless we want to 
> > cheat and just say sample_period is capped at 62-bits or something.
> 
> 63 bits should do I think, but yes, we hit a very similar bug a few days
> ago in the sched_deadline code.
> 
> I'm fine with capping it, allowing the full 64bit would mean we need 65
> bits (effectively 96 or 128 bit of course) math to make it all work
> which would be tedious and give no real gain.

Yes, it looks like 63 bits will be fine although I had to think about some 
of the logic there to make sure there's no signed overflow.
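
To illustrate the 63-bit cap discussed above, here is a minimal user-space
sketch, not the actual kernel patch. It assumes the period is later held in
a signed 64-bit variable, so any value with bit 63 set would look negative.
The helper names sample_period_is_valid() and period_as_signed() are
hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: accept a sample period only if it fits in 63 bits,
 * i.e. the top bit is clear.
 */
static bool sample_period_is_valid(uint64_t sample_period)
{
	return (sample_period & (UINT64_C(1) << 63)) == 0;
}

/*
 * Hypothetical helper: show the value as seen through a signed 64-bit
 * variable. Converting an out-of-range value is implementation-defined
 * in C; on the usual two's-complement compilers it becomes negative.
 */
static int64_t period_as_signed(uint64_t sample_period)
{
	return (int64_t)sample_period;
}
```

With a check like this in place, every accepted period stays non-negative
when stored as a signed value, so there is no need for wider-than-64-bit
arithmetic.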

> We'll then not make progress for a while, print the msg, get throttled,
> goto 1. This is possible if we're allowed 100+ interrupts per jiffy, so
> if you adjust /proc/sys/kernel/perf_event_max_sample_rate to below that
> and it doesn't trigger anymore we know the throttle works.

The odd thing is that even when I try to write a small reproducer I can 
easily get the overflow of period to 2 to happen, but it doesn't trigger 
the message.  It will on occasion give a throttle message but that is all.
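
For the throttle test suggested in the quoted text, the per-jiffy interrupt
budget can be estimated from the sysctl value. The sketch below assumes the
budget is the maximum sample rate divided by the tick rate, rounded up. The
function name max_samples_per_tick() and the tick rates used with it are
assumptions for illustration, not values taken from this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper: samples allowed per timer tick, given the value of
 * /proc/sys/kernel/perf_event_max_sample_rate (samples per second) and the
 * tick rate in ticks per second. Rounds up, so any nonzero rate allows at
 * least one sample per tick. The caller must pass hz > 0.
 */
static uint32_t max_samples_per_tick(uint32_t max_sample_rate, uint32_t hz)
{
	return (max_sample_rate + hz - 1) / hz;
}
```

For example, with an assumed tick rate of 250 a sample rate of 100000 gives
a budget of 400 per tick, well above 100, while a sample rate of 20000 gives
80 and should keep the budget below the 100+ mentioned in the quoted text.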

The fuzzer can reliably reproduce the actual IRQ warning, but a replay of 
a gathered trace will not.

Even weirder, the event that triggers it is attached to CPU0 but only 
triggers if the fuzzer itself is running on a different CPU.  Also the 
trigger event has all of exclude_kernel, exclude_user, and exclude_hv set
so I'm not sure how it even counts up to the 2 retired instructions to 
trigger an overflow anyway.  It doesn't help that there are 100+ other 
active events at the time, although suspiciously a few of them are 
apic_irq tracepoints (though attached to CPUs other than the trouble one).

Anyway, I'm not sure it's worth tracking this down further if it's possible 
to mostly fix the case by fixing the sample_period bounds.

Vince