linux-kernel - Re: bts & perf

Message-ID: <20090630193229.GD20567@elte.hu>
Date:	Tue, 30 Jun 2009 21:32:29 +0200
From:	Ingo Molnar <mingo@...e.hu>
To:	"Metzger, Markus T" <markus.t.metzger@...el.com>
Cc:	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Markus Metzger <markus.t.metzger@...glemail.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: bts & perf_counters

* Metzger, Markus T <markus.t.metzger@...el.com> wrote:

> > How does 'interval' get mixed with BTS?
> 
> We could view BTS as event-based sampling with interval=1. The 
> sample we collect is the <from, to> address pair of an executed 
> branch and the sampling interval is 1, i.e. we store a sample for 
> every branch. Wouldn't this be how BTS integrates into 
> perf_counters?

Yeah, this is how i view it too.
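
To make the "interval=1" view concrete, here is a minimal user-space sketch 
in C. It is not perf_counter code: struct sampler, sampler_init() and 
sampler_on_branch() are hypothetical names made up for illustration. The 
sampler stores every interval-th branch, so interval=1 degenerates into 
storing a <from, to> pair for every branch, which is what BTS does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one <from, to> address pair per sampled branch. */
struct branch_sample {
	uint64_t from;
	uint64_t to;
};

/* Hypothetical event-based sampler: stores every 'interval'-th event. */
struct sampler {
	uint64_t interval;		/* events per sample; 1 models BTS */
	uint64_t pending;		/* events seen since the last sample */
	struct branch_sample *out;	/* caller-provided sample buffer */
	size_t capacity;		/* number of slots in 'out' */
	size_t count;			/* number of samples stored so far */
};

static void sampler_init(struct sampler *s, uint64_t interval,
			 struct branch_sample *out, size_t capacity)
{
	assert(interval > 0);
	s->interval = interval;
	s->pending = 0;
	s->out = out;
	s->capacity = capacity;
	s->count = 0;
}

/* Called once per executed branch. Returns 1 if a sample was stored. */
static int sampler_on_branch(struct sampler *s, uint64_t from, uint64_t to)
{
	if (++s->pending < s->interval)
		return 0;		/* not yet at the sampling interval */
	s->pending = 0;
	if (s->count == s->capacity)
		return 0;		/* buffer full: dropped in this sketch */
	s->out[s->count].from = from;
	s->out[s->count].to = to;
	s->count++;
	return 1;
}
```

With interval set to 1 every call stores a sample; with interval set to 2 
only every second branch is kept.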

> One of the big advantages that comes with using the perf_counter 
> framework is that you could mix branch tracing with other forms of 
> profiling and sampling.

Correct.

> >> Would it be possible for a user to profile the same task twice? 
> >> He could then use different buffers for different sampling 
> >> intervals.
> >
> > It's possible to open multiple counters to the same task, yes.
> 
> That's good. And users could mmap every counter they open in order 
> to get multiple perf event streams?

Yes.

> OK. The existing implementation reconfigured the DS area to have 
> the h/w already collect the trace into the correct buffer. The only 
> copying that is ever needed is to copy it into user-space while 
> translating the arch-specific format into an arch-independent 
> format.
> 
> This is obviously only possible for a single user. Copying the 
> data is definitely more flexible if we expect multiple users of 
> that data with different-sized buffers.

Yeah. [ That decoupling is nice as it also allows multiplexing - 
there's nothing that prevents two independent monitor tasks from 
sampling the same task. (beyond the inevitable runtime overhead 
that is inherent in BTS anyway.) ]

> > If a task schedules out then it will have its DS area drained 
> > already to the mmap buffer - i.e. it's all properly 
> > synchronized.
> 
> When is that draining done? Somewhere in schedule()? Wouldn't that 
> be quite expensive for a few pages of BTS buffer?

Well, it is an open question how frequently we want to move 
information from the DS area into the mmap pages.

The most direct approach would be to 'flush' the DS from two places: 
the threshold IRQ handler plus from the context switch code if the 
BTS counter gets deactivated. In the latter case BTS activities have 
to stop anyway, so the DS can be flushed to the mmap pages.

Or is your mental model for getting the BTS records from the DS to 
the mmap pages significantly different?

I think we should shoot for the simplest approach initially - we can 
do other, more sophisticated streaming modes later as well - they 
will not differ in functionality, only in performance.
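
A rough sketch of that "flush from two places" approach, as a user-space 
model in C. Everything here is an assumption for illustration: struct 
ds_area and struct out_buffer only stand in for the hardware DS area and the 
mmap pages, the sizes are arbitrary, and ds_record_branch() plays the role of 
both the hardware writing a record and the threshold IRQ firing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bts_record {
	uint64_t from;
	uint64_t to;
};

#define DS_CAPACITY	8	/* hypothetical DS size, in records */
#define DS_THRESHOLD	6	/* hypothetical interrupt threshold */
#define OUT_CAPACITY	32	/* hypothetical size of the output buffer */

/* Stand-in for the hardware-filled DS area. */
struct ds_area {
	struct bts_record rec[DS_CAPACITY];
	size_t index;		/* next free slot */
};

/* Stand-in for the counter's mmap pages. */
struct out_buffer {
	struct bts_record rec[OUT_CAPACITY];
	size_t head;		/* number of records copied out so far */
	size_t lost;		/* records dropped because 'rec' was full */
};

/* Copy all pending records out of the DS area, then reset it. */
static void ds_flush(struct ds_area *ds, struct out_buffer *out)
{
	for (size_t i = 0; i < ds->index; i++) {
		if (out->head == OUT_CAPACITY) {
			out->lost++;
			continue;
		}
		out->rec[out->head++] = ds->rec[i];
	}
	ds->index = 0;
}

/* Flush site 1: models a branch being traced and the threshold IRQ. */
static void ds_record_branch(struct ds_area *ds, struct out_buffer *out,
			     uint64_t from, uint64_t to)
{
	assert(ds->index < DS_CAPACITY);
	ds->rec[ds->index].from = from;
	ds->rec[ds->index].to = to;
	ds->index++;
	if (ds->index >= DS_THRESHOLD)
		ds_flush(ds, out);
}

/* Flush site 2: models the context switch deactivating the counter. */
static void ds_sched_out(struct ds_area *ds, struct out_buffer *out)
{
	ds_flush(ds, out);
}
```

Both flush sites share one copy routine, so more elaborate streaming modes 
could later replace ds_flush() without changing what the reader sees.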

> Hmmm, I'll see what I can do. Please don't expect a minimally 
> working prototype to be bug-free from the beginning.

Sure, i don't.

> I see identifying the beginning of the stream as well as random 
> accesses into the stream as bigger open points.
> 
> Maybe we could add a mode where records are zero-extended to a 
> fixed size. This would leave the choice to the user: compact 
> format or random access.

I agree that streaming is a problem because the debugger does not 
really want to poll() - such an output mode and an 'ignore data_tail 
and overwrite old entries' ring-buffer mode of operation should be 
added.

The latter would be useful for tracepoints too for example, so such 
a 'flight recorder' or 'history buffer' mode is not limited to BTS.

So feel free to add something that meets your constant-size records 
needs - and we'll make sure it fits well into the rest of 
perfcounters.
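
For illustration only, the constant-size idea could look roughly like this in 
C. FIXED_RECORD_SIZE, fixed_write() and fixed_read() are hypothetical names 
and not part of any existing interface. Each record is zero-extended to a 
fixed slot size, so record n sits at byte offset n * FIXED_RECORD_SIZE and 
can be reached without walking the stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FIXED_RECORD_SIZE 32	/* hypothetical slot size in bytes */

/*
 * Store 'len' payload bytes in slot 'n' and zero-fill the rest of the slot.
 * Returns 0 on success, -1 if the payload or the slot index does not fit.
 */
static int fixed_write(uint8_t *buf, size_t buf_size, size_t n,
		       const void *payload, size_t len)
{
	uint8_t *slot;

	assert(buf != NULL && payload != NULL);
	if (len > FIXED_RECORD_SIZE)
		return -1;
	if (n >= buf_size / FIXED_RECORD_SIZE)
		return -1;
	slot = buf + n * FIXED_RECORD_SIZE;
	memcpy(slot, payload, len);
	memset(slot + len, 0, FIXED_RECORD_SIZE - len);
	return 0;
}

/* Random access: the address of slot 'n' is computed directly. */
static const uint8_t *fixed_read(const uint8_t *buf, size_t buf_size, size_t n)
{
	if (n >= buf_size / FIXED_RECORD_SIZE)
		return NULL;
	return buf + n * FIXED_RECORD_SIZE;
}
```

The trade-off is the one named in the quoted text: the padding costs space, 
and in return the reader gets constant-time access to any record.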

So based on your suggestion we'd have two streaming models:

 - 'no information loss' output model where user-space poll()s and 
   tries hard not to lose events (this is what profilers and 
   reliable tracers do)

 - 'history ring-buffer' model - this is useful for debuggers and is 
   useful for certain modes of tracing as well. (crash-tracing for 
   example)
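
The difference between the two models can be sketched with a small 
user-space ring buffer in C. This is an illustrative model under stated 
assumptions, not the perfcounters implementation: struct ring, the mode names 
and the helper functions are all hypothetical. In the no-loss mode the writer 
respects the reader's tail and counts what it had to drop; in the history 
mode the writer ignores the tail and overwrites the oldest entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4	/* hypothetical number of slots */

enum ring_mode {
	RING_NO_LOSS,	/* writer never overwrites unread records */
	RING_HISTORY,	/* writer ignores the tail, overwrites oldest */
};

struct ring {
	uint64_t slot[RING_SIZE];
	uint64_t head;		/* total records written */
	uint64_t tail;		/* total records consumed (no-loss mode) */
	uint64_t dropped;	/* records refused because the ring was full */
	enum ring_mode mode;
};

/* Returns 1 if the record was stored, 0 if it was dropped. */
static int ring_write(struct ring *r, uint64_t value)
{
	if (r->mode == RING_NO_LOSS && r->head - r->tail == RING_SIZE) {
		r->dropped++;
		return 0;
	}
	r->slot[r->head % RING_SIZE] = value;
	r->head++;
	return 1;
}

/* No-loss reader: consume the oldest unread record, if there is one. */
static int ring_read(struct ring *r, uint64_t *value)
{
	if (r->tail == r->head)
		return 0;
	*value = r->slot[r->tail % RING_SIZE];
	r->tail++;
	return 1;
}

/* History reader: copy up to 'max' of the newest records, oldest first. */
static size_t ring_snapshot(const struct ring *r, uint64_t *out, size_t max)
{
	uint64_t avail = r->head < RING_SIZE ? r->head : RING_SIZE;
	uint64_t n = avail < max ? avail : max;
	uint64_t start = r->head - n;

	assert(out != NULL);
	for (uint64_t i = 0; i < n; i++)
		out[i] = r->slot[(start + i) % RING_SIZE];
	return (size_t)n;
}
```

A profiler would loop on ring_read() and watch the dropped count, while a 
debugger would call ring_snapshot() once, for example when the traced task 
stops or crashes.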

	Ingo