linux-kernel - Re: Unified tracing buffer

Date:	Mon, 22 Sep 2008 19:27:23 +0530
From:	"K.Prasad" <prasad@...ux.vnet.ibm.com>
To:	Martin Bligh <mbligh@...gle.com>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Mathieu Desnoyers <compudj@...stal.dyndns.org>,
	Steven Rostedt <rostedt@...dmis.org>, od@...ell.com,
	"Frank Ch. Eigler" <fche@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>, hch@....de,
	David Wilder <dwilder@...ibm.com>, zanussi@...cast.net
Subject: Re: Unified tracing buffer

On Fri, Sep 19, 2008 at 02:33:42PM -0700, Martin Bligh wrote:
> During kernel summit and Plumbers conference, Linus and others expressed a
> desire for a unified tracing buffer system for multiple tracing applications
> (eg ftrace, lttng, systemtap, blktrace, etc) to use.
> This provides several advantages, including the ability to interleave data
> from multiple sources, not having to learn 200 different tools, duplicated
> code/effort, etc.
> 

With due apologies for pitching in late, I would like to draw attention to
two new interfaces, relay_printk() and relay_dump(), which have been part of
the -mm tree since 2.6.27-rc5-mm1 and are meant to address such needs; not
completely in their present form, but quite substantially (refer to
Documentation/filesystems/relay.txt). As far as reusability is concerned,
many parts of these interfaces are adopted directly from SystemTap's
runtime. Blktrace has been made to work using these interfaces
(http://tinyurl.com/4q9d4p), removing about 130 lines of code from the
blktrace-related files.

With more effort, additions such as a) the ability to specify custom names
for files and b) the ability to create user-defined control files (in
addition to the defaults) would make it usable with tracers such as ftrace
(ref: http://tinyurl.com/3ppbwh); this is something I intended to work on.

The relay_printk() interface provides a high-level abstraction over 'relay',
masking all the setup/tear-down details and offering the ability to use
per-CPU buffers; relay_dump() is its equivalent for binary dumping through
the debugfs interface (a requirement for the unified tracing buffer, as I
gather from the email). The use of default file names and a default debugfs
output path also greatly reduces the setup code required of the end user,
while the defaults can still be overridden when a special case requires it.
Examples of the resulting brevity can be seen in samples/relay/*.c in the
2.6.27-rc5-mm1 tree.

I am quite sure that, with minimal changes to the infrastructure underlying
these two interfaces, we can meet most of the requirements stated above, and
I am open to suggestions.

Kindly let me know what the community thinks of this.

Thanks,
K.Prasad

> Several of us got together last night and tried to cut this down to the
> simplest usable system we could agree on (and nobody got hurt!). This will
> form version 1. I've sketched out a few enhancements we know that we want,
> but have agreed to leave these until version 2.
> The answer to most questions about the below is "yes we know, we'll fix
> that in version 2" (or 3). Simplicity was the rule ...
> 
> Sketch of design.  Enjoy flaming me. Code will follow shortly.
> 
> 
> STORAGE
> -------
> 
> We will support multiple buffers for different tracing systems, with
> separate names, event id spaces.
> Event ids are 16 bit, dynamically allocated.
> A "one line of text" print function will be provided for each event,
> or use the default (probably hex printf)
> Will provide a "flight data recorder" mode, and a "spool to disk" mode.
> 
> Circular buffer per cpu, protected by per-cpu spinlock_irq
> Word aligned records.
> Variable record length, header will start with length record.
> Timestamps in fixed timebase, monotonically increasing (across all CPUs)
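
To make the STORAGE notes above concrete, here is a minimal user-space sketch in C of word-aligned, variable-length records in a circular buffer that overwrites its oldest data ("flight data recorder" mode). It is not code from this thread: struct trace_rec_hdr, its field layout, the 64-byte ring, and the 8-byte word size are all assumptions chosen for illustration, and the per-cpu locking mentioned in the design is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE  64u               /* tiny power-of-two ring, demo only */
#define TRACE_WORD sizeof(uint64_t)  /* assumed word size for alignment */

/* Hypothetical header; the length comes first, as the design states. */
struct trace_rec_hdr {
    uint16_t len;          /* whole record: header + payload + padding */
    uint16_t event_id;     /* 16-bit event id */
    uint32_t payload_len;  /* payload bytes actually stored */
    uint64_t timestamp;    /* full timestamp in every record (v1) */
};

struct ring {
    uint8_t  data[RING_SIZE];
    uint32_t head;  /* next write position (free-running counter) */
    uint32_t tail;  /* oldest record (free-running counter) */
};

/* Round n up to the next multiple of the word size. */
static size_t align_word(size_t n)
{
    return (n + TRACE_WORD - 1) & ~(TRACE_WORD - 1);
}

static void ring_copy_in(struct ring *r, uint32_t pos, const void *src, size_t n)
{
    const uint8_t *s = src;
    for (size_t i = 0; i < n; i++)
        r->data[(pos + i) & (RING_SIZE - 1)] = s[i];
}

static void ring_copy_out(const struct ring *r, uint32_t pos, void *dst, size_t n)
{
    uint8_t *d = dst;
    for (size_t i = 0; i < n; i++)
        d[i] = r->data[(pos + i) & (RING_SIZE - 1)];
}

/* Append one record, dropping the oldest records if space is short. */
static int ring_write(struct ring *r, uint16_t event_id, uint64_t ts,
                      const void *payload, uint32_t plen)
{
    size_t total = align_word(sizeof(struct trace_rec_hdr) + plen);
    struct trace_rec_hdr hdr;

    assert(r != NULL);
    if (total > RING_SIZE)
        return -1;
    while (RING_SIZE - (r->head - r->tail) < total) {
        ring_copy_out(r, r->tail, &hdr, sizeof(hdr));
        r->tail += hdr.len;  /* skip over the oldest record */
    }
    hdr.len = (uint16_t)total;
    hdr.event_id = event_id;
    hdr.payload_len = plen;
    hdr.timestamp = ts;
    ring_copy_in(r, r->head, &hdr, sizeof(hdr));
    ring_copy_in(r, r->head + (uint32_t)sizeof(hdr), payload, plen);
    r->head += (uint32_t)total;
    return 0;
}

/* Pop the oldest record: 1 on success, 0 if empty, -1 if buf is too small. */
static int ring_read(struct ring *r, struct trace_rec_hdr *hdr,
                     void *buf, size_t buflen)
{
    if (r->tail == r->head)
        return 0;
    ring_copy_out(r, r->tail, hdr, sizeof(*hdr));
    if (hdr->payload_len > buflen)
        return -1;
    ring_copy_out(r, r->tail + (uint32_t)sizeof(*hdr), buf, hdr->payload_len);
    r->tail += hdr->len;
    return 1;
}
```

Because every record length is rounded up to a word multiple, each header starts on a word boundary, and a reader can walk the buffer using only the leading length field.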
> 
> 
> INPUT_FUNCTIONS
> ---------------
> 
> allocate_buffer (name, size)
>         return buffer_handle
> 
> register_event (buffer_handle, event_id, print_function)
>         You can pass in a requested event_id from a fixed set, and will be
>         given it, or an error
>         0 means allocate me one dynamically
>         returns event_id     (or -E_ERROR)
> 
> record_event (buffer_handle, event_id, length, *buf)
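
The register_event semantics quoted above (a requested id from a fixed set, or 0 for a dynamically allocated one, with a default hex printer when none is supplied) could look roughly like the following user-space sketch in C. Everything beyond the function name is an assumption: the print_fn signature, the 64-entry table, the split between fixed ids 1..15 and dynamic ids from 16, and the specific errno values are hypothetical and only illustrate the contract.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_EVENTS       64  /* demo table size; the design allows 16-bit ids */
#define FIRST_DYNAMIC_ID 16  /* hypothetical split: 1..15 fixed, 16.. dynamic */

/* Hypothetical "one line of text" print function for an event. */
typedef int (*print_fn)(char *out, size_t outlen, const void *buf, size_t len);

struct trace_buffer {
    const char *name;
    print_fn    printers[MAX_EVENTS];  /* NULL slot means the id is unused */
};

/* Default printer: hex dump of the payload, truncated to fit out. */
static int default_hex_print(char *out, size_t outlen,
                             const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t n = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len && n + 3 <= outlen; i++)
        n += (size_t)snprintf(out + n, outlen - n, "%02x", p[i]);
    return (int)n;
}

/* Returns the event id that was registered, or a negative errno value. */
static int register_event(struct trace_buffer *b, int event_id, print_fn fn)
{
    assert(b != NULL);
    if (!fn)
        fn = default_hex_print;

    if (event_id == 0) {  /* 0 means "allocate me one dynamically" */
        for (int id = FIRST_DYNAMIC_ID; id < MAX_EVENTS; id++) {
            if (!b->printers[id]) {
                b->printers[id] = fn;
                return id;
            }
        }
        return -ENOSPC;
    }
    if (event_id < 0 || event_id >= FIRST_DYNAMIC_ID)
        return -EINVAL;   /* not in the fixed set */
    if (b->printers[event_id])
        return -EBUSY;    /* fixed id already taken */
    b->printers[event_id] = fn;
    return event_id;
}
```

Keeping the fixed and dynamic ranges disjoint means a dynamic allocation can never steal an id that a tracer later asks for by number.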
> 
> 
> OUTPUT
> ------
> 
> Data will be output via debugfs, and provide the following output streams:
> 
> /debugfs/tracing/<name>/buffers/text
>     clear text stream (will merge the per-cpu streams via insertion sort,
>     and use the print functions)
> 
> /debugfs/tracing/<name>/buffers/binary[cpu_number]
>     per-cpu binary data
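
The merged text stream relies on each per-cpu stream already being in timestamp order. As a rough illustration, the sketch below merges such streams in user-space C with a plain linear scan for the smallest pending timestamp; this is a simpler stand-in for the insertion sort the design mentions, not the proposed implementation. The struct event and struct cpu_stream types are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded event: just a timestamp and an event id. */
struct event {
    uint64_t ts;
    uint16_t id;
};

/* One per-cpu stream, already sorted by timestamp. */
struct cpu_stream {
    const struct event *ev;
    size_t n;    /* number of events in ev */
    size_t pos;  /* next unread event */
};

/*
 * Merge up to outcap events from ncpu sorted streams into out, in
 * timestamp order. On equal timestamps the lower-numbered cpu wins.
 * Returns the number of events written.
 */
static size_t merge_streams(struct cpu_stream *s, size_t ncpu,
                            struct event *out, size_t outcap)
{
    size_t produced = 0;

    assert(outcap == 0 || out != NULL);
    while (produced < outcap) {
        struct cpu_stream *best = NULL;

        for (size_t c = 0; c < ncpu; c++) {
            if (s[c].pos == s[c].n)
                continue;  /* this cpu's stream is drained */
            if (!best || s[c].ev[s[c].pos].ts < best->ev[best->pos].ts)
                best = &s[c];
        }
        if (!best)
            break;         /* every stream is drained */
        out[produced++] = best->ev[best->pos++];
    }
    return produced;
}
```

The merge is only meaningful if timestamps are comparable across CPUs, which is why the design asks for a fixed timebase that increases monotonically across all CPUs.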
> 
> 
> CONTROL
> -------
> 
> Sysfs style tree under debugfs
> 
> /debugfs/tracing/<name>/buffers/enabled        <--- binary value
> 
> /debugfs/tracing/<name>/<event1>
> /debugfs/tracing/<name>/<event2>
>     etc ...
>     provides a way to enable/disable events, see what's available, and
>     what's enabled.
> 
> 
> KNOWN ISSUES / PLANS
> -------------------
> 
> No way to unregister buffers and events.
>     Will provide an unregister_buffer and unregister_event call
> 
> 
> Generating systemwide time is hard on some platforms
>     Yes. Time-based output provides a lot of simplicity for the user though
>     We won't support these platforms at first, we'll add functionality to
>     make it work for them later.
>     (plan based on tick-based ms timing, plus counter offset from that if
>     needed).
> 
> Spinlock_irq is inefficient, and doesn't support tracing in NMIs
>     True. We'll implement a lockless scheme later (see lttng)
> 
> Putting a length record in every event is inefficient
>     True. Fixed record length with optional extensions is better, but more
>     complex. v2.
> 
> Putting a full timestamp rather than an offset in every event is inefficient
>     See above. True, but v2.
> 
> Relayfs already exists! use that!
>     People were universally not keen on that idea. Complexity, interface, etc.
>     We're also providing some higher level shared functions for time & event
>     ids.
> 
> There's no way to decode the binary data stream
>     Code will be shared from the kernel to decode it, so that we can get the
>     compact binary format and decode it later. That code will be kept in the
>     kernel tree (it's a trivial piece of C).
>     Version 1.1 ;-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 