Message-ID: <1364475498.6345.223.camel@gandalf.local.home>
Date:	Thu, 28 Mar 2013 08:58:18 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	LKML <linux-kernel@...r.kernel.org>
Cc:	Ingo Molnar <mingo@...nel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Namhyung Kim <namhyung@...nel.org>,
	Keun-O Park <kpark3469@...il.com>,
	David Sharp <dhsharp@...gle.com>
Subject: Re: [GIT PULL] tracing: multibuffers, new triggers, clocks, and more

Ping?

-- Steve


On Fri, 2013-03-22 at 17:30 -0400, Steven Rostedt wrote:
> Ingo,
> 
> A lot has changed and this has been in linux-next for a while. Instead
> of spamming LKML with a large patch set, as all changes have already
> been posted to LKML, I'm posting this as one big patch of all the
> changes involved. Here's the summary:
> 
> The biggest change was the addition of multiple tracing buffers and a
> new directory called "instances". Doing a mkdir here creates a new
> tracing directory that has its own buffers. Only trace events can be
> enabled and currently no tracers can (that's for 3.11 ;-). But it's fully
> functional. It also includes support for snapshots, per-CPU referencing,
> and buffer management.
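> 
> A rough sketch of the interface (the "foo" instance name and the
> sched_switch event are just examples; paths assume debugfs is mounted
> at /sys/kernel/debug):
> 
>   # cd /sys/kernel/debug/tracing
>   # mkdir instances/foo
>   # echo 1 > instances/foo/events/sched/sched_switch/enable
>   # cat instances/foo/trace
>   # rmdir instances/foo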
> 
> Use of slabs has brought the memory footprint down a little.
> 
> The tracing files now block as they should, as described in the read(2)
> man pages.
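> 
> For example, a consuming read of trace_pipe (or trace_pipe_raw) blocks
> until new events arrive (path assumes debugfs mounted at
> /sys/kernel/debug):
> 
>   # cd /sys/kernel/debug/tracing
>   # cat trace_pipe      <-- blocks until there is data to read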
> 
> The max_tr has been replaced by the trace_array holding two buffer
> pointers that can now swap. This allows the multiple buffers to also
> take advantage of snapshots.
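> 
> A minimal sketch of using it (the "echo 1 > snapshot" usage is recalled
> from the Snapshot section referenced in the documentation below, so
> treat it as illustrative):
> 
>   # cd /sys/kernel/debug/tracing
>   # echo 1 > events/enable
>   # echo 1 > snapshot      <-- swap in the spare buffer (take a snapshot)
>   # cat snapshot           <-- read the frozen data while tracing continues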
> 
> Added allocation of the snapshot buffer via the kernel command line.
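> 
> For example, booting with the parameter added to kernel-parameters.txt
> below:
> 
>   alloc_snapshot
> 
> allocates the snapshot buffer when the main buffer is allocated, so
> tracing_snapshot() can be used on boot up without needing
> tracing_snapshot_alloc().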
> 
> Added trace_puts() and special macro magic to trace_printk() to use it
> when the format string has no arguments. This gives trace_printk() an
> even smaller footprint when recording what is happening.
> 
> Added new function triggers: when the function tracer hits a specified
> function, it can enable/disable an event, take a snapshot, or record a
> stacktrace.
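> 
> Roughly, these are set through set_ftrace_filter (do_fork and schedule
> are just example functions, and the <function>:<command> syntax here is
> from memory of that interface, so treat it as a sketch):
> 
>   # cd /sys/kernel/debug/tracing
>   # echo 'do_fork:snapshot' > set_ftrace_filter
>   # echo 'schedule:stacktrace' > set_ftrace_filter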
> 
> Added new trace clocks: uptime and perf
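> 
> They are selected the same way as the existing clocks (per the
> trace_clock documentation below):
> 
>   # cat trace_clock
>   # echo uptime > trace_clock
>   # echo perf > trace_clock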
> 
> Added a new ring buffer self test, to make sure it doesn't lose any events
> (it never did, but something else caused events to be lost and I thought
> it was the ring buffer).
> 
> Updated some much needed documentation.
> 
> -- Steve
> 
> 
> Please pull the latest tip/perf/core tree, which can be found at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> tip/perf/core
> 
> Head SHA1: 22f45649ce08642ad7df238d5c25fa5c86bfdd31
> 
> 
> Li Zefan (4):
>       tracing: Add a helper function for event print functions
>       tracing: Annotate event field-defining functions with __init
>       tracing/syscalls: Annotate field-defining functions with __init
>       tracing: Fix some section mismatch warnings
> 
> Steven Rostedt (13):
>       tracing: Separate out trace events from global variables
>       tracing: Use RING_BUFFER_ALL_CPUS for TRACE_PIPE_ALL_CPU
>       tracing: Encapsulate global_trace and remove dependencies on global vars
>       tracing: Pass the ftrace_file to the buffer lock reserve code
>       tracing: Replace the static global per_cpu arrays with allocated per_cpu
>       tracing: Make syscall events suitable for multiple buffers
>       tracing: Add interface to allow multiple trace buffers
>       tracing: Add rmdir to remove multibuffer instances
>       tracing: Get trace_events kernel command line working again
>       tracing: Use kmem_cache_alloc instead of kmalloc in trace_events.c
>       tracing: Use direct field, type and system names
>       tracing: Fix polling on trace_pipe_raw
>       tracing: Fix read blocking on trace_pipe_raw
> 
> Steven Rostedt (Red Hat) (50):
>       tracing: Do not block on splice if either file or splice NONBLOCK flag is set
>       tracing/ring-buffer: Move poll wake ups into ring buffer code
>       tracing: Add __per_cpu annotation to trace array percpu data pointer
>       tracing: Fix trace events build without modules
>       ring-buffer: Init waitqueue for blocked readers
>       tracing: Add comment for trace event flag IGNORE_ENABLE
>       tracing: Only clear trace buffer on module unload if event was traced
>       tracing: Clear all trace buffers when unloaded module event was used
>       tracing: Enable snapshot when any latency tracer is enabled
>       tracing: Consolidate max_tr into main trace_array structure
>       tracing: Add snapshot in the per_cpu trace directories
>       tracing: Add config option to allow snapshot to swap per cpu
>       tracing: Add snapshot_raw to extract the raw data from snapshot
>       tracing: Have trace_array keep track if snapshot buffer is allocated
>       tracing: Consolidate buffer allocation code
>       tracing: Add snapshot feature to instances
>       tracing: Add per_cpu directory into tracing instances
>       tracing: Prevent deleting instances when they are being read
>       tracing: Add internal tracing_snapshot() functions
>       ring-buffer: Do not use schedule_work_on() for current CPU
>       tracing: Move the tracing selftest code into its own function
>       tracing: Add alloc_snapshot kernel command line parameter
>       tracing: Fix the branch tracer that broke with buffer change
>       tracing: Add trace_puts() for even faster trace_printk() tracing
>       tracing: Optimize trace_printk() with one arg to use trace_puts()
>       tracing: Add internal ftrace trace_puts() for ftrace to use
>       tracing: Let tracing_snapshot() be used by modules but not NMI
>       tracing: Consolidate updating of count for traceon/off
>       tracing: Consolidate ftrace_trace_onoff_unreg() into callback
>       ftrace: Separate unlimited probes from count limited probes
>       ftrace: Fix function probe to only enable needed functions
>       tracing: Add alloc/free_snapshot() to replace duplicate code
>       tracing: Add snapshot trigger to function probes
>       tracing: Fix comments for ftrace_event_file/call flags
>       ftrace: Clean up function probe methods
>       ftrace: Use manual free after synchronize_sched() not call_rcu_sched()
>       tracing: Add a way to soft disable trace events
>       tracing: Add function probe triggers to enable/disable events
>       tracing: Add skip argument to trace_dump_stack()
>       tracing: Add function probe to trigger stack traces
>       tracing: Use stack of calling function for stack tracer
>       tracing: Fix stack tracer with fentry use
>       tracing: Remove most or all of stack tracer stack size from stack_max_size
>       tracing: Add function-trace option to disable function tracing of latency tracers
>       tracing: Add "uptime" trace clock that uses jiffies
>       tracing: Add "perf" trace_clock
>       tracing: Bring Documentation/trace/ftrace.txt up to date
>       ring-buffer: Add ring buffer startup selftest
>       tracing: Fix ftrace_dump()
>       tracing: Update debugfs README file
> 
> zhangwei(Jovi) (6):
>       tracing: Use pr_warn_once instead of open coded implementation
>       tracing: Use TRACE_MAX_PRINT instead of constant
>       tracing: Move find_event_field() into trace_events.c
>       tracing: Convert trace_destroy_fields() to static
>       tracing: Fix comment about prefix in arch_syscall_match_sym_name()
>       tracing: Rename trace_event_mutex to trace_event_sem
> 
> ----
>  Documentation/kernel-parameters.txt  |    7 +
>  Documentation/trace/ftrace.txt       | 2097 ++++++++++++++++++++++----------
>  include/linux/ftrace.h               |    6 +-
>  include/linux/ftrace_event.h         |  109 +-
>  include/linux/kernel.h               |   70 +-
>  include/linux/ring_buffer.h          |    6 +
>  include/linux/trace_clock.h          |    1 +
>  include/trace/ftrace.h               |   47 +-
>  kernel/trace/Kconfig                 |   49 +
>  kernel/trace/blktrace.c              |    4 +-
>  kernel/trace/ftrace.c                |   73 +-
>  kernel/trace/ring_buffer.c           |  500 +++++++-
>  kernel/trace/trace.c                 | 2204 ++++++++++++++++++++++++----------
>  kernel/trace/trace.h                 |  144 ++-
>  kernel/trace/trace_branch.c          |    8 +-
>  kernel/trace/trace_clock.c           |   10 +
>  kernel/trace/trace_entries.h         |   23 +-
>  kernel/trace/trace_events.c          | 1421 +++++++++++++++++-----
>  kernel/trace/trace_events_filter.c   |   34 +-
>  kernel/trace/trace_export.c          |    4 +-
>  kernel/trace/trace_functions.c       |  207 +++-
>  kernel/trace/trace_functions_graph.c |   12 +-
>  kernel/trace/trace_irqsoff.c         |   85 +-
>  kernel/trace/trace_kdb.c             |   12 +-
>  kernel/trace/trace_mmiotrace.c       |   12 +-
>  kernel/trace/trace_output.c          |  119 +-
>  kernel/trace/trace_output.h          |    4 +-
>  kernel/trace/trace_sched_switch.c    |    8 +-
>  kernel/trace/trace_sched_wakeup.c    |   87 +-
>  kernel/trace/trace_selftest.c        |   51 +-
>  kernel/trace/trace_stack.c           |   74 +-
>  kernel/trace/trace_syscalls.c        |   90 +-
>  32 files changed, 5672 insertions(+), 1906 deletions(-)
> ---------------------------
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 6c72381..0edc409 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -320,6 +320,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  			on: enable for both 32- and 64-bit processes
>  			off: disable for both 32- and 64-bit processes
>  
> +	alloc_snapshot	[FTRACE]
> +			Allocate the ftrace snapshot buffer on boot up when the
> +			main buffer is allocated. This is handy if debugging
> +			and you need to use tracing_snapshot() on boot up, and
> +			do not want to use tracing_snapshot_alloc() as it needs
> +			to be done where GFP_KERNEL allocations are allowed.
> +
>  	amd_iommu=	[HW,X86-64]
>  			Pass parameters to the AMD IOMMU driver in the system.
>  			Possible values are:
> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
> index a372304..bfe8c29 100644
> --- a/Documentation/trace/ftrace.txt
> +++ b/Documentation/trace/ftrace.txt
> @@ -8,6 +8,7 @@ Copyright 2008 Red Hat Inc.
>  Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
>  	     John Kacur, and David Teigland.
>  Written for: 2.6.28-rc2
> +Updated for: 3.10
>  
>  Introduction
>  ------------
> @@ -17,13 +18,16 @@ designers of systems to find what is going on inside the kernel.
>  It can be used for debugging or analyzing latencies and
>  performance issues that take place outside of user-space.
>  
> -Although ftrace is the function tracer, it also includes an
> -infrastructure that allows for other types of tracing. Some of
> -the tracers that are currently in ftrace include a tracer to
> -trace context switches, the time it takes for a high priority
> -task to run after it was woken up, the time interrupts are
> -disabled, and more (ftrace allows for tracer plugins, which
> -means that the list of tracers can always grow).
> +Although ftrace is typically considered the function tracer, it
> +is really a framework of several assorted tracing utilities.
> +There's latency tracing to examine what occurs between interrupts
> +disabled and enabled, as well as for preemption, and from the time
> +a task is woken to the time it is actually scheduled in.
> +
> +One of the most common uses of ftrace is the event tracing.
> +Throughout the kernel are hundreds of static event points that
> +can be enabled via the debugfs file system to see what is
> +going on in certain parts of the kernel.
>  
> 
>  Implementation Details
> @@ -61,7 +65,7 @@ the extended "/sys/kernel/debug/tracing" path name.
>  
>  That's it! (assuming that you have ftrace configured into your kernel)
>  
> -After mounting the debugfs, you can see a directory called
> +After mounting debugfs, you can see a directory called
>  "tracing".  This directory contains the control and output files
>  of ftrace. Here is a list of some of the key files:
>  
> @@ -84,7 +88,9 @@ of ftrace. Here is a list of some of the key files:
>  
>  	This sets or displays whether writing to the trace
>  	ring buffer is enabled. Echo 0 into this file to disable
> -	the tracer or 1 to enable it.
> +	the tracer or 1 to enable it. Note, this only disables
> +	writing to the ring buffer, the tracing overhead may
> +	still be occurring.
>  
>    trace:
>  
> @@ -109,7 +115,15 @@ of ftrace. Here is a list of some of the key files:
>  
>  	This file lets the user control the amount of data
>  	that is displayed in one of the above output
> -	files.
> +	files. Options also exist to modify how a tracer
> +	or events work (stack traces, timestamps, etc).
> +
> +  options:
> +
> +	This is a directory that has a file for every available
> +	trace option (also in trace_options). Options may also be set
> +	or cleared by writing a "1" or "0" respectively into the
> +	corresponding file with the option name.
>  
>    tracing_max_latency:
>  
> @@ -121,10 +135,17 @@ of ftrace. Here is a list of some of the key files:
>  	latency is greater than the value in this
>  	file. (in microseconds)
>  
> +  tracing_thresh:
> +
> +	Some latency tracers will record a trace whenever the
> +	latency is greater than the number in this file.
> +	Only active when the file contains a number greater than 0.
> +	(in microseconds)
> +
>    buffer_size_kb:
>  
>  	This sets or displays the number of kilobytes each CPU
> -	buffer can hold. The tracer buffers are the same size
> +	buffer holds. By default, the trace buffers are the same size
>  	for each CPU. The displayed number is the size of the
>  	CPU buffer and not total size of all buffers. The
>  	trace buffers are allocated in pages (blocks of memory
> @@ -133,16 +154,30 @@ of ftrace. Here is a list of some of the key files:
>  	than requested, the rest of the page will be used,
>  	making the actual allocation bigger than requested.
>  	( Note, the size may not be a multiple of the page size
> -	  due to buffer management overhead. )
> +	  due to buffer management meta-data. )
>  
> -	This can only be updated when the current_tracer
> -	is set to "nop".
> +  buffer_total_size_kb:
> +
> +	This displays the total combined size of all the trace buffers.
> +
> +  free_buffer:
> +
> +	If a process is performing the tracing, and the ring buffer
> +	should be shrunk ("freed") when the process is finished, even
> +	if it were to be killed by a signal, this file can be used
> +	for that purpose. On close of this file, the ring buffer will
> +	be resized to its minimum size. If a process that is tracing
> +	also has this file open, then when the process exits, its file
> +	descriptor for this file will be closed, and in doing so, the
> +	ring buffer will be "freed".
> +
> +	It may also stop tracing if disable_on_free option is set.
>  
>    tracing_cpumask:
>  
>  	This is a mask that lets the user only trace
> -	on specified CPUS. The format is a hex string
> -	representing the CPUS.
> +	on specified CPUs. The format is a hex string
> +	representing the CPUs.
>  
>    set_ftrace_filter:
>  
> @@ -183,6 +218,261 @@ of ftrace. Here is a list of some of the key files:
>  	"set_ftrace_notrace". (See the section "dynamic ftrace"
>  	below for more details.)
>  
> +  enabled_functions:
> +
> +	This file is more for debugging ftrace, but can also be useful
> +	in seeing if any function has a callback attached to it.
> +	Not only does the trace infrastructure use the ftrace function
> +	tracing utility, but other subsystems might too. This file
> +	displays all functions that have a callback attached to them
> +	as well as the number of callbacks that have been attached.
> +	Note, a callback may also call multiple functions which will
> +	not be listed in this count.
> +
> +	If the callback registered to be traced by a function with
> +	the "save regs" attribute (thus even more overhead), an 'R'
> +	will be displayed on the same line as the function that
> +	is returning registers.
> +
> +  function_profile_enabled:
> +
> +	When set it will enable all functions with either the function
> +	tracer, or if enabled, the function graph tracer. It will
> +	keep a histogram of the number of functions that were called
> +	and if run with the function graph tracer, it will also keep
> +	track of the time spent in those functions. The histogram
> +	content can be displayed in the files:
> +
> +	trace_stats/function<cpu> ( function0, function1, etc).
> +
> +  trace_stats:
> +
> +	A directory that holds different tracing stats.
> +
> +  kprobe_events:
> + 
> +	Enable dynamic trace points. See kprobetrace.txt.
> +
> +  kprobe_profile:
> +
> +	Dynamic trace points stats. See kprobetrace.txt.
> +
> +  max_graph_depth:
> +
> +	Used with the function graph tracer. This is the max depth
> +	it will trace into a function. Setting this to a value of
> +	one will show only the first kernel function that is called
> +	from user space.
> +
> +  printk_formats:
> +
> +	This is for tools that read the raw format files. If an event in
> +	the ring buffer references a string (currently only trace_printk()
> +	does this), only a pointer to the string is recorded into the buffer
> +	and not the string itself. This prevents tools from knowing what
> +	that string was. This file displays the string and address for
> +	the string allowing tools to map the pointers to what the
> +	strings were.
> +
> +  saved_cmdlines:
> +
> +	Only the pid of the task is recorded in a trace event unless
> +	the event specifically saves the task comm as well. Ftrace
> +	makes a cache of pid mappings to comms to try to display
> +	comms for events. If a pid for a comm is not listed, then
> +	"<...>" is displayed in the output.
> +
> +  snapshot:
> +
> +	This displays the "snapshot" buffer and also lets the user
> +	take a snapshot of the current running trace.
> +	See the "Snapshot" section below for more details.
> +
> +  stack_max_size:
> +
> +	When the stack tracer is activated, this will display the
> +	maximum stack size it has encountered.
> +	See the "Stack Trace" section below.
> +
> +  stack_trace:
> +
> +	This displays the stack back trace of the largest stack
> +	that was encountered when the stack tracer is activated.
> +	See the "Stack Trace" section below.
> +
> +  stack_trace_filter:
> +
> +	This is similar to "set_ftrace_filter" but it limits what
> +	functions the stack tracer will check.
> +
> +  trace_clock:
> +
> +	Whenever an event is recorded into the ring buffer, a
> +	"timestamp" is added. This stamp comes from a specified
> +	clock. By default, ftrace uses the "local" clock. This
> +	clock is very fast and strictly per cpu, but on some
> +	systems it may not be monotonic with respect to other
> +	CPUs. In other words, the local clocks may not be in sync
> +	with local clocks on other CPUs.
> +
> +	Usual clocks for tracing:
> +
> +	  # cat trace_clock
> +	  [local] global counter x86-tsc
> +
> +	  local: Default clock, but may not be in sync across CPUs
> +
> +	  global: This clock is in sync with all CPUs but may
> +	  	  be a bit slower than the local clock.
> +
> +	  counter: This is not a clock at all, but literally an atomic
> +	  	   counter. It counts up one by one, but is in sync
> +		   with all CPUs. This is useful when you need to
> +		   know exactly the order events occurred with respect to
> +		   each other on different CPUs.
> +
> +	  uptime: This uses the jiffies counter and the time stamp
> +	  	  is relative to the time since boot up.
> +
> +	  perf: This makes ftrace use the same clock that perf uses.
> +	  	Eventually perf will be able to read ftrace buffers
> +		and this will help out in interleaving the data.
> +
> +	  x86-tsc: Architectures may define their own clocks. For
> +	  	   example, x86 uses its own TSC cycle clock here.
> +
> +	To set a clock, simply echo the clock name into this file.
> +
> +	  echo global > trace_clock
> +
> +  trace_marker:
> +
> +	This is a very useful file for synchronizing user space
> +	with events happening in the kernel. Strings written into
> +	this file will be recorded in the ftrace buffer.
> +
> +	It is useful in applications to open this file at the start
> +	of the application and just reference the file descriptor
> +	for the file.
> +
> +	void trace_write(const char *fmt, ...)
> +	{
> +		va_list ap;
> +		char buf[256];
> +		int n;
> +
> +		if (trace_fd < 0)
> +			return;
> +
> +		va_start(ap, fmt);
> +		n = vsnprintf(buf, 256, fmt, ap);
> +		va_end(ap);
> +		/* vsnprintf() returns the would-be length; cap at what fit in buf */
> +		write(trace_fd, buf, n < 256 ? n : 255);
> +	}
> +
> +	start:
> +
> +		trace_fd = open("trace_marker", O_WRONLY);
> +
> +  uprobe_events:
> + 
> +	Add dynamic tracepoints in programs.
> +	See uprobetracer.txt
> +
> +  uprobe_profile:
> +
> +	Uprobe statistics. See uprobetracer.txt
> +
> +  instances:
> +
> +	This is a way to make multiple trace buffers where different
> +	events can be recorded in different buffers.
> +	See "Instances" section below.
> +
> +  events:
> +
> +	This is the trace event directory. It holds event tracepoints
> +	(also known as static tracepoints) that have been compiled
> +	into the kernel. It shows what event tracepoints exist
> +	and how they are grouped by system. There are "enable"
> +	files at various levels that can enable the tracepoints
> +	when a "1" is written to them.
> +
> +	See events.txt for more information.
> +
> +  per_cpu:
> +
> +	This is a directory that contains the trace per_cpu information.
> +
> +  per_cpu/cpu0/buffer_size_kb:
> +
> +	The ftrace buffer is defined per_cpu. That is, there's a separate
> +	buffer for each CPU to allow writes to be done atomically,
> +	and free from cache bouncing. These buffers may have
> +	different sizes. This file is similar to the buffer_size_kb
> +	file, but it only displays or sets the buffer size for the
> +	specific CPU. (here cpu0).
> +
> +  per_cpu/cpu0/trace:
> +
> +	This is similar to the "trace" file, but it will only display
> +	the data specific for the CPU. If written to, it only clears
> +	the specific CPU buffer.
> +
> +  per_cpu/cpu0/trace_pipe
> +
> +	This is similar to the "trace_pipe" file, and is a consuming
> +	read, but it will only display (and consume) the data specific
> +	for the CPU.
> +
> +  per_cpu/cpu0/trace_pipe_raw
> +
> +	For tools that can parse the ftrace ring buffer binary format,
> +	the trace_pipe_raw file can be used to extract the data
> +	from the ring buffer directly. With the use of the splice()
> +	system call, the buffer data can be quickly transferred to
> +	a file or to the network where a server is collecting the
> +	data.
> +
> +	Like trace_pipe, this is a consuming reader, where multiple
> +	reads will always produce different data.
> +
> +  per_cpu/cpu0/snapshot:
> +
> +	This is similar to the main "snapshot" file, but will only
> +	snapshot the current CPU (if supported). It only displays
> +	the content of the snapshot for a given CPU, and if
> +	written to, only clears this CPU buffer.
> +
> +  per_cpu/cpu0/snapshot_raw:
> +
> +	Similar to the trace_pipe_raw, but will read the binary format
> +	from the snapshot buffer for the given CPU.
> +
> +  per_cpu/cpu0/stats:
> +
> +	This displays certain stats about the ring buffer:
> +
> +	 entries: The number of events that are still in the buffer.
> +
> +	 overrun: The number of lost events due to overwriting when
> +	 	  the buffer was full.
> +
> +	 commit overrun: Should always be zero.
> +	 	This gets set if so many events happened within a nested
> +		event (ring buffer is re-entrant), that it fills the
> +		buffer and starts dropping events.
> +
> +	 bytes: Bytes actually read (not overwritten).
> +
> +	 oldest event ts: The oldest timestamp in the buffer
> +
> +	 now ts: The current timestamp
> +
> +	 dropped events: Events lost due to overwrite option being off.
> +
> +	 read events: The number of events read.
>  
>  The Tracers
>  -----------
> @@ -234,11 +524,6 @@ Here is the list of current tracers that may be configured.
>          RT tasks (as the current "wakeup" does). This is useful
>          for those interested in wake up timings of RT tasks.
>  
> -  "hw-branch-tracer"
> -
> -	Uses the BTS CPU feature on x86 CPUs to traces all
> -	branches executed.
> -
>    "nop"
>  
>  	This is the "trace nothing" tracer. To remove all
> @@ -261,70 +546,100 @@ Here is an example of the output format of the file "trace"
>                               --------
>  # tracer: function
>  #
> -#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
> -#              | |      |          |         |
> -            bash-4251  [01] 10152.583854: path_put <-path_walk
> -            bash-4251  [01] 10152.583855: dput <-path_put
> -            bash-4251  [01] 10152.583855: _atomic_dec_and_lock <-dput
> +# entries-in-buffer/entries-written: 140080/250280   #P:4
> +#
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
> +            bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
> +            bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
> +            bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
> +            sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
> +            bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
> +            bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
> +            bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
> +            bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
> +            bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
> +            sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
>                               --------
>  
>  A header is printed with the tracer name that is represented by
> -the trace. In this case the tracer is "function". Then a header
> -showing the format. Task name "bash", the task PID "4251", the
> -CPU that it was running on "01", the timestamp in <secs>.<usecs>
> -format, the function name that was traced "path_put" and the
> -parent function that called this function "path_walk". The
> -timestamp is the time at which the function was entered.
> +the trace. In this case the tracer is "function". Then it shows the
> +number of events in the buffer as well as the total number of entries
> +that were written. The difference is the number of entries that were
> +lost due to the buffer filling up (250280 - 140080 = 110200 events
> +lost).
> +
> +The header explains the content of the events. Task name "bash", the task
> +PID "1977", the CPU that it was running on "000", the latency format
> +(explained below), the timestamp in <secs>.<usecs> format, the
> +function name that was traced "sys_close" and the parent function that
> +called this function "system_call_fastpath". The timestamp is the time
> +at which the function was entered.
>  
>  Latency trace format
>  --------------------
>  
> -When the latency-format option is enabled, the trace file gives
> -somewhat more information to see why a latency happened.
> -Here is a typical trace.
> +When the latency-format option is enabled or when one of the latency
> +tracers is set, the trace file gives somewhat more information to see
> +why a latency happened. Here is a typical trace.
>  
>  # tracer: irqsoff
>  #
> -irqsoff latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: apic_timer_interrupt
> - => ended at:   do_softirq
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -  <idle>-0     0d..1    0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
> -  <idle>-0     0d.s.   97us : __do_softirq (do_softirq)
> -  <idle>-0     0d.s1   98us : trace_hardirqs_on (do_softirq)
> +# irqsoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: __lock_task_sighand
> +#  => ended at:   _raw_spin_unlock_irqrestore
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +      ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
> +      ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
> +      ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
> +      ps-6143    2d..1  306us : <stack trace>
> + => trace_hardirqs_on_caller
> + => trace_hardirqs_on
> + => _raw_spin_unlock_irqrestore
> + => do_task_stat
> + => proc_tgid_stat
> + => proc_single_show
> + => seq_read
> + => vfs_read
> + => sys_read
> + => system_call_fastpath
>  
> 
>  This shows that the current tracer is "irqsoff" tracing the time
> -for which interrupts were disabled. It gives the trace version
> -and the version of the kernel upon which this was executed on
> -(2.6.26-rc8). Then it displays the max latency in microsecs (97
> -us). The number of trace entries displayed and the total number
> -recorded (both are three: #3/3). The type of preemption that was
> -used (PREEMPT). VP, KP, SP, and HP are always zero and are
> -reserved for later use. #P is the number of online CPUS (#P:2).
> +for which interrupts were disabled. It gives the trace version (which
> +never changes) and the version of the kernel upon which this was executed on
> +(3.10). Then it displays the max latency in microseconds (259 us). The number
> +of trace entries displayed and the total number (both are four: #4/4).
> +VP, KP, SP, and HP are always zero and are reserved for later use.
> +#P is the number of online CPUs (#P:4).
>  
>  The task is the process that was running when the latency
> -occurred. (swapper pid: 0).
> +occurred. (ps pid: 6143).
>  
>  The start and stop (the functions in which the interrupts were
>  disabled and enabled respectively) that caused the latencies:
>  
> -  apic_timer_interrupt is where the interrupts were disabled.
> -  do_softirq is where they were enabled again.
> + __lock_task_sighand is where the interrupts were disabled.
> + _raw_spin_unlock_irqrestore is where they were enabled again.
>  
>  The next lines after the header are the trace itself. The header
>  explains which is which.
> @@ -367,16 +682,43 @@ The above is mostly meaningful for kernel developers.
>  
>    The rest is the same as the 'trace' file.
>  
> +  Note, the latency tracers will usually end with a back trace
> +  to easily find where the latency occurred.
>  
>  trace_options
>  -------------
>  
> -The trace_options file is used to control what gets printed in
> -the trace output. To see what is available, simply cat the file:
> +The trace_options file (or the options directory) is used to control
> +what gets printed in the trace output, or manipulate the tracers.
> +To see what is available, simply cat the file:
>  
>    cat trace_options
> -  print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
> -  noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
> +print-parent
> +nosym-offset
> +nosym-addr
> +noverbose
> +noraw
> +nohex
> +nobin
> +noblock
> +nostacktrace
> +trace_printk
> +noftrace_preempt
> +nobranch
> +annotate
> +nouserstacktrace
> +nosym-userobj
> +noprintk-msg-only
> +context-info
> +latency-format
> +sleep-time
> +graph-time
> +record-cmd
> +overwrite
> +nodisable_on_free
> +irq-info
> +markers
> +function-trace
>  
>  To disable one of the options, echo in the option prepended with
>  "no".
> @@ -428,13 +770,34 @@ Here are the available options:
>  
>    bin - This will print out the formats in raw binary.
>  
> -  block - TBD (needs update)
> +  block - When set, reading trace_pipe will not block when polled.
>  
>    stacktrace - This is one of the options that changes the trace
>  	       itself. When a trace is recorded, so is the stack
>  	       of functions. This allows for back traces of
>  	       trace sites.
>  
> +  trace_printk - Can disable trace_printk() from writing into the buffer.
> +
> +  branch - Enable branch tracing with the tracer.
> +
> +  annotate - It is sometimes confusing when the CPU buffers are full
> +  	     and one CPU buffer had a lot of events recently, thus
> +	     a shorter time frame, where another CPU may have only had
> +	     a few events, which lets it have older events. When
> +	     the trace is reported, it shows the oldest events first,
> +	     and it may look like only one CPU ran (the one with the
> +	     oldest events). When the annotate option is set, it will
> +	     display when a new CPU buffer started:
> +
> +          <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
> +          <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
> +          <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
> +##### CPU 2 buffer started ####
> +          <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
> +          <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
> +          <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
> +
>    userstacktrace - This option changes the trace. It records a
>  		   stacktrace of the current userspace thread.
>  
> @@ -451,9 +814,13 @@ Here are the available options:
>  		a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
>  x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
>  
> -  sched-tree - trace all tasks that are on the runqueue, at
> -	       every scheduling event. Will add overhead if
> -	       there's a lot of tasks running at once.
> +
> +  printk-msg-only - When set, trace_printk()s will only show the format
> +  		    and not their parameters (if trace_bprintk() or
> +		    trace_bputs() was used to save the trace_printk()).
> +
> +  context-info - Show only the event data. Hides the comm, PID,
> +  	         timestamp, CPU, and other useful data.
>  
>    latency-format - This option changes the trace. When
>                     it is enabled, the trace displays
> @@ -461,31 +828,61 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
>                     latencies, as described in "Latency
>                     trace format".
>  
> +  sleep-time - When running function graph tracer, to include
> +  	       the time a task schedules out in its function.
> +	       When enabled, it will account time the task has been
> +	       scheduled out as part of the function call.
> +
> +  graph-time - When running function graph tracer, to include the
> +  	       time to call nested functions. When this is not set,
> +	       the time reported for the function will only include
> +	       the time the function itself executed for, not the time
> +	       for functions that it called.
> +
> +  record-cmd - When any event or tracer is enabled, a hook is enabled
> +  	       in the sched_switch trace point to fill comm cache
> +	       with mapped pids and comms. But this may cause some
> +	       overhead, and if you only care about pids, and not the
> +	       name of the task, disabling this option can lower the
> +	       impact of tracing.
> +
>    overwrite - This controls what happens when the trace buffer is
>                full. If "1" (default), the oldest events are
>                discarded and overwritten. If "0", then the newest
>                events are discarded.
> +	        (see per_cpu/cpu0/stats for overrun and dropped)
>  
> -ftrace_enabled
> ---------------
> +  disable_on_free - When the free_buffer is closed, tracing will
> +  		    stop (tracing_on set to 0).
>  
> -The following tracers (listed below) give different output
> -depending on whether or not the sysctl ftrace_enabled is set. To
> -set ftrace_enabled, one can either use the sysctl function or
> -set it via the proc file system interface.
> +  irq-info - Shows the interrupt, preempt count, need resched data.
> +  	     When disabled, the trace looks like:
>  
> -  sysctl kernel.ftrace_enabled=1
> +# tracer: function
> +#
> +# entries-in-buffer/entries-written: 144405/9452052   #P:4
> +#
> +#           TASK-PID   CPU#      TIMESTAMP  FUNCTION
> +#              | |       |          |         |
> +          <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
> +          <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
> +          <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
>  
> - or
>  
> -  echo 1 > /proc/sys/kernel/ftrace_enabled
> +  markers - When set, the trace_marker is writable (only by root).
> +  	    When disabled, the trace_marker will error with EINVAL
> +	    on write.
> +
> +
> +  function-trace - The latency tracers will enable function tracing
> +  	    if this option is enabled (default it is). When
> +	    it is disabled, the latency tracers do not trace
> +	    functions. This keeps the overhead of the tracer down
> +	    when performing latency tests.
>  
> -To disable ftrace_enabled simply replace the '1' with '0' in the
> -above commands.
> + Note: Some tracers have their own options. They only appear
> +       when the tracer is active.
>  
> -When ftrace_enabled is set the tracers will also record the
> -functions that are within the trace. The descriptions of the
> -tracers will also show an example with ftrace enabled.
>  
> 
>  irqsoff
> @@ -506,95 +903,133 @@ new trace is saved.
>  To reset the maximum, echo 0 into tracing_max_latency. Here is
>  an example:
>  
> + # echo 0 > options/function-trace
>   # echo irqsoff > current_tracer
> - # echo latency-format > trace_options
> - # echo 0 > tracing_max_latency
>   # echo 1 > tracing_on
> + # echo 0 > tracing_max_latency
>   # ls -ltr
>   [...]
>   # echo 0 > tracing_on
>   # cat trace
>  # tracer: irqsoff
>  #
> -irqsoff latency trace v1.1.5 on 2.6.26
> ---------------------------------------------------------------------
> - latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: sys_setpgid
> - => ended at:   sys_setpgid
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -    bash-3730  1d...    0us : _write_lock_irq (sys_setpgid)
> -    bash-3730  1d..1    1us+: _write_unlock_irq (sys_setpgid)
> -    bash-3730  1d..2   14us : trace_hardirqs_on (sys_setpgid)
> -
> -
> -Here we see that that we had a latency of 12 microsecs (which is
> -very good). The _write_lock_irq in sys_setpgid disabled
> -interrupts. The difference between the 12 and the displayed
> -timestamp 14us occurred because the clock was incremented
> +# irqsoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: run_timer_softirq
> +#  => ended at:   run_timer_softirq
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +  <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
> +  <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
> +  <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
> +  <idle>-0       0dNs3   25us : <stack trace>
> + => _raw_spin_unlock_irq
> + => run_timer_softirq
> + => __do_softirq
> + => call_softirq
> + => do_softirq
> + => irq_exit
> + => smp_apic_timer_interrupt
> + => apic_timer_interrupt
> + => rcu_idle_exit
> + => cpu_idle
> + => rest_init
> + => start_kernel
> + => x86_64_start_reservations
> + => x86_64_start_kernel
> +
> +Here we see that we had a latency of 16 microseconds (which is
> +very good). The _raw_spin_lock_irq in run_timer_softirq disabled
> +interrupts. The difference between the 16 and the displayed
> +timestamp 25us occurred because the clock was incremented
>  between the time of recording the max latency and the time of
>  recording the function that had that latency.
>  
> -Note the above example had ftrace_enabled not set. If we set the
> -ftrace_enabled, we get a much larger output:
> +Note the above example had function-trace not set. If we set
> +function-trace, we get a much larger output:
> +
> + with echo 1 > options/function-trace
>  
>  # tracer: irqsoff
>  #
> -irqsoff latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: __alloc_pages_internal
> - => ended at:   __alloc_pages_internal
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -      ls-4339  0...1    0us+: get_page_from_freelist (__alloc_pages_internal)
> -      ls-4339  0d..1    3us : rmqueue_bulk (get_page_from_freelist)
> -      ls-4339  0d..1    3us : _spin_lock (rmqueue_bulk)
> -      ls-4339  0d..1    4us : add_preempt_count (_spin_lock)
> -      ls-4339  0d..2    4us : __rmqueue (rmqueue_bulk)
> -      ls-4339  0d..2    5us : __rmqueue_smallest (__rmqueue)
> -      ls-4339  0d..2    5us : __mod_zone_page_state (__rmqueue_smallest)
> -      ls-4339  0d..2    6us : __rmqueue (rmqueue_bulk)
> -      ls-4339  0d..2    6us : __rmqueue_smallest (__rmqueue)
> -      ls-4339  0d..2    7us : __mod_zone_page_state (__rmqueue_smallest)
> -      ls-4339  0d..2    7us : __rmqueue (rmqueue_bulk)
> -      ls-4339  0d..2    8us : __rmqueue_smallest (__rmqueue)
> +# irqsoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: ata_scsi_queuecmd
> +#  => ended at:   ata_scsi_queuecmd
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +    bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
> +    bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
> +    bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
> +    bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
> +    bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
> +    bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
> +    bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
> +    bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
> +    bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
>  [...]
> -      ls-4339  0d..2   46us : __rmqueue_smallest (__rmqueue)
> -      ls-4339  0d..2   47us : __mod_zone_page_state (__rmqueue_smallest)
> -      ls-4339  0d..2   47us : __rmqueue (rmqueue_bulk)
> -      ls-4339  0d..2   48us : __rmqueue_smallest (__rmqueue)
> -      ls-4339  0d..2   48us : __mod_zone_page_state (__rmqueue_smallest)
> -      ls-4339  0d..2   49us : _spin_unlock (rmqueue_bulk)
> -      ls-4339  0d..2   49us : sub_preempt_count (_spin_unlock)
> -      ls-4339  0d..1   50us : get_page_from_freelist (__alloc_pages_internal)
> -      ls-4339  0d..2   51us : trace_hardirqs_on (__alloc_pages_internal)
> -
> -
> -
> -Here we traced a 50 microsecond latency. But we also see all the
> +    bash-2042    3d..1   67us : delay_tsc <-__delay
> +    bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
> +    bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
> +    bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
> +    bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
> +    bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
> +    bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
> +    bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
> +    bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
> +    bash-2042    3d..1  120us : <stack trace>
> + => _raw_spin_unlock_irqrestore
> + => ata_scsi_queuecmd
> + => scsi_dispatch_cmd
> + => scsi_request_fn
> + => __blk_run_queue_uncond
> + => __blk_run_queue
> + => blk_queue_bio
> + => generic_make_request
> + => submit_bio
> + => submit_bh
> + => __ext3_get_inode_loc
> + => ext3_iget
> + => ext3_lookup
> + => lookup_real
> + => __lookup_hash
> + => walk_component
> + => lookup_last
> + => path_lookupat
> + => filename_lookup
> + => user_path_at_empty
> + => user_path_at
> + => vfs_fstatat
> + => vfs_stat
> + => sys_newstat
> + => system_call_fastpath
> +
> +
> +Here we traced a 71 microsecond latency. But we also see all the
>  functions that were called during that time. Note that by
>  enabling function tracing, we incur an added overhead. This
>  overhead may extend the latency times. But nevertheless, this
> @@ -614,120 +1049,122 @@ Like the irqsoff tracer, it records the maximum latency for
>  which preemption was disabled. The control of preemptoff tracer
>  is much like the irqsoff tracer.
>  
> + # echo 0 > options/function-trace
>   # echo preemptoff > current_tracer
> - # echo latency-format > trace_options
> - # echo 0 > tracing_max_latency
>   # echo 1 > tracing_on
> + # echo 0 > tracing_max_latency
>   # ls -ltr
>   [...]
>   # echo 0 > tracing_on
>   # cat trace
>  # tracer: preemptoff
>  #
> -preemptoff latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: do_IRQ
> - => ended at:   __do_softirq
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -    sshd-4261  0d.h.    0us+: irq_enter (do_IRQ)
> -    sshd-4261  0d.s.   29us : _local_bh_enable (__do_softirq)
> -    sshd-4261  0d.s1   30us : trace_preempt_on (__do_softirq)
> +# preemptoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: do_IRQ
> +#  => ended at:   do_IRQ
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +    sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
> +    sshd-1991    1d..1   46us : irq_exit <-do_IRQ
> +    sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
> +    sshd-1991    1d..1   52us : <stack trace>
> + => sub_preempt_count
> + => irq_exit
> + => do_IRQ
> + => ret_from_intr
>  
> 
>  This has some more changes. Preemption was disabled when an
> -interrupt came in (notice the 'h'), and was enabled while doing
> -a softirq. (notice the 's'). But we also see that interrupts
> -have been disabled when entering the preempt off section and
> -leaving it (the 'd'). We do not know if interrupts were enabled
> -in the mean time.
> +interrupt came in (notice the 'h'), and was enabled on exit.
> +But we also see that interrupts have been disabled when entering
> +the preempt off section and leaving it (the 'd'). We do not know if
> +interrupts were enabled in the mean time or shortly after this
> +was over.
>  
>  # tracer: preemptoff
>  #
> -preemptoff latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: remove_wait_queue
> - => ended at:   __do_softirq
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -    sshd-4261  0d..1    0us : _spin_lock_irqsave (remove_wait_queue)
> -    sshd-4261  0d..1    1us : _spin_unlock_irqrestore (remove_wait_queue)
> -    sshd-4261  0d..1    2us : do_IRQ (common_interrupt)
> -    sshd-4261  0d..1    2us : irq_enter (do_IRQ)
> -    sshd-4261  0d..1    2us : idle_cpu (irq_enter)
> -    sshd-4261  0d..1    3us : add_preempt_count (irq_enter)
> -    sshd-4261  0d.h1    3us : idle_cpu (irq_enter)
> -    sshd-4261  0d.h.    4us : handle_fasteoi_irq (do_IRQ)
> +# preemptoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: wake_up_new_task
> +#  => ended at:   task_rq_unlock
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +    bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
> +    bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
> +    bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
> +    bash-1994    1d..1    1us : source_load <-select_task_rq_fair
> +    bash-1994    1d..1    1us : source_load <-select_task_rq_fair
>  [...]
> -    sshd-4261  0d.h.   12us : add_preempt_count (_spin_lock)
> -    sshd-4261  0d.h1   12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
> -    sshd-4261  0d.h1   13us : move_native_irq (ack_ioapic_quirk_irq)
> -    sshd-4261  0d.h1   13us : _spin_unlock (handle_fasteoi_irq)
> -    sshd-4261  0d.h1   14us : sub_preempt_count (_spin_unlock)
> -    sshd-4261  0d.h1   14us : irq_exit (do_IRQ)
> -    sshd-4261  0d.h1   15us : sub_preempt_count (irq_exit)
> -    sshd-4261  0d..2   15us : do_softirq (irq_exit)
> -    sshd-4261  0d...   15us : __do_softirq (do_softirq)
> -    sshd-4261  0d...   16us : __local_bh_disable (__do_softirq)
> -    sshd-4261  0d...   16us+: add_preempt_count (__local_bh_disable)
> -    sshd-4261  0d.s4   20us : add_preempt_count (__local_bh_disable)
> -    sshd-4261  0d.s4   21us : sub_preempt_count (local_bh_enable)
> -    sshd-4261  0d.s5   21us : sub_preempt_count (local_bh_enable)
> +    bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
> +    bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
> +    bash-1994    1d..1   13us : add_preempt_count <-irq_enter
> +    bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
> +    bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
> +    bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
> +    bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
> +    bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
>  [...]
> -    sshd-4261  0d.s6   41us : add_preempt_count (__local_bh_disable)
> -    sshd-4261  0d.s6   42us : sub_preempt_count (local_bh_enable)
> -    sshd-4261  0d.s7   42us : sub_preempt_count (local_bh_enable)
> -    sshd-4261  0d.s5   43us : add_preempt_count (__local_bh_disable)
> -    sshd-4261  0d.s5   43us : sub_preempt_count (local_bh_enable_ip)
> -    sshd-4261  0d.s6   44us : sub_preempt_count (local_bh_enable_ip)
> -    sshd-4261  0d.s5   44us : add_preempt_count (__local_bh_disable)
> -    sshd-4261  0d.s5   45us : sub_preempt_count (local_bh_enable)
> +    bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
> +    bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
> +    bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
> +    bash-1994    1d..2   36us : do_softirq <-irq_exit
> +    bash-1994    1d..2   36us : __do_softirq <-call_softirq
> +    bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
> +    bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
> +    bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
> +    bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
> +    bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
>  [...]
> -    sshd-4261  0d.s.   63us : _local_bh_enable (__do_softirq)
> -    sshd-4261  0d.s1   64us : trace_preempt_on (__do_softirq)
> +    bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
> +    bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
> +    bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
> +    bash-1994    1dN.2   82us : idle_cpu <-irq_exit
> +    bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
> +    bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
> +    bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
> +    bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
> +    bash-1994    1.N.1  104us : <stack trace>
> + => sub_preempt_count
> + => _raw_spin_unlock_irqrestore
> + => task_rq_unlock
> + => wake_up_new_task
> + => do_fork
> + => sys_clone
> + => stub_clone
>  
> 
>  The above is an example of the preemptoff trace with
> -ftrace_enabled set. Here we see that interrupts were disabled
> +function-trace set. Here we see that interrupts were not disabled
>  the entire time. The irq_enter code lets us know that we entered
>  an interrupt 'h'. Before that, the functions being traced still
>  show that it is not in an interrupt, but we can see from the
>  functions themselves that this is not the case.
>  
> -Notice that __do_softirq when called does not have a
> -preempt_count. It may seem that we missed a preempt enabling.
> -What really happened is that the preempt count is held on the
> -thread's stack and we switched to the softirq stack (4K stacks
> -in effect). The code does not copy the preempt count, but
> -because interrupts are disabled, we do not need to worry about
> -it. Having a tracer like this is good for letting people know
> -what really happens inside the kernel.
> -
> -
>  preemptirqsoff
>  --------------
>  
> @@ -762,38 +1199,57 @@ tracer.
>  Again, using this trace is much like the irqsoff and preemptoff
>  tracers.
>  
> + # echo 0 > options/function-trace
>   # echo preemptirqsoff > current_tracer
> - # echo latency-format > trace_options
> - # echo 0 > tracing_max_latency
>   # echo 1 > tracing_on
> + # echo 0 > tracing_max_latency
>   # ls -ltr
>   [...]
>   # echo 0 > tracing_on
>   # cat trace
>  # tracer: preemptirqsoff
>  #
> -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: apic_timer_interrupt
> - => ended at:   __do_softirq
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -      ls-4860  0d...    0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
> -      ls-4860  0d.s.  294us : _local_bh_enable (__do_softirq)
> -      ls-4860  0d.s1  294us : trace_preempt_on (__do_softirq)
> -
> +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: ata_scsi_queuecmd
> +#  => ended at:   ata_scsi_queuecmd
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +      ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
> +      ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
> +      ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
> +      ls-2230    3...1  111us : <stack trace>
> + => sub_preempt_count
> + => _raw_spin_unlock_irqrestore
> + => ata_scsi_queuecmd
> + => scsi_dispatch_cmd
> + => scsi_request_fn
> + => __blk_run_queue_uncond
> + => __blk_run_queue
> + => blk_queue_bio
> + => generic_make_request
> + => submit_bio
> + => submit_bh
> + => ext3_bread
> + => ext3_dir_bread
> + => htree_dirblock_to_tree
> + => ext3_htree_fill_tree
> + => ext3_readdir
> + => vfs_readdir
> + => sys_getdents
> + => system_call_fastpath
>  
> 
>  The trace_hardirqs_off_thunk is called from assembly on x86 when
> @@ -802,105 +1258,158 @@ function tracing, we do not know if interrupts were enabled
>  within the preemption points. We do see that it started with
>  preemption enabled.
>  
> -Here is a trace with ftrace_enabled set:
> -
> +Here is a trace with function-trace set:
>  
>  # tracer: preemptirqsoff
>  #
> -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
> -    -----------------
> - => started at: write_chan
> - => ended at:   __do_softirq
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -      ls-4473  0.N..    0us : preempt_schedule (write_chan)
> -      ls-4473  0dN.1    1us : _spin_lock (schedule)
> -      ls-4473  0dN.1    2us : add_preempt_count (_spin_lock)
> -      ls-4473  0d..2    2us : put_prev_task_fair (schedule)
> -[...]
> -      ls-4473  0d..2   13us : set_normalized_timespec (ktime_get_ts)
> -      ls-4473  0d..2   13us : __switch_to (schedule)
> -    sshd-4261  0d..2   14us : finish_task_switch (schedule)
> -    sshd-4261  0d..2   14us : _spin_unlock_irq (finish_task_switch)
> -    sshd-4261  0d..1   15us : add_preempt_count (_spin_lock_irqsave)
> -    sshd-4261  0d..2   16us : _spin_unlock_irqrestore (hrtick_set)
> -    sshd-4261  0d..2   16us : do_IRQ (common_interrupt)
> -    sshd-4261  0d..2   17us : irq_enter (do_IRQ)
> -    sshd-4261  0d..2   17us : idle_cpu (irq_enter)
> -    sshd-4261  0d..2   18us : add_preempt_count (irq_enter)
> -    sshd-4261  0d.h2   18us : idle_cpu (irq_enter)
> -    sshd-4261  0d.h.   18us : handle_fasteoi_irq (do_IRQ)
> -    sshd-4261  0d.h.   19us : _spin_lock (handle_fasteoi_irq)
> -    sshd-4261  0d.h.   19us : add_preempt_count (_spin_lock)
> -    sshd-4261  0d.h1   20us : _spin_unlock (handle_fasteoi_irq)
> -    sshd-4261  0d.h1   20us : sub_preempt_count (_spin_unlock)
> -[...]
> -    sshd-4261  0d.h1   28us : _spin_unlock (handle_fasteoi_irq)
> -    sshd-4261  0d.h1   29us : sub_preempt_count (_spin_unlock)
> -    sshd-4261  0d.h2   29us : irq_exit (do_IRQ)
> -    sshd-4261  0d.h2   29us : sub_preempt_count (irq_exit)
> -    sshd-4261  0d..3   30us : do_softirq (irq_exit)
> -    sshd-4261  0d...   30us : __do_softirq (do_softirq)
> -    sshd-4261  0d...   31us : __local_bh_disable (__do_softirq)
> -    sshd-4261  0d...   31us+: add_preempt_count (__local_bh_disable)
> -    sshd-4261  0d.s4   34us : add_preempt_count (__local_bh_disable)
> +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
> +#    -----------------
> +#  => started at: schedule
> +#  => ended at:   mutex_unlock
> +#
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +kworker/-59      3...1    0us : __schedule <-schedule
> +kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
> +kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
> +kworker/-59      3d..2    1us : deactivate_task <-__schedule
> +kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
> +kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
> +kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
> +kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
> +kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
> +kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
> +kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
> +kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
> +kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
> +kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
> +kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
> +kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
> +kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
> +kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
> +kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
> +kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
> +kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
> +kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
> +kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
> +kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
> +kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
> +      ls-2269    3d..2    7us : finish_task_switch <-__schedule
> +      ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
> +      ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
> +      ls-2269    3d..2    8us : irq_enter <-do_IRQ
> +      ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
> +      ls-2269    3d..2    9us : add_preempt_count <-irq_enter
> +      ls-2269    3d.h2    9us : exit_idle <-do_IRQ
>  [...]
> -    sshd-4261  0d.s3   43us : sub_preempt_count (local_bh_enable_ip)
> -    sshd-4261  0d.s4   44us : sub_preempt_count (local_bh_enable_ip)
> -    sshd-4261  0d.s3   44us : smp_apic_timer_interrupt (apic_timer_interrupt)
> -    sshd-4261  0d.s3   45us : irq_enter (smp_apic_timer_interrupt)
> -    sshd-4261  0d.s3   45us : idle_cpu (irq_enter)
> -    sshd-4261  0d.s3   46us : add_preempt_count (irq_enter)
> -    sshd-4261  0d.H3   46us : idle_cpu (irq_enter)
> -    sshd-4261  0d.H3   47us : hrtimer_interrupt (smp_apic_timer_interrupt)
> -    sshd-4261  0d.H3   47us : ktime_get (hrtimer_interrupt)
> +      ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
> +      ls-2269    3d.h2   20us : irq_exit <-do_IRQ
> +      ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
> +      ls-2269    3d..3   21us : do_softirq <-irq_exit
> +      ls-2269    3d..3   21us : __do_softirq <-call_softirq
> +      ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
> +      ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
> +      ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
> +      ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
> +      ls-2269    3d.s5   31us : irq_enter <-do_IRQ
> +      ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
>  [...]
> -    sshd-4261  0d.H3   81us : tick_program_event (hrtimer_interrupt)
> -    sshd-4261  0d.H3   82us : ktime_get (tick_program_event)
> -    sshd-4261  0d.H3   82us : ktime_get_ts (ktime_get)
> -    sshd-4261  0d.H3   83us : getnstimeofday (ktime_get_ts)
> -    sshd-4261  0d.H3   83us : set_normalized_timespec (ktime_get_ts)
> -    sshd-4261  0d.H3   84us : clockevents_program_event (tick_program_event)
> -    sshd-4261  0d.H3   84us : lapic_next_event (clockevents_program_event)
> -    sshd-4261  0d.H3   85us : irq_exit (smp_apic_timer_interrupt)
> -    sshd-4261  0d.H3   85us : sub_preempt_count (irq_exit)
> -    sshd-4261  0d.s4   86us : sub_preempt_count (irq_exit)
> -    sshd-4261  0d.s3   86us : add_preempt_count (__local_bh_disable)
> +      ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
> +      ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
> +      ls-2269    3d.H5   32us : exit_idle <-do_IRQ
> +      ls-2269    3d.H5   32us : handle_irq <-do_IRQ
> +      ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
> +      ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
>  [...]
> -    sshd-4261  0d.s1   98us : sub_preempt_count (net_rx_action)
> -    sshd-4261  0d.s.   99us : add_preempt_count (_spin_lock_irq)
> -    sshd-4261  0d.s1   99us+: _spin_unlock_irq (run_timer_softirq)
> -    sshd-4261  0d.s.  104us : _local_bh_enable (__do_softirq)
> -    sshd-4261  0d.s.  104us : sub_preempt_count (_local_bh_enable)
> -    sshd-4261  0d.s.  105us : _local_bh_enable (__do_softirq)
> -    sshd-4261  0d.s1  105us : trace_preempt_on (__do_softirq)
> -
> -
> -This is a very interesting trace. It started with the preemption
> -of the ls task. We see that the task had the "need_resched" bit
> -set via the 'N' in the trace.  Interrupts were disabled before
> -the spin_lock at the beginning of the trace. We see that a
> -schedule took place to run sshd.  When the interrupts were
> -enabled, we took an interrupt. On return from the interrupt
> -handler, the softirq ran. We took another interrupt while
> -running the softirq as we see from the capital 'H'.
> +      ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
> +      ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
> +      ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
> +      ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
> +      ls-2269    3d..3  159us : idle_cpu <-irq_exit
> +      ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
> +      ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
> +      ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
> +      ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
> +      ls-2269    3d...  186us : <stack trace>
> + => __mutex_unlock_slowpath
> + => mutex_unlock
> + => process_output
> + => n_tty_write
> + => tty_write
> + => vfs_write
> + => sys_write
> + => system_call_fastpath
> +
> +This is an interesting trace. It started with the kworker running
> +and scheduling out, with ls taking over. But as soon as ls released
> +the rq lock and enabled interrupts (but not preemption), an interrupt
> +triggered. When the interrupt finished, it started running softirqs.
> +But while the softirq was running, another interrupt triggered.
> +When an interrupt is running inside a softirq, the annotation is 'H'.
>  
> 
>  wakeup
>  ------
>  
> +One common case that people are interested in tracing is the
> +time it takes for a task that is woken to actually wake up.
> +For non Real-Time tasks, this can be arbitrary. But tracing
> +it nonetheless can be interesting.
> +
> +Without function tracing:
> +
> + # echo 0 > options/function-trace
> + # echo wakeup > current_tracer
> + # echo 1 > tracing_on
> + # echo 0 > tracing_max_latency
> + # chrt -f 5 sleep 1
> + # echo 0 > tracing_on
> + # cat trace
> +# tracer: wakeup
> +#
> +# wakeup latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
> +#    -----------------
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +  <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
> +  <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
> +  <idle>-0       3d..3   15us : __schedule <-schedule
> +  <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
> +
> +The tracer only traces the highest priority task in the system
> +to avoid recording all of the mundane, uninteresting wakeups.
> +Here we see that the kworker with a nice value of -20 (not very
> +nice) took just 15 microseconds from the time it woke up to the
> +time it ran.
> +
> +Non Real-Time tasks are not that interesting. A more interesting
> +trace is to concentrate only on Real-Time tasks.
> +
> +wakeup_rt
> +---------
> +
>  In a Real-Time environment it is very important to know the
>  wakeup time it takes for the highest priority task that is woken
>  up to the time that it executes. This is also known as "schedule
> @@ -914,124 +1423,229 @@ Real-Time environments are interested in the worst case latency.
>  That is the longest latency it takes for something to happen,
>  and not the average. We can have a very fast scheduler that may
>  only have a large latency once in a while, but that would not
> -work well with Real-Time tasks.  The wakeup tracer was designed
> +work well with Real-Time tasks.  The wakeup_rt tracer was designed
>  to record the worst case wakeups of RT tasks. Non-RT tasks are
>  not recorded because the tracer only records one worst case and
>  tracing non-RT tasks that are unpredictable will overwrite the
> -worst case latency of RT tasks.
> +worst case latency of RT tasks (just run the normal wakeup
> +tracer for a while to see that effect).
>  
>  Since this tracer only deals with RT tasks, we will run this
>  slightly differently than we did with the previous tracers.
>  Instead of performing an 'ls', we will run 'sleep 1' under
>  'chrt' which changes the priority of the task.
>  
> - # echo wakeup > current_tracer
> - # echo latency-format > trace_options
> - # echo 0 > tracing_max_latency
> + # echo 0 > options/function-trace
> + # echo wakeup_rt > current_tracer
>   # echo 1 > tracing_on
> + # echo 0 > tracing_max_latency
>   # chrt -f 5 sleep 1
>   # echo 0 > tracing_on
>   # cat trace
>  # tracer: wakeup
>  #
> -wakeup latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
> -    -----------------
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -  <idle>-0     1d.h4    0us+: try_to_wake_up (wake_up_process)
> -  <idle>-0     1d..4    4us : schedule (cpu_idle)
> -
> -
> -Running this on an idle system, we see that it only took 4
> -microseconds to perform the task switch.  Note, since the trace
> -marker in the schedule is before the actual "switch", we stop
> -the tracing when the recorded task is about to schedule in. This
> -may change if we add a new marker at the end of the scheduler.
> -
> -Notice that the recorded task is 'sleep' with the PID of 4901
> +# tracer: wakeup_rt
> +#
> +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
> +#    -----------------
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +  <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
> +  <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
> +  <idle>-0       3d..3    5us : __schedule <-schedule
> +  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
> +
> +
> +Running this on an idle system, we see that it only took 5 microseconds
> +to perform the task switch.  Note, since the trace point in the schedule
> +is before the actual "switch", we stop the tracing when the recorded task
> +is about to schedule in. This may change if we add a new marker at the
> +end of the scheduler.
> +
> +Notice that the recorded task is 'sleep' with the PID of 2389
>  and it has an rt_prio of 5. This priority is user-space priority
>  and not the internal kernel priority. The policy is 1 for
>  SCHED_FIFO and 2 for SCHED_RR.
>  
> -Doing the same with chrt -r 5 and ftrace_enabled set.
> +Note that the trace data shows the internal priority (99 - rtprio).
>  
> -# tracer: wakeup
> +  <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
> +
> +The 0:120:R means idle was running with an internal priority of 120
> +(a nice value of 0, as the internal priority of a normal task is
> +120 plus its nice value) and in the running state 'R'. The sleep task
> +was scheduled in as 2389: 94:R. That is, its priority is the internal
> +kernel rtprio (99 - 5 = 94) and it too is in the running state.
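> +
> +(The same mapping shows up in the earlier wakeup example: the kworker
> +with a nice value of -20 was reported as 312:100:R, since
> +120 + (-20) = 100.)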
> +
> +Doing the same with chrt -f 5 and function-trace set.
> +
> +  echo 1 > options/function-trace
> +
> +# tracer: wakeup_rt
>  #
> -wakeup latency trace v1.1.5 on 2.6.26-rc8
> ---------------------------------------------------------------------
> - latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
> -    -----------------
> -    | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
> -    -----------------
> -
> -#                _------=> CPU#
> -#               / _-----=> irqs-off
> -#              | / _----=> need-resched
> -#              || / _---=> hardirq/softirq
> -#              ||| / _--=> preempt-depth
> -#              |||| /
> -#              |||||     delay
> -#  cmd     pid ||||| time  |   caller
> -#     \   /    |||||   \   |   /
> -ksoftirq-7     1d.H3    0us : try_to_wake_up (wake_up_process)
> -ksoftirq-7     1d.H4    1us : sub_preempt_count (marker_probe_cb)
> -ksoftirq-7     1d.H3    2us : check_preempt_wakeup (try_to_wake_up)
> -ksoftirq-7     1d.H3    3us : update_curr (check_preempt_wakeup)
> -ksoftirq-7     1d.H3    4us : calc_delta_mine (update_curr)
> -ksoftirq-7     1d.H3    5us : __resched_task (check_preempt_wakeup)
> -ksoftirq-7     1d.H3    6us : task_wake_up_rt (try_to_wake_up)
> -ksoftirq-7     1d.H3    7us : _spin_unlock_irqrestore (try_to_wake_up)
> -[...]
> -ksoftirq-7     1d.H2   17us : irq_exit (smp_apic_timer_interrupt)
> -ksoftirq-7     1d.H2   18us : sub_preempt_count (irq_exit)
> -ksoftirq-7     1d.s3   19us : sub_preempt_count (irq_exit)
> -ksoftirq-7     1..s2   20us : rcu_process_callbacks (__do_softirq)
> -[...]
> -ksoftirq-7     1..s2   26us : __rcu_process_callbacks (rcu_process_callbacks)
> -ksoftirq-7     1d.s2   27us : _local_bh_enable (__do_softirq)
> -ksoftirq-7     1d.s2   28us : sub_preempt_count (_local_bh_enable)
> -ksoftirq-7     1.N.3   29us : sub_preempt_count (ksoftirqd)
> -ksoftirq-7     1.N.2   30us : _cond_resched (ksoftirqd)
> -ksoftirq-7     1.N.2   31us : __cond_resched (_cond_resched)
> -ksoftirq-7     1.N.2   32us : add_preempt_count (__cond_resched)
> -ksoftirq-7     1.N.2   33us : schedule (__cond_resched)
> -ksoftirq-7     1.N.2   33us : add_preempt_count (schedule)
> -ksoftirq-7     1.N.3   34us : hrtick_clear (schedule)
> -ksoftirq-7     1dN.3   35us : _spin_lock (schedule)
> -ksoftirq-7     1dN.3   36us : add_preempt_count (_spin_lock)
> -ksoftirq-7     1d..4   37us : put_prev_task_fair (schedule)
> -ksoftirq-7     1d..4   38us : update_curr (put_prev_task_fair)
> -[...]
> -ksoftirq-7     1d..5   47us : _spin_trylock (tracing_record_cmdline)
> -ksoftirq-7     1d..5   48us : add_preempt_count (_spin_trylock)
> -ksoftirq-7     1d..6   49us : _spin_unlock (tracing_record_cmdline)
> -ksoftirq-7     1d..6   49us : sub_preempt_count (_spin_unlock)
> -ksoftirq-7     1d..4   50us : schedule (__cond_resched)
> -
> -The interrupt went off while running ksoftirqd. This task runs
> -at SCHED_OTHER. Why did not we see the 'N' set early? This may
> -be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
> -stacks configured, the interrupt and softirq run with their own
> -stack. Some information is held on the top of the task's stack
> -(need_resched and preempt_count are both stored there). The
> -setting of the NEED_RESCHED bit is done directly to the task's
> -stack, but the reading of the NEED_RESCHED is done by looking at
> -the current stack, which in this case is the stack for the hard
> -interrupt. This hides the fact that NEED_RESCHED has been set.
> -We do not see the 'N' until we switch back to the task's
> -assigned stack.
> +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
> +#    -----------------
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +  <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
> +  <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
> +  <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
> +  <idle>-0       3d.h3    3us : resched_task <-check_preempt_curr
> +  <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
> +  <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
> +  <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
> +  <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
> +  <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
> +  <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
> +  <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
> +  <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
> +  <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
> +  <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
> +  <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
> +  <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
> +  <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
> +  <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
> +  <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
> +  <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
> +  <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
> +  <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
> +  <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
> +  <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
> +  <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
> +  <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
> +  <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
> +  <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   13us : update_cpu_load_nohz <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   13us : _raw_spin_lock <-update_cpu_load_nohz
> +  <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
> +  <idle>-0       3dN.2   13us : __update_cpu_load <-update_cpu_load_nohz
> +  <idle>-0       3dN.2   14us : sched_avg_update <-__update_cpu_load
> +  <idle>-0       3dN.2   14us : _raw_spin_unlock <-update_cpu_load_nohz
> +  <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
> +  <idle>-0       3dN.1   15us : calc_load_exit_idle <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
> +  <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
> +  <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
> +  <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
> +  <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
> +  <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
> +  <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
> +  <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
> +  <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
> +  <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
> +  <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
> +  <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
> +  <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
> +  <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
> +  <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
> +  <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
> +  <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
> +  <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
> +  <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
> +  <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
> +  <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
> +  <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
> +  <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
> +  <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
> +  <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
> +  <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
> +  <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
> +  <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
> +  <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
> +  <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
> +  <idle>-0       3.N..   25us : schedule <-cpu_idle
> +  <idle>-0       3.N..   25us : __schedule <-preempt_schedule
> +  <idle>-0       3.N..   26us : add_preempt_count <-__schedule
> +  <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
> +  <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
> +  <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
> +  <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
> +  <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
> +  <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
> +  <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
> +  <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
> +  <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
> +  <idle>-0       3d..3   29us : __schedule <-preempt_schedule
> +  <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
> +
> +This isn't that big of a trace, even with function tracing enabled,
> +so I included the entire trace.
> +
> +The interrupt went off while the system was idle. Somewhere
> +before task_woken_rt() was called, the NEED_RESCHED flag was set;
> +this is indicated by the first occurrence of the 'N' flag.
> +
> +Latency tracing and events
> +--------------------------
> +
> +Function tracing can induce a much larger latency, but without
> +seeing what happens within that latency it is hard to know what
> +caused it. There is a middle ground, and that is with enabling
> +events.
> +
> + # echo 0 > options/function-trace
> + # echo wakeup_rt > current_tracer
> + # echo 1 > events/enable
> + # echo 1 > tracing_on
> + # echo 0 > tracing_max_latency
> + # chrt -f 5 sleep 1
> + # echo 0 > tracing_on
> + # cat trace
> +# tracer: wakeup_rt
> +#
> +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
> +# --------------------------------------------------------------------
> +# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
> +#    -----------------
> +#    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
> +#    -----------------
> +#
> +#                  _------=> CPU#            
> +#                 / _-----=> irqs-off        
> +#                | / _----=> need-resched    
> +#                || / _---=> hardirq/softirq 
> +#                ||| / _--=> preempt-depth   
> +#                |||| /     delay             
> +#  cmd     pid   ||||| time  |   caller      
> +#     \   /      |||||  \    |   /           
> +  <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
> +  <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
> +  <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
> +  <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
> +  <idle>-0       2.N.2    2us : power_end: cpu_id=2
> +  <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
> +  <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
> +  <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
> +  <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
> +  <idle>-0       2.N.2    5us : rcu_utilization: End context switch
> +  <idle>-0       2d..3    6us : __schedule <-schedule
> +  <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
> +
>  
>  function
>  --------
> @@ -1039,6 +1653,7 @@ function
>  This tracer is the function tracer. Enabling the function tracer
>  can be done from the debug file system. Make sure the
>  ftrace_enabled is set; otherwise this tracer is a nop.
> +See the "ftrace_enabled" section below.
>  
>   # sysctl kernel.ftrace_enabled=1
>   # echo function > current_tracer
> @@ -1048,23 +1663,23 @@ ftrace_enabled is set; otherwise this tracer is a nop.
>   # cat trace
>  # tracer: function
>  #
> -#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
> -#              | |      |          |         |
> -            bash-4003  [00]   123.638713: finish_task_switch <-schedule
> -            bash-4003  [00]   123.638714: _spin_unlock_irq <-finish_task_switch
> -            bash-4003  [00]   123.638714: sub_preempt_count <-_spin_unlock_irq
> -            bash-4003  [00]   123.638715: hrtick_set <-schedule
> -            bash-4003  [00]   123.638715: _spin_lock_irqsave <-hrtick_set
> -            bash-4003  [00]   123.638716: add_preempt_count <-_spin_lock_irqsave
> -            bash-4003  [00]   123.638716: _spin_unlock_irqrestore <-hrtick_set
> -            bash-4003  [00]   123.638717: sub_preempt_count <-_spin_unlock_irqrestore
> -            bash-4003  [00]   123.638717: hrtick_clear <-hrtick_set
> -            bash-4003  [00]   123.638718: sub_preempt_count <-schedule
> -            bash-4003  [00]   123.638718: sub_preempt_count <-preempt_schedule
> -            bash-4003  [00]   123.638719: wait_for_completion <-__stop_machine_run
> -            bash-4003  [00]   123.638719: wait_for_common <-wait_for_completion
> -            bash-4003  [00]   123.638720: _spin_lock_irq <-wait_for_common
> -            bash-4003  [00]   123.638720: add_preempt_count <-_spin_lock_irq
> +# entries-in-buffer/entries-written: 24799/24799   #P:4
> +#
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
> +            bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
> +            bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
> +            bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
> +            bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
> +            bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
> +            bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
> +            bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
> +            bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
>  [...]
>  
> 
> @@ -1214,79 +1829,19 @@ int main (int argc, char **argv)
>          return 0;
>  }
>  
> +Or this simple script!
>  
> -hw-branch-tracer (x86 only)
> ----------------------------
> -
> -This tracer uses the x86 last branch tracing hardware feature to
> -collect a branch trace on all cpus with relatively low overhead.
> -
> -The tracer uses a fixed-size circular buffer per cpu and only
> -traces ring 0 branches. The trace file dumps that buffer in the
> -following format:
> -
> -# tracer: hw-branch-tracer
> -#
> -# CPU#        TO  <-  FROM
> -   0  scheduler_tick+0xb5/0x1bf	  <-  task_tick_idle+0x5/0x6
> -   2  run_posix_cpu_timers+0x2b/0x72a	  <-  run_posix_cpu_timers+0x25/0x72a
> -   0  scheduler_tick+0x139/0x1bf	  <-  scheduler_tick+0xed/0x1bf
> -   0  scheduler_tick+0x17c/0x1bf	  <-  scheduler_tick+0x148/0x1bf
> -   2  run_posix_cpu_timers+0x9e/0x72a	  <-  run_posix_cpu_timers+0x5e/0x72a
> -   0  scheduler_tick+0x1b6/0x1bf	  <-  scheduler_tick+0x1aa/0x1bf
> -
> -
> -The tracer may be used to dump the trace for the oops'ing cpu on
> -a kernel oops into the system log. To enable this,
> -ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
> -can either use the sysctl function or set it via the proc system
> -interface.
> -
> -  sysctl kernel.ftrace_dump_on_oops=n
> -
> -or
> -
> -  echo n > /proc/sys/kernel/ftrace_dump_on_oops
> -
> -If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
> -only dump the buffer of the CPU that triggered the oops.
> -
> -Here's an example of such a dump after a null pointer
> -dereference in a kernel module:
> -
> -[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> -[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
> -[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
> -[57848.106019] Oops: 0002 [#1] SMP
> -[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
> -[57848.106019] Dumping ftrace buffer:
> -[57848.106019] ---------------------------------
> -[...]
> -[57848.106019]    0  chrdev_open+0xe6/0x165	  <-  cdev_put+0x23/0x24
> -[57848.106019]    0  chrdev_open+0x117/0x165	  <-  chrdev_open+0xfa/0x165
> -[57848.106019]    0  chrdev_open+0x120/0x165	  <-  chrdev_open+0x11c/0x165
> -[57848.106019]    0  chrdev_open+0x134/0x165	  <-  chrdev_open+0x12b/0x165
> -[57848.106019]    0  open+0x0/0x14 [oops]	  <-  chrdev_open+0x144/0x165
> -[57848.106019]    0  page_fault+0x0/0x30	  <-  open+0x6/0x14 [oops]
> -[57848.106019]    0  error_entry+0x0/0x5b	  <-  page_fault+0x4/0x30
> -[57848.106019]    0  error_kernelspace+0x0/0x31	  <-  error_entry+0x59/0x5b
> -[57848.106019]    0  error_sti+0x0/0x1	  <-  error_kernelspace+0x2d/0x31
> -[57848.106019]    0  page_fault+0x9/0x30	  <-  error_sti+0x0/0x1
> -[57848.106019]    0  do_page_fault+0x0/0x881	  <-  page_fault+0x1a/0x30
> -[...]
> -[57848.106019]    0  do_page_fault+0x66b/0x881	  <-  is_prefetch+0x1ee/0x1f2
> -[57848.106019]    0  do_page_fault+0x6e0/0x881	  <-  do_page_fault+0x67a/0x881
> -[57848.106019]    0  oops_begin+0x0/0x96	  <-  do_page_fault+0x6e0/0x881
> -[57848.106019]    0  trace_hw_branch_oops+0x0/0x2d	  <-  oops_begin+0x9/0x96
> -[...]
> -[57848.106019]    0  ds_suspend_bts+0x2a/0xe3	  <-  ds_suspend_bts+0x1a/0xe3
> -[57848.106019] ---------------------------------
> -[57848.106019] CPU 0
> -[57848.106019] Modules linked in: oops
> -[57848.106019] Pid: 5542, comm: cat Tainted: G        W  2.6.28 #23
> -[57848.106019] RIP: 0010:[<ffffffffa0000006>]  [<ffffffffa0000006>] open+0x6/0x14 [oops]
> -[57848.106019] RSP: 0018:ffff880235457d48  EFLAGS: 00010246
> -[...]
> +------
> +#!/bin/bash
> +
> +# Find where debugfs is mounted.
> +debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts`
> +echo nop > $debugfs/tracing/current_tracer
> +echo 0 > $debugfs/tracing/tracing_on
> +# Trace only this PID; the exec below keeps the same PID for the command.
> +echo $$ > $debugfs/tracing/set_ftrace_pid
> +echo function > $debugfs/tracing/current_tracer
> +echo 1 > $debugfs/tracing/tracing_on
> +exec "$@"
> +------
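> +
> +For example, assuming the script above is saved as ftrace-me.sh (a
> +name chosen only for illustration) and made executable, it will
> +function trace just the command it wraps:
> +
> + # ./ftrace-me.sh ls -ltr
> + # cat trace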
>  
> 
>  function graph tracer
> @@ -1473,16 +2028,18 @@ starts of pointing to a simple return. (Enabling FTRACE will
>  include the -pg switch in the compiling of the kernel.)
>  
>  At compile time every C file object is run through the
> -recordmcount.pl script (located in the scripts directory). This
> -script will process the C object using objdump to find all the
> -locations in the .text section that call mcount. (Note, only the
> -.text section is processed, since processing other sections like
> -.init.text may cause races due to those sections being freed).
> +recordmcount program (located in the scripts directory). This
> +program will parse the ELF headers in the C object to find all
> +the locations in the .text section that call mcount. (Note, only
> +whitelisted .text sections are processed, since processing other
> +sections like .init.text may cause races due to those sections
> +being freed unexpectedly).
>  
>  A new section called "__mcount_loc" is created that holds
>  references to all the mcount call sites in the .text section.
> -This section is compiled back into the original object. The
> -final linker will add all these references into a single table.
> +The recordmcount program re-links this section back into the
> +original object. The final linking stage of the kernel will add all these
> +references into a single table.
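> +
> +As a quick way to see the result (assuming a dynamic ftrace kernel
> +build, and using an object path chosen only as an example), the new
> +section shows up in an object's section headers:
> +
> + # objdump -h kernel/sched/core.o | grep mcount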
>  
>  On boot up, before SMP is initialized, the dynamic ftrace code
>  scans this table and updates all the locations into nops. It
> @@ -1493,13 +2050,25 @@ unloaded, it also removes its functions from the ftrace function
>  list. This is automatic in the module unload code, and the
>  module author does not need to worry about it.
>  
> -When tracing is enabled, kstop_machine is called to prevent
> -races with the CPUS executing code being modified (which can
> -cause the CPU to do undesirable things), and the nops are
> +When tracing is enabled, the process of modifying the function
> +tracepoints is dependent on architecture. The old method is to use
> +kstop_machine to prevent races with the CPUs executing code being
> +modified (which can cause the CPU to do undesirable things, especially
> +if the modified code crosses cache (or page) boundaries), and the nops are
>  patched back to calls. But this time, they do not call mcount
>  (which is just a function stub). They now call into the ftrace
>  infrastructure.
>  
> +The new method of modifying the function tracepoints is to place
> +a breakpoint at the location to be modified, sync all CPUs, modify
> +the rest of the instruction not covered by the breakpoint, sync
> +all CPUs again, and then remove the breakpoint, leaving the finished
> +version of the ftrace call site in place.
> +
> +Some archs do not even need to monkey around with the synchronization,
> +and can just slap the new code on top of the old without any
> +problems with other CPUs executing it at the same time.
> +
>  One special side-effect to the recording of the functions being
>  traced is that we can now selectively choose which functions we
>  wish to trace and which ones we want the mcount calls to remain
> @@ -1530,20 +2099,28 @@ mutex_lock
>  
>  If I am only interested in sys_nanosleep and hrtimer_interrupt:
>  
> - # echo sys_nanosleep hrtimer_interrupt \
> -		> set_ftrace_filter
> + # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
>   # echo function > current_tracer
>   # echo 1 > tracing_on
>   # usleep 1
>   # echo 0 > tracing_on
>   # cat trace
> -# tracer: ftrace
> +# tracer: function
> +#
> +# entries-in-buffer/entries-written: 5/5   #P:4
>  #
> -#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
> -#              | |      |          |         |
> -          usleep-4134  [00]  1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
> -          usleep-4134  [00]  1317.070111: sys_nanosleep <-syscall_call
> -          <idle>-0     [00]  1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
> +          usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
> +          <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
> +          usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
> +          <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
> +          <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
>  
>  To see which functions are being traced, you can cat the file:
>  
> @@ -1571,20 +2148,25 @@ Note: It is better to use quotes to enclose the wild cards,
>  
>  Produces:
>  
> -# tracer: ftrace
> +# tracer: function
>  #
> -#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
> -#              | |      |          |         |
> -            bash-4003  [00]  1480.611794: hrtimer_init <-copy_process
> -            bash-4003  [00]  1480.611941: hrtimer_start <-hrtick_set
> -            bash-4003  [00]  1480.611956: hrtimer_cancel <-hrtick_clear
> -            bash-4003  [00]  1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
> -          <idle>-0     [00]  1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
> -          <idle>-0     [00]  1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
> -          <idle>-0     [00]  1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
> -          <idle>-0     [00]  1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
> -          <idle>-0     [00]  1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
> -
> +# entries-in-buffer/entries-written: 897/897   #P:4
> +#
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
> +          <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
> +          <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
> +          <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
> +          <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
> +          <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
> +          <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
> +          <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
> +          <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
>  
>  Notice that we lost the sys_nanosleep.
>  
> @@ -1651,19 +2233,29 @@ traced.
>  
>  Produces:
>  
> -# tracer: ftrace
> +# tracer: function
> +#
> +# entries-in-buffer/entries-written: 39608/39608   #P:4
>  #
> -#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
> -#              | |      |          |         |
> -            bash-4043  [01]   115.281644: finish_task_switch <-schedule
> -            bash-4043  [01]   115.281645: hrtick_set <-schedule
> -            bash-4043  [01]   115.281645: hrtick_clear <-hrtick_set
> -            bash-4043  [01]   115.281646: wait_for_completion <-__stop_machine_run
> -            bash-4043  [01]   115.281647: wait_for_common <-wait_for_completion
> -            bash-4043  [01]   115.281647: kthread_stop <-stop_machine_run
> -            bash-4043  [01]   115.281648: init_waitqueue_head <-kthread_stop
> -            bash-4043  [01]   115.281648: wake_up_process <-kthread_stop
> -            bash-4043  [01]   115.281649: try_to_wake_up <-wake_up_process
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
> +            bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
> +            bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
> +            bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
> +            bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
> +            bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
> +            bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
> +            bash-1994  [000] ....  4342.324899: do_truncate <-do_last
> +            bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
> +            bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
> +            bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
> +            bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
> +            bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
>  
>  We can see that there's no more lock or preempt tracing.
>  
> @@ -1729,6 +2321,28 @@ this special filter via:
>   echo > set_graph_function
>  
> 
> +ftrace_enabled
> +--------------
> +
> +Note, the proc sysctl ftrace_enabled is a big on/off switch for the
> +function tracer. By default it is enabled (when function tracing is
> +enabled in the kernel). If it is disabled, all function tracing is
> +disabled. This includes not only the function tracers for ftrace, but
> +also for any other uses (perf, kprobes, stack tracing, profiling, etc).
> +
> +Please disable this with care.
> +
> +This can be disabled (and enabled) with:
> +
> +  sysctl kernel.ftrace_enabled=0
> +  sysctl kernel.ftrace_enabled=1
> +
> + or
> +
> +  echo 0 > /proc/sys/kernel/ftrace_enabled
> +  echo 1 > /proc/sys/kernel/ftrace_enabled
> +
> +
>  Filter commands
>  ---------------
>  
> @@ -1763,12 +2377,58 @@ The following commands are supported:
>  
>     echo '__schedule_bug:traceoff:5' > set_ftrace_filter
>  
> +  To always disable tracing when __schedule_bug is hit:
> +
> +   echo '__schedule_bug:traceoff' > set_ftrace_filter
> +
>    These commands are cumulative whether or not they are appended
>    to set_ftrace_filter. To remove a command, prepend it by '!'
>    and drop the parameter:
>  
> +   echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
> +
> +    The above removes the traceoff command for __schedule_bug
> +    that has a counter. To remove commands without counters:
> +
>     echo '!__schedule_bug:traceoff' > set_ftrace_filter
>  
> +- snapshot
> +  Will cause a snapshot to be triggered when the function is hit.
> +
> +   echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
> +
> +  To only snapshot once:
> +
> +   echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
> +
> +  To remove the above commands:
> +
> +   echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
> +   echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
> +
> +- enable_event/disable_event
> +  These commands can enable or disable a trace event. Note, because
> +  function tracing callbacks are very sensitive, when these commands
> +  are registered, the trace point is activated, but disabled in
> +  a "soft" mode. That is, the tracepoint will be called, but
> +  just will not be traced. The event tracepoint stays in this mode
> +  as long as there's a command that triggers it.
> +
> +   echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
> +   	 set_ftrace_filter
> +
> +  The format is:
> +
> +    <function>:enable_event:<system>:<event>[:count]
> +    <function>:disable_event:<system>:<event>[:count]
> +
> +  To remove the event commands:
> +
> +   echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
> +   	 set_ftrace_filter
> +   echo '!schedule:disable_event:sched:sched_switch' > \
> +   	 set_ftrace_filter
>  
>  trace_pipe
>  ----------
> @@ -1787,28 +2447,31 @@ different. The trace is live.
>   # cat trace
>  # tracer: function
>  #
> -#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
> -#              | |      |          |         |
> +# entries-in-buffer/entries-written: 0/0   #P:4
> +#
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
>  
>   #
>   # cat /tmp/trace.out
> -            bash-4043  [00] 41.267106: finish_task_switch <-schedule
> -            bash-4043  [00] 41.267106: hrtick_set <-schedule
> -            bash-4043  [00] 41.267107: hrtick_clear <-hrtick_set
> -            bash-4043  [00] 41.267108: wait_for_completion <-__stop_machine_run
> -            bash-4043  [00] 41.267108: wait_for_common <-wait_for_completion
> -            bash-4043  [00] 41.267109: kthread_stop <-stop_machine_run
> -            bash-4043  [00] 41.267109: init_waitqueue_head <-kthread_stop
> -            bash-4043  [00] 41.267110: wake_up_process <-kthread_stop
> -            bash-4043  [00] 41.267110: try_to_wake_up <-wake_up_process
> -            bash-4043  [00] 41.267111: select_task_rq_rt <-try_to_wake_up
> +            bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
> +            bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
> +            bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
> +            bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
> +            bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
> +            bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
> +            bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
> +            bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
> +            bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
>  
> 
>  Note, reading the trace_pipe file will block until more input is
> -added. By changing the tracer, trace_pipe will issue an EOF. We
> -needed to set the function tracer _before_ we "cat" the
> -trace_pipe file.
> -
> +added.
>  
>  trace entries
>  -------------
> @@ -1817,31 +2480,50 @@ Having too much or not enough data can be troublesome in
>  diagnosing an issue in the kernel. The file buffer_size_kb is
>  used to modify the size of the internal trace buffers. The
>  number listed is the number of entries that can be recorded per
> -CPU. To know the full size, multiply the number of possible CPUS
> +CPU. To know the full size, multiply the number of possible CPUs
>  with the number of entries.
>  
>   # cat buffer_size_kb
>  1408 (units kilobytes)
>  
> -Note, to modify this, you must have tracing completely disabled.
> -To do that, echo "nop" into the current_tracer. If the
> -current_tracer is not set to "nop", an EINVAL error will be
> -returned.
> +Or simply read buffer_total_size_kb:
> +
> + # cat buffer_total_size_kb
> +5632
> +
> +To modify the buffer, simply echo in a number (in 1024 byte segments).
>  
> - # echo nop > current_tracer
>   # echo 10000 > buffer_size_kb
>   # cat buffer_size_kb
>  10000 (units kilobytes)
>  
> -The number of pages which will be allocated is limited to a
> -percentage of available memory. Allocating too much will produce
> -an error.
> +It will try to allocate as much as it can. If you allocate too
> +much, it can cause the Out-Of-Memory killer to trigger.
>  
>   # echo 1000000000000 > buffer_size_kb
>  -bash: echo: write error: Cannot allocate memory
>   # cat buffer_size_kb
>  85
>  
> +The per_cpu buffers can be changed individually as well:
> +
> + # echo 10000 > per_cpu/cpu0/buffer_size_kb
> + # echo 100 > per_cpu/cpu1/buffer_size_kb
> +
> +When the per_cpu buffers are not all the same size, the buffer_size_kb
> +at the top level will just show an X:
> +
> + # cat buffer_size_kb
> +X
> +
> +This is where the buffer_total_size_kb is useful:
> +
> + # cat buffer_total_size_kb 
> +12916
> +
> +Writing to the top level buffer_size_kb will reset all the per-CPU
> +buffers to the same size again.
> +
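> +The total is just the per-CPU size summed over all of the CPU buffers.
> +For example, with the sizes shown earlier (an illustrative check on a
> +machine with 4 CPUs):
> +
> + # cat buffer_size_kb
> +1408
> + # cat buffer_total_size_kb
> +5632
> +
> +That is, 1408 KB per CPU times 4 CPUs (the #P:4 in the trace header)
> +gives the 5632 KB total.
> +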
>  Snapshot
>  --------
>  CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
> @@ -1925,7 +2607,188 @@ bash: echo: write error: Device or resource busy
>   # cat snapshot
>  cat: snapshot: Device or resource busy
>  
> +
> +Instances
> +---------
> +In the debugfs tracing directory is a directory called "instances".
> +New directories can be created inside of it using mkdir, and removed
> +with rmdir. The directory created with mkdir in this directory will
> +already contain files and other directories after it is created.
> +
> + # mkdir instances/foo
> + # ls instances/foo
> +buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
> +set_event  snapshot  trace  trace_clock  trace_marker  trace_options
> +trace_pipe  tracing_on
> +
> +As you can see, the new directory looks similar to the tracing directory
> +itself. In fact, it is very similar, except that the buffer and
> +events are independent of the main directory, and of any other
> +instances that are created.
> +
> +The files in the new directory work just like the files with the
> +same name in the tracing directory, except that the buffer used
> +is a separate, new buffer. The files affect that buffer but do not
> +affect the main buffer with the exception of trace_options. Currently,
> +the trace_options affect all instances and the top level buffer
> +the same, but this may change in future releases. That is, options
> +may become specific to the instance they reside in.
> +
> +Notice that none of the function tracer files are there, nor are
> +current_tracer and available_tracers. This is because the buffers
> +can currently only have events enabled for them.
> +
> + # mkdir instances/foo
> + # mkdir instances/bar
> + # mkdir instances/zoot
> + # echo 100000 > buffer_size_kb
> + # echo 1000 > instances/foo/buffer_size_kb
> + # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
> + # echo function > current_tracer
> + # echo 1 > instances/foo/events/sched/sched_wakeup/enable
> + # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
> + # echo 1 > instances/foo/events/sched/sched_switch/enable
> + # echo 1 > instances/bar/events/irq/enable
> + # echo 1 > instances/zoot/events/syscalls/enable
> + # cat trace_pipe
> +CPU:2 [LOST 11745 EVENTS]
> +            bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
> +            bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
> +            bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
> +            bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
> +            bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
> +            bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
> +            bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
> +            bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
> +            bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
> +            bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
> +            bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
> +[...]
> +
> + # cat instances/foo/trace_pipe
> +            bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
> +            bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
> +          <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
> +          <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
> +     rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> +            bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
> +            bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
> +            bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
> +     kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
> +     kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
> +[...]
> +
> + # cat instances/bar/trace_pipe
> +     migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
> +          <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
> +            bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
> +            bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
> +            bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
> +            bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
> +            bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
> +            bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
> +            sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
> +            sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
> +            sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
> +            sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
> +[...]
> +
> + # cat instances/zoot/trace
> +# tracer: nop
> +#
> +# entries-in-buffer/entries-written: 18996/18996   #P:4
> +#
> +#                              _-----=> irqs-off
> +#                             / _----=> need-resched
> +#                            | / _---=> hardirq/softirq
> +#                            || / _--=> preempt-depth
> +#                            ||| /     delay
> +#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
> +#              | |       |   ||||       |         |
> +            bash-1998  [000] d...   140.733501: sys_write -> 0x2
> +            bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
> +            bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
> +            bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
> +            bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
> +            bash-1998  [000] d...   140.733510: sys_close(fd: a)
> +            bash-1998  [000] d...   140.733510: sys_close -> 0x0
> +            bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
> +            bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
> +            bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
> +            bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
> +
> +You can see that the trace of the topmost trace buffer shows only
> +the function tracing. The foo instance displays wakeups and task
> +switches.
> +
> +To remove the instances, simply delete their directories:
> +
> + # rmdir instances/foo
> + # rmdir instances/bar
> + # rmdir instances/zoot
> +
> +Note, if a process has a trace file open in one of the instance
> +directories, the rmdir will fail with EBUSY.
> +
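> +For example, a scripted session (an illustrative sketch, assuming the
> +"net" event system is available) might look like:
> +
> + # mkdir instances/net-debug
> + # echo 1 > instances/net-debug/events/net/enable
> + # sleep 10
> + # cat instances/net-debug/trace > /tmp/net-debug.txt
> + # echo 0 > instances/net-debug/events/net/enable
> + # rmdir instances/net-debug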
> +
> +Stack trace
>  -----------
> +Since the kernel has a fixed-size stack, it is important not to
> +waste it in functions. A kernel developer must be conscious of
> +what they allocate on the stack. If they add too much, the system
> +can be in danger of a stack overflow, and corruption will occur,
> +usually leading to a system panic.
> +
> +There are some tools that check this, usually with interrupts
> +periodically checking usage. But being able to perform a check
> +at every function call is much more useful. As ftrace provides
> +a function tracer, it is convenient to check the stack size
> +at every function call. This is enabled via the stack tracer.
> +
> +CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
> +To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
> +
> + # echo 1 > /proc/sys/kernel/stack_tracer_enabled
> +
> +You can also enable it from the kernel command line to trace
> +the stack size of the kernel during boot up, by adding "stacktrace"
> +to the kernel command line.
> +
> +After running it for a few minutes, the output looks like:
> +
> + # cat stack_max_size
> +2928
> +
> + # cat stack_trace
> +        Depth    Size   Location    (18 entries)
> +        -----    ----   --------
> +  0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
> +  1)     2704     160   find_busiest_group+0x31/0x1f1
> +  2)     2544     256   load_balance+0xd9/0x662
> +  3)     2288      80   idle_balance+0xbb/0x130
> +  4)     2208     128   __schedule+0x26e/0x5b9
> +  5)     2080      16   schedule+0x64/0x66
> +  6)     2064     128   schedule_timeout+0x34/0xe0
> +  7)     1936     112   wait_for_common+0x97/0xf1
> +  8)     1824      16   wait_for_completion+0x1d/0x1f
> +  9)     1808     128   flush_work+0xfe/0x119
> + 10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
> + 11)     1664      48   input_available_p+0x1d/0x5c
> + 12)     1616      48   n_tty_poll+0x6d/0x134
> + 13)     1568      64   tty_poll+0x64/0x7f
> + 14)     1504     880   do_select+0x31e/0x511
> + 15)      624     400   core_sys_select+0x177/0x216
> + 16)      224      96   sys_select+0x91/0xb9
> + 17)      128     128   system_call_fastpath+0x16/0x1b
> +
> +Note, if -mfentry is being used by gcc, functions get traced before
> +they set up the stack frame. This means that leaf level functions
> +are not tested by the stack tracer when -mfentry is used.
> +
> +Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
> +
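> +A typical workflow (an illustrative sketch) is to reset the recorded
> +maximum by writing a '0' into stack_max_size, run the workload of
> +interest, and then inspect the result:
> +
> + # echo 0 > stack_max_size
> + # <run the workload>
> + # cat stack_max_size
> + # cat stack_trace
> +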
> +---------
>  
>  More details can be found in the source code, in the
>  kernel/trace/*.c files.
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index e5ca8ef..832422d 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -259,8 +259,10 @@ struct ftrace_probe_ops {
>  	void			(*func)(unsigned long ip,
>  					unsigned long parent_ip,
>  					void **data);
> -	int			(*callback)(unsigned long ip, void **data);
> -	void			(*free)(void **data);
> +	int			(*init)(struct ftrace_probe_ops *ops,
> +					unsigned long ip, void **data);
> +	void			(*free)(struct ftrace_probe_ops *ops,
> +					unsigned long ip, void **data);
>  	int			(*print)(struct seq_file *m,
>  					 unsigned long ip,
>  					 struct ftrace_probe_ops *ops,
> diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
> index 13a54d0..4e28b01 100644
> --- a/include/linux/ftrace_event.h
> +++ b/include/linux/ftrace_event.h
> @@ -8,6 +8,7 @@
>  #include <linux/perf_event.h>
>  
>  struct trace_array;
> +struct trace_buffer;
>  struct tracer;
>  struct dentry;
>  
> @@ -38,6 +39,12 @@ const char *ftrace_print_symbols_seq_u64(struct trace_seq *p,
>  const char *ftrace_print_hex_seq(struct trace_seq *p,
>  				 const unsigned char *buf, int len);
>  
> +struct trace_iterator;
> +struct trace_event;
> +
> +int ftrace_raw_output_prep(struct trace_iterator *iter,
> +			   struct trace_event *event);
> +
>  /*
>   * The trace entry - the most basic unit of tracing. This is what
>   * is printed in the end as a single line in the trace output, such as:
> @@ -61,6 +68,7 @@ struct trace_entry {
>  struct trace_iterator {
>  	struct trace_array	*tr;
>  	struct tracer		*trace;
> +	struct trace_buffer	*trace_buffer;
>  	void			*private;
>  	int			cpu_file;
>  	struct mutex		mutex;
> @@ -95,8 +103,6 @@ enum trace_iter_flags {
>  };
>  
> 
> -struct trace_event;
> -
>  typedef enum print_line_t (*trace_print_func)(struct trace_iterator *iter,
>  				      int flags, struct trace_event *event);
>  
> @@ -128,6 +134,13 @@ enum print_line_t {
>  void tracing_generic_entry_update(struct trace_entry *entry,
>  				  unsigned long flags,
>  				  int pc);
> +struct ftrace_event_file;
> +
> +struct ring_buffer_event *
> +trace_event_buffer_lock_reserve(struct ring_buffer **current_buffer,
> +				struct ftrace_event_file *ftrace_file,
> +				int type, unsigned long len,
> +				unsigned long flags, int pc);
>  struct ring_buffer_event *
>  trace_current_buffer_lock_reserve(struct ring_buffer **current_buffer,
>  				  int type, unsigned long len,
> @@ -182,53 +195,49 @@ extern int ftrace_event_reg(struct ftrace_event_call *event,
>  			    enum trace_reg type, void *data);
>  
>  enum {
> -	TRACE_EVENT_FL_ENABLED_BIT,
>  	TRACE_EVENT_FL_FILTERED_BIT,
> -	TRACE_EVENT_FL_RECORDED_CMD_BIT,
>  	TRACE_EVENT_FL_CAP_ANY_BIT,
>  	TRACE_EVENT_FL_NO_SET_FILTER_BIT,
>  	TRACE_EVENT_FL_IGNORE_ENABLE_BIT,
> +	TRACE_EVENT_FL_WAS_ENABLED_BIT,
>  };
>  
> +/*
> + * Event flags:
> + *  FILTERED	  - The event has a filter attached
> + *  CAP_ANY	  - Any user can enable for perf
> + *  NO_SET_FILTER - Set when filter has error and is to be ignored
> + *  IGNORE_ENABLE - For ftrace internal events, do not enable with debugfs file
> + *  WAS_ENABLED   - Set and stays set when an event was ever enabled
> + *                    (used for module unloading, if a module event is enabled,
> + *                     it is best to clear the buffers that used it).
> + */
>  enum {
> -	TRACE_EVENT_FL_ENABLED		= (1 << TRACE_EVENT_FL_ENABLED_BIT),
>  	TRACE_EVENT_FL_FILTERED		= (1 << TRACE_EVENT_FL_FILTERED_BIT),
> -	TRACE_EVENT_FL_RECORDED_CMD	= (1 << TRACE_EVENT_FL_RECORDED_CMD_BIT),
>  	TRACE_EVENT_FL_CAP_ANY		= (1 << TRACE_EVENT_FL_CAP_ANY_BIT),
>  	TRACE_EVENT_FL_NO_SET_FILTER	= (1 << TRACE_EVENT_FL_NO_SET_FILTER_BIT),
>  	TRACE_EVENT_FL_IGNORE_ENABLE	= (1 << TRACE_EVENT_FL_IGNORE_ENABLE_BIT),
> +	TRACE_EVENT_FL_WAS_ENABLED	= (1 << TRACE_EVENT_FL_WAS_ENABLED_BIT),
>  };
>  
>  struct ftrace_event_call {
>  	struct list_head	list;
>  	struct ftrace_event_class *class;
>  	char			*name;
> -	struct dentry		*dir;
>  	struct trace_event	event;
>  	const char		*print_fmt;
>  	struct event_filter	*filter;
> +	struct list_head	*files;
>  	void			*mod;
>  	void			*data;
> -
>  	/*
> -	 * 32 bit flags:
> -	 *   bit 1:		enabled
> -	 *   bit 2:		filter_active
> -	 *   bit 3:		enabled cmd record
> -	 *   bit 4:		allow trace by non root (cap any)
> -	 *   bit 5:		failed to apply filter
> -	 *   bit 6:		ftrace internal event (do not enable)
> -	 *
> -	 * Changes to flags must hold the event_mutex.
> -	 *
> -	 * Note: Reads of flags do not hold the event_mutex since
> -	 * they occur in critical sections. But the way flags
> -	 * is currently used, these changes do no affect the code
> -	 * except that when a change is made, it may have a slight
> -	 * delay in propagating the changes to other CPUs due to
> -	 * caching and such.
> +	 *   bit 0:		filter_active
> +	 *   bit 1:		allow trace by non root (cap any)
> +	 *   bit 2:		failed to apply filter
> +	 *   bit 3:		ftrace internal event (do not enable)
> +	 *   bit 4:		Event was enabled by module
>  	 */
> -	unsigned int		flags;
> +	int			flags; /* static flags of different events */
>  
>  #ifdef CONFIG_PERF_EVENTS
>  	int				perf_refcount;
> @@ -236,6 +245,56 @@ struct ftrace_event_call {
>  #endif
>  };
>  
> +struct trace_array;
> +struct ftrace_subsystem_dir;
> +
> +enum {
> +	FTRACE_EVENT_FL_ENABLED_BIT,
> +	FTRACE_EVENT_FL_RECORDED_CMD_BIT,
> +	FTRACE_EVENT_FL_SOFT_MODE_BIT,
> +	FTRACE_EVENT_FL_SOFT_DISABLED_BIT,
> +};
> +
> +/*
> + * Ftrace event file flags:
> + *  ENABLED	  - The event is enabled
> + *  RECORDED_CMD  - The comms should be recorded at sched_switch
> + *  SOFT_MODE     - The event is enabled/disabled by SOFT_DISABLED
> + *  SOFT_DISABLED - When set, do not trace the event (even though its
> + *                   tracepoint may be enabled)
> + */
> +enum {
> +	FTRACE_EVENT_FL_ENABLED		= (1 << FTRACE_EVENT_FL_ENABLED_BIT),
> +	FTRACE_EVENT_FL_RECORDED_CMD	= (1 << FTRACE_EVENT_FL_RECORDED_CMD_BIT),
> +	FTRACE_EVENT_FL_SOFT_MODE	= (1 << FTRACE_EVENT_FL_SOFT_MODE_BIT),
> +	FTRACE_EVENT_FL_SOFT_DISABLED	= (1 << FTRACE_EVENT_FL_SOFT_DISABLED_BIT),
> +};
> +
> +struct ftrace_event_file {
> +	struct list_head		list;
> +	struct ftrace_event_call	*event_call;
> +	struct dentry			*dir;
> +	struct trace_array		*tr;
> +	struct ftrace_subsystem_dir	*system;
> +
> +	/*
> +	 * 32 bit flags:
> +	 *   bit 0:		enabled
> +	 *   bit 1:		enabled cmd record
> +	 *   bit 2:		enable/disable with the soft disable bit
> +	 *   bit 3:		soft disabled
> +	 *
> +	 * Note: The bits must be set atomically to prevent races
> +	 * from other writers. Reads of flags do not need to be in
> +	 * sync as they occur in critical sections. But the way flags
> +	 * is currently used, these changes do not affect the code
> +	 * except that when a change is made, it may have a slight
> +	 * delay in propagating the changes to other CPUs due to
> +	 * caching and such. Which is mostly OK ;-)
> +	 */
> +	unsigned long		flags;
> +};
> +
>  #define __TRACE_EVENT_FLAGS(name, value)				\
>  	static int __init trace_init_flags_##name(void)			\
>  	{								\
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index c566927..239dbb9 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -483,6 +483,8 @@ enum ftrace_dump_mode {
>  void tracing_on(void);
>  void tracing_off(void);
>  int tracing_is_on(void);
> +void tracing_snapshot(void);
> +void tracing_snapshot_alloc(void);
>  
>  extern void tracing_start(void);
>  extern void tracing_stop(void);
> @@ -512,10 +514,32 @@ do {									\
>   *
>   * This is intended as a debugging tool for the developer only.
>   * Please refrain from leaving trace_printks scattered around in
> - * your code.
> + * your code. (Extra memory is used for special buffers that are
> + * allocated when trace_printk() is used)
> + *
> + * A little optimization trick is done here. If there's only one
> + * argument, there's no need to scan the string for printf formats.
> + * The trace_puts() will suffice. But how can we take advantage of
> + * using trace_puts() when trace_printk() has only one argument?
> + * By stringifying the args and checking the size we can tell
> + * whether or not there are args. __stringify((__VA_ARGS__)) will
> + * turn into "()\0" with a size of 3 when there are no args, anything
> + * else will be bigger. All we need to do is define a string to this,
> + * and then take its size and compare to 3. If it's bigger, use
> + * do_trace_printk(); otherwise, optimize it to trace_puts(). Then just
> + * let gcc optimize the rest.
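> + *
> + * For example (illustrative):
> + *   trace_printk("in fast path\n")  - __stringify(()) is "()",  size 3, use trace_puts()
> + *   trace_printk("x=%d\n", x)       - __stringify((x)) is "(x)", size 4, use do_trace_printk()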
>   */
>  
> -#define trace_printk(fmt, args...)					\
> +#define trace_printk(fmt, ...)				\
> +do {							\
> +	char _______STR[] = __stringify((__VA_ARGS__));	\
> +	if (sizeof(_______STR) > 3)			\
> +		do_trace_printk(fmt, ##__VA_ARGS__);	\
> +	else						\
> +		trace_puts(fmt);			\
> +} while (0)
> +
> +#define do_trace_printk(fmt, args...)					\
>  do {									\
>  	static const char *trace_printk_fmt				\
>  		__attribute__((section("__trace_printk_fmt"))) =	\
> @@ -535,7 +559,45 @@ int __trace_bprintk(unsigned long ip, const char *fmt, ...);
>  extern __printf(2, 3)
>  int __trace_printk(unsigned long ip, const char *fmt, ...);
>  
> -extern void trace_dump_stack(void);
> +/**
> + * trace_puts - write a string into the ftrace buffer
> + * @str: the string to record
> + *
> + * Note: __trace_bputs is an internal function for trace_puts and
> + *       the @ip is passed in via the trace_puts macro.
> + *
> + * This is similar to trace_printk() but is made for those really fast
> + * paths where a developer wants the least amount of "Heisenbug" effects,
> + * and where the processing of the print format is still too much.
> + *
> + * This function allows a kernel developer to debug fast path sections
> + * that printk is not appropriate for. By scattering in various
> + * printk-like tracing in the code, a developer can quickly see
> + * where problems are occurring.
> + *
> + * This is intended as a debugging tool for the developer only.
> + * Please refrain from leaving trace_puts scattered around in
> + * your code. (Extra memory is used for special buffers that are
> + * allocated when trace_puts() is used)
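> + *
> + * Example (illustrative):
> + *   trace_puts("reached the fast path\n");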
> + *
> + * Returns: 0 if nothing was written, positive # if string was.
> + *  (1 when __trace_bputs is used, strlen(str) when __trace_puts is used)
> + */
> +
> +extern int __trace_bputs(unsigned long ip, const char *str);
> +extern int __trace_puts(unsigned long ip, const char *str, int size);
> +#define trace_puts(str) ({						\
> +	static const char *trace_printk_fmt				\
> +		__attribute__((section("__trace_printk_fmt"))) =	\
> +		__builtin_constant_p(str) ? str : NULL;			\
> +									\
> +	if (__builtin_constant_p(str))					\
> +		__trace_bputs(_THIS_IP_, trace_printk_fmt);		\
> +	else								\
> +		__trace_puts(_THIS_IP_, str, strlen(str));		\
> +})
> +
> +extern void trace_dump_stack(int skip);
>  
>  /*
>   * The double __builtin_constant_p is because gcc will give us an error
> @@ -570,6 +632,8 @@ static inline void trace_dump_stack(void) { }
>  static inline void tracing_on(void) { }
>  static inline void tracing_off(void) { }
>  static inline int tracing_is_on(void) { return 0; }
> +static inline void tracing_snapshot(void) { }
> +static inline void tracing_snapshot_alloc(void) { }
>  
>  static inline __printf(1, 2)
>  int trace_printk(const char *fmt, ...)
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 1342e69..d69cf63 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -4,6 +4,7 @@
>  #include <linux/kmemcheck.h>
>  #include <linux/mm.h>
>  #include <linux/seq_file.h>
> +#include <linux/poll.h>
>  
>  struct ring_buffer;
>  struct ring_buffer_iter;
> @@ -96,6 +97,11 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
>  	__ring_buffer_alloc((size), (flags), &__key);	\
>  })
>  
> +void ring_buffer_wait(struct ring_buffer *buffer, int cpu);
> +int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
> +			  struct file *filp, poll_table *poll_table);
> +
> +
>  #define RING_BUFFER_ALL_CPUS -1
>  
>  void ring_buffer_free(struct ring_buffer *buffer);
> diff --git a/include/linux/trace_clock.h b/include/linux/trace_clock.h
> index d563f37..1d7ca27 100644
> --- a/include/linux/trace_clock.h
> +++ b/include/linux/trace_clock.h
> @@ -16,6 +16,7 @@
>  
>  extern u64 notrace trace_clock_local(void);
>  extern u64 notrace trace_clock(void);
> +extern u64 notrace trace_clock_jiffies(void);
>  extern u64 notrace trace_clock_global(void);
>  extern u64 notrace trace_clock_counter(void);
>  
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 40dc5e8..4bda044 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -227,29 +227,18 @@ static notrace enum print_line_t					\
>  ftrace_raw_output_##call(struct trace_iterator *iter, int flags,	\
>  			 struct trace_event *trace_event)		\
>  {									\
> -	struct ftrace_event_call *event;				\
>  	struct trace_seq *s = &iter->seq;				\
> +	struct trace_seq __maybe_unused *p = &iter->tmp_seq;		\
>  	struct ftrace_raw_##call *field;				\
> -	struct trace_entry *entry;					\
> -	struct trace_seq *p = &iter->tmp_seq;				\
>  	int ret;							\
>  									\
> -	event = container_of(trace_event, struct ftrace_event_call,	\
> -			     event);					\
> -									\
> -	entry = iter->ent;						\
> -									\
> -	if (entry->type != event->event.type) {				\
> -		WARN_ON_ONCE(1);					\
> -		return TRACE_TYPE_UNHANDLED;				\
> -	}								\
> -									\
> -	field = (typeof(field))entry;					\
> +	field = (typeof(field))iter->ent;				\
>  									\
> -	trace_seq_init(p);						\
> -	ret = trace_seq_printf(s, "%s: ", event->name);			\
> +	ret = ftrace_raw_output_prep(iter, trace_event);		\
>  	if (ret)							\
> -		ret = trace_seq_printf(s, print);			\
> +		return ret;						\
> +									\
> +	ret = trace_seq_printf(s, print);				\
>  	if (!ret)							\
>  		return TRACE_TYPE_PARTIAL_LINE;				\
>  									\
> @@ -335,7 +324,7 @@ static struct trace_event_functions ftrace_event_type_funcs_##call = {	\
>  
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, func, print)	\
> -static int notrace							\
> +static int notrace __init						\
>  ftrace_define_fields_##call(struct ftrace_event_call *event_call)	\
>  {									\
>  	struct ftrace_raw_##call field;					\
> @@ -414,7 +403,8 @@ static inline notrace int ftrace_get_offsets_##call(			\
>   *
>   * static void ftrace_raw_event_<call>(void *__data, proto)
>   * {
> - *	struct ftrace_event_call *event_call = __data;
> + *	struct ftrace_event_file *ftrace_file = __data;
> + *	struct ftrace_event_call *event_call = ftrace_file->event_call;
>   *	struct ftrace_data_offsets_<call> __maybe_unused __data_offsets;
>   *	struct ring_buffer_event *event;
>   *	struct ftrace_raw_<call> *entry; <-- defined in stage 1
> @@ -423,12 +413,16 @@ static inline notrace int ftrace_get_offsets_##call(			\
>   *	int __data_size;
>   *	int pc;
>   *
> + *	if (test_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT,
> + *		     &ftrace_file->flags))
> + *		return;
> + *
>   *	local_save_flags(irq_flags);
>   *	pc = preempt_count();
>   *
>   *	__data_size = ftrace_get_offsets_<call>(&__data_offsets, args);
>   *
> - *	event = trace_current_buffer_lock_reserve(&buffer,
> + *	event = trace_event_buffer_lock_reserve(&buffer, ftrace_file,
>   *				  event_<call>->event.type,
>   *				  sizeof(*entry) + __data_size,
>   *				  irq_flags, pc);
> @@ -440,7 +434,7 @@ static inline notrace int ftrace_get_offsets_##call(			\
>   *			   __array macros.
>   *
>   *	if (!filter_current_check_discard(buffer, event_call, entry, event))
> - *		trace_current_buffer_unlock_commit(buffer,
> + *		trace_nowake_buffer_unlock_commit(buffer,
>   *						   event, irq_flags, pc);
>   * }
>   *
> @@ -518,7 +512,8 @@ static inline notrace int ftrace_get_offsets_##call(			\
>  static notrace void							\
>  ftrace_raw_event_##call(void *__data, proto)				\
>  {									\
> -	struct ftrace_event_call *event_call = __data;			\
> +	struct ftrace_event_file *ftrace_file = __data;			\
> +	struct ftrace_event_call *event_call = ftrace_file->event_call;	\
>  	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
>  	struct ring_buffer_event *event;				\
>  	struct ftrace_raw_##call *entry;				\
> @@ -527,12 +522,16 @@ ftrace_raw_event_##call(void *__data, proto)				\
>  	int __data_size;						\
>  	int pc;								\
>  									\
> +	if (test_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT,			\
> +		     &ftrace_file->flags))				\
> +		return;							\
> +									\
>  	local_save_flags(irq_flags);					\
>  	pc = preempt_count();						\
>  									\
>  	__data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
>  									\
> -	event = trace_current_buffer_lock_reserve(&buffer,		\
> +	event = trace_event_buffer_lock_reserve(&buffer, ftrace_file,	\
>  				 event_call->event.type,		\
>  				 sizeof(*entry) + __data_size,		\
>  				 irq_flags, pc);			\
> @@ -581,7 +580,7 @@ static inline void ftrace_test_probe_##call(void)			\
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  _TRACE_PERF_PROTO(call, PARAMS(proto));					\
>  static const char print_fmt_##call[] = print;				\
> -static struct ftrace_event_class __used event_class_##call = {		\
> +static struct ftrace_event_class __used __refdata event_class_##call = { \
>  	.system			= __stringify(TRACE_SYSTEM),		\
>  	.define_fields		= ftrace_define_fields_##call,		\
>  	.fields			= LIST_HEAD_INIT(event_class_##call.fields),\
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index b516a8e..0b5ecf5 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -191,6 +191,8 @@ config IRQSOFF_TRACER
>  	select GENERIC_TRACER
>  	select TRACER_MAX_TRACE
>  	select RING_BUFFER_ALLOW_SWAP
> +	select TRACER_SNAPSHOT
> +	select TRACER_SNAPSHOT_PER_CPU_SWAP
>  	help
>  	  This option measures the time spent in irqs-off critical
>  	  sections, with microsecond accuracy.
> @@ -213,6 +215,8 @@ config PREEMPT_TRACER
>  	select GENERIC_TRACER
>  	select TRACER_MAX_TRACE
>  	select RING_BUFFER_ALLOW_SWAP
> +	select TRACER_SNAPSHOT
> +	select TRACER_SNAPSHOT_PER_CPU_SWAP
>  	help
>  	  This option measures the time spent in preemption-off critical
>  	  sections, with microsecond accuracy.
> @@ -232,6 +236,7 @@ config SCHED_TRACER
>  	select GENERIC_TRACER
>  	select CONTEXT_SWITCH_TRACER
>  	select TRACER_MAX_TRACE
> +	select TRACER_SNAPSHOT
>  	help
>  	  This tracer tracks the latency of the highest priority task
>  	  to be scheduled in, starting from the point it has woken up.
> @@ -263,6 +268,27 @@ config TRACER_SNAPSHOT
>  	      echo 1 > /sys/kernel/debug/tracing/snapshot
>  	      cat snapshot
>  
> +config TRACER_SNAPSHOT_PER_CPU_SWAP
> +        bool "Allow snapshot to swap per CPU"
> +	depends on TRACER_SNAPSHOT
> +	select RING_BUFFER_ALLOW_SWAP
> +	help
> +	  Allow doing a snapshot of a single CPU buffer instead of a
> +	  full swap (all buffers). If this is set, then the following is
> +	  allowed:
> +
> +	      echo 1 > /sys/kernel/debug/tracing/per_cpu/cpu2/snapshot
> +
> +	  After which, only the tracing buffer for CPU 2 is swapped with
> +	  the main tracing buffer, and the other CPU buffers remain the same.
> +
> +	  When this is enabled, it adds a little more overhead to the
> +	  trace recording, as it needs to add some checks to synchronize
> +	  recording with swaps. But this does not affect the performance
> +	  of the overall system. This is enabled by default when the preempt
> +	  or irq latency tracers are enabled, as those need to swap as well
> +	  and already add the overhead (plus a lot more).
> +
>  config TRACE_BRANCH_PROFILING
>  	bool
>  	select GENERIC_TRACER
> @@ -539,6 +565,29 @@ config RING_BUFFER_BENCHMARK
>  
>  	  If unsure, say N.
>  
> +config RING_BUFFER_STARTUP_TEST
> +       bool "Ring buffer startup self test"
> +       depends on RING_BUFFER
> +       help
> +         Run a simple self test on the ring buffer on boot up. Late in the
> +	 kernel boot sequence, the test will start, kicking off
> +	 a thread per CPU. Each thread will write events of various sizes
> +	 into the ring buffer. Another thread is created to send IPIs
> +	 to each of the threads, where the IPI handler will also write
> +	 to the ring buffer, to test/stress the nesting ability.
> +	 If any anomalies are discovered, a warning will be displayed
> +	 and all ring buffers will be disabled.
> +
> +	 The test runs for 10 seconds. This will slow your boot time
> +	 by at least 10 more seconds.
> +
> +	 At the end of the test, statistics and more checks are done.
> +	 It will output the stats of each per-CPU buffer: what
> +	 was written, the sizes, what was read, what was lost, and
> +	 other similar details.
> +
> +	 If unsure, say N.
> +
>  endif # FTRACE
>  
>  endif # TRACING_SUPPORT
> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
> index 71259e2..90a5505 100644
> --- a/kernel/trace/blktrace.c
> +++ b/kernel/trace/blktrace.c
> @@ -72,7 +72,7 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
>  	bool blk_tracer = blk_tracer_enabled;
>  
>  	if (blk_tracer) {
> -		buffer = blk_tr->buffer;
> +		buffer = blk_tr->trace_buffer.buffer;
>  		pc = preempt_count();
>  		event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
>  						  sizeof(*t) + len,
> @@ -218,7 +218,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
>  	if (blk_tracer) {
>  		tracing_record_cmdline(current);
>  
> -		buffer = blk_tr->buffer;
> +		buffer = blk_tr->trace_buffer.buffer;
>  		pc = preempt_count();
>  		event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
>  						  sizeof(*t) + pdu_len,
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index e6effd0..2577082 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1068,7 +1068,7 @@ struct ftrace_func_probe {
>  	unsigned long		flags;
>  	unsigned long		ip;
>  	void			*data;
> -	struct rcu_head		rcu;
> +	struct list_head	free_list;
>  };
>  
>  struct ftrace_func_entry {
> @@ -2978,28 +2978,27 @@ static void __disable_ftrace_function_probe(void)
>  }
>  
> 
> -static void ftrace_free_entry_rcu(struct rcu_head *rhp)
> +static void ftrace_free_entry(struct ftrace_func_probe *entry)
>  {
> -	struct ftrace_func_probe *entry =
> -		container_of(rhp, struct ftrace_func_probe, rcu);
> -
>  	if (entry->ops->free)
> -		entry->ops->free(&entry->data);
> +		entry->ops->free(entry->ops, entry->ip, &entry->data);
>  	kfree(entry);
>  }
>  
> -
>  int
>  register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  			      void *data)
>  {
>  	struct ftrace_func_probe *entry;
> +	struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
> +	struct ftrace_hash *hash;
>  	struct ftrace_page *pg;
>  	struct dyn_ftrace *rec;
>  	int type, len, not;
>  	unsigned long key;
>  	int count = 0;
>  	char *search;
> +	int ret;
>  
>  	type = filter_parse_regex(glob, strlen(glob), &search, &not);
>  	len = strlen(search);
> @@ -3010,8 +3009,16 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  
>  	mutex_lock(&ftrace_lock);
>  
> -	if (unlikely(ftrace_disabled))
> +	hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
> +	if (!hash) {
> +		count = -ENOMEM;
> +		goto out_unlock;
> +	}
> +
> +	if (unlikely(ftrace_disabled)) {
> +		count = -ENODEV;
>  		goto out_unlock;
> +	}
>  
>  	do_for_each_ftrace_rec(pg, rec) {
>  
> @@ -3035,14 +3042,21 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  		 * for each function we find. We call the callback
>  		 * to give the caller an opportunity to do so.
>  		 */
> -		if (ops->callback) {
> -			if (ops->callback(rec->ip, &entry->data) < 0) {
> +		if (ops->init) {
> +			if (ops->init(ops, rec->ip, &entry->data) < 0) {
>  				/* caller does not like this func */
>  				kfree(entry);
>  				continue;
>  			}
>  		}
>  
> +		ret = enter_record(hash, rec, 0);
> +		if (ret < 0) {
> +			kfree(entry);
> +			count = ret;
> +			goto out_unlock;
> +		}
> +
>  		entry->ops = ops;
>  		entry->ip = rec->ip;
>  
> @@ -3050,10 +3064,16 @@ register_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  		hlist_add_head_rcu(&entry->node, &ftrace_func_hash[key]);
>  
>  	} while_for_each_ftrace_rec();
> +
> +	ret = ftrace_hash_move(&trace_probe_ops, 1, orig_hash, hash);
> +	if (ret < 0)
> +		count = ret;
> +
>  	__enable_ftrace_function_probe();
>  
>   out_unlock:
>  	mutex_unlock(&ftrace_lock);
> +	free_ftrace_hash(hash);
>  
>  	return count;
>  }
> @@ -3067,7 +3087,12 @@ static void
>  __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  				  void *data, int flags)
>  {
> +	struct ftrace_func_entry *rec_entry;
>  	struct ftrace_func_probe *entry;
> +	struct ftrace_func_probe *p;
> +	struct ftrace_hash **orig_hash = &trace_probe_ops.filter_hash;
> +	struct list_head free_list;
> +	struct ftrace_hash *hash;
>  	struct hlist_node *n, *tmp;
>  	char str[KSYM_SYMBOL_LEN];
>  	int type = MATCH_FULL;
> @@ -3088,6 +3113,14 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  	}
>  
>  	mutex_lock(&ftrace_lock);
> +
> +	hash = alloc_and_copy_ftrace_hash(FTRACE_HASH_DEFAULT_BITS, *orig_hash);
> +	if (!hash)
> +		/* Hmm, should report this somehow */
> +		goto out_unlock;
> +
> +	INIT_LIST_HEAD(&free_list);
> +
>  	for (i = 0; i < FTRACE_FUNC_HASHSIZE; i++) {
>  		struct hlist_head *hhd = &ftrace_func_hash[i];
>  
> @@ -3108,12 +3141,30 @@ __unregister_ftrace_function_probe(char *glob, struct ftrace_probe_ops *ops,
>  					continue;
>  			}
>  
> +			rec_entry = ftrace_lookup_ip(hash, entry->ip);
> +			/* It is possible more than one entry had this ip */
> +			if (rec_entry)
> +				free_hash_entry(hash, rec_entry);
> +
>  			hlist_del_rcu(&entry->node);
> -			call_rcu_sched(&entry->rcu, ftrace_free_entry_rcu);
> +			list_add(&entry->free_list, &free_list);
>  		}
>  	}
>  	__disable_ftrace_function_probe();
> +	/*
> +	 * Remove after the disable is called. Otherwise, if the last
> +	 * probe is removed, a null hash means *all enabled*.
> +	 */
> +	ftrace_hash_move(&trace_probe_ops, 1, orig_hash, hash);
> +	synchronize_sched();
> +	list_for_each_entry_safe(entry, p, &free_list, free_list) {
> +		list_del(&entry->free_list);
> +		ftrace_free_entry(entry);
> +	}
> +
> + out_unlock:
>  	mutex_unlock(&ftrace_lock);
> +	free_ftrace_hash(hash);
>  }
>  
>  void
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 7244acd..e5472f7 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -8,13 +8,16 @@
>  #include <linux/trace_clock.h>
>  #include <linux/trace_seq.h>
>  #include <linux/spinlock.h>
> +#include <linux/irq_work.h>
>  #include <linux/debugfs.h>
>  #include <linux/uaccess.h>
>  #include <linux/hardirq.h>
> +#include <linux/kthread.h>	/* for self test */
>  #include <linux/kmemcheck.h>
>  #include <linux/module.h>
>  #include <linux/percpu.h>
>  #include <linux/mutex.h>
> +#include <linux/delay.h>
>  #include <linux/slab.h>
>  #include <linux/init.h>
>  #include <linux/hash.h>
> @@ -442,6 +445,12 @@ int ring_buffer_print_page_header(struct trace_seq *s)
>  	return ret;
>  }
>  
> +struct rb_irq_work {
> +	struct irq_work			work;
> +	wait_queue_head_t		waiters;
> +	bool				waiters_pending;
> +};
> +
>  /*
>   * head_page == tail_page && head == tail then buffer is empty.
>   */
> @@ -476,6 +485,8 @@ struct ring_buffer_per_cpu {
>  	struct list_head		new_pages; /* new pages to add */
>  	struct work_struct		update_pages_work;
>  	struct completion		update_done;
> +
> +	struct rb_irq_work		irq_work;
>  };
>  
>  struct ring_buffer {
> @@ -495,6 +506,8 @@ struct ring_buffer {
>  	struct notifier_block		cpu_notify;
>  #endif
>  	u64				(*clock)(void);
> +
> +	struct rb_irq_work		irq_work;
>  };
>  
>  struct ring_buffer_iter {
> @@ -506,6 +519,118 @@ struct ring_buffer_iter {
>  	u64				read_stamp;
>  };
>  
> +/*
> + * rb_wake_up_waiters - wake up tasks waiting for ring buffer input
> + *
> + * This is the irq_work callback that wakes up any task blocked on the
> + * ring buffer waiters queue.
> + */
> +static void rb_wake_up_waiters(struct irq_work *work)
> +{
> +	struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, work);
> +
> +	wake_up_all(&rbwork->waiters);
> +}
> +
> +/**
> + * ring_buffer_wait - wait for input to the ring buffer
> + * @buffer: buffer to wait on
> + * @cpu: the cpu buffer to wait on
> + *
> + * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
> + * as data is added to any of the @buffer's cpu buffers. Otherwise
> + * it will wait for data to be added to a specific cpu buffer.
> + */
> +void ring_buffer_wait(struct ring_buffer *buffer, int cpu)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	DEFINE_WAIT(wait);
> +	struct rb_irq_work *work;
> +
> +	/*
> +	 * Depending on what the caller is waiting for (data in any
> +	 * cpu buffer, or in a specific buffer), put the caller on the
> +	 * appropriate wait queue.
> +	 */
> +	if (cpu == RING_BUFFER_ALL_CPUS)
> +		work = &buffer->irq_work;
> +	else {
> +		cpu_buffer = buffer->buffers[cpu];
> +		work = &cpu_buffer->irq_work;
> +	}
> +
> +
> +	prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
> +
> +	/*
> +	 * The events can happen in critical sections where
> +	 * checking a work queue can cause deadlocks.
> +	 * After adding a task to the queue, this flag is set
> +	 * only to notify events to try to wake up the queue
> +	 * using irq_work.
> +	 *
> +	 * We don't clear it even if the buffer is no longer
> +	 * empty. The flag only causes the next event to run
> +	 * irq_work to do the work queue wake up. The worst
> +	 * that can happen if we race with !trace_empty() is that
> +	 * an event will cause an irq_work to try to wake up
> +	 * an empty queue.
> +	 *
> +	 * There's no reason to protect this flag either, as
> +	 * the work queue and irq_work logic will do the necessary
> +	 * synchronization for the wake ups. The only thing
> +	 * that is necessary is that the wake up happens after
> +	 * a task has been queued. Spurious wake ups are OK.
> +	 */
> +	work->waiters_pending = true;
> +
> +	if ((cpu == RING_BUFFER_ALL_CPUS && ring_buffer_empty(buffer)) ||
> +	    (cpu != RING_BUFFER_ALL_CPUS && ring_buffer_empty_cpu(buffer, cpu)))
> +		schedule();
> +
> +	finish_wait(&work->waiters, &wait);
> +}
> +
> +/**
> + * ring_buffer_poll_wait - poll on buffer input
> + * @buffer: buffer to wait on
> + * @cpu: the cpu buffer to wait on
> + * @filp: the file descriptor
> + * @poll_table: The poll descriptor
> + *
> + * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
> + * as data is added to any of the @buffer's cpu buffers. Otherwise
> + * it will wait for data to be added to a specific cpu buffer.
> + *
> + * Returns POLLIN | POLLRDNORM if data exists in the buffers,
> + * zero otherwise.
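> + *
> + * A caller's poll file operation would typically just forward here,
> + * e.g. (sketch): return ring_buffer_poll_wait(buffer, cpu, filp, poll_table);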
> + */
> +int ring_buffer_poll_wait(struct ring_buffer *buffer, int cpu,
> +			  struct file *filp, poll_table *poll_table)
> +{
> +	struct ring_buffer_per_cpu *cpu_buffer;
> +	struct rb_irq_work *work;
> +
> +	if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
> +	    (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu)))
> +		return POLLIN | POLLRDNORM;
> +
> +	if (cpu == RING_BUFFER_ALL_CPUS)
> +		work = &buffer->irq_work;
> +	else {
> +		cpu_buffer = buffer->buffers[cpu];
> +		work = &cpu_buffer->irq_work;
> +	}
> +
> +	work->waiters_pending = true;
> +	poll_wait(filp, &work->waiters, poll_table);
> +
> +	if ((cpu == RING_BUFFER_ALL_CPUS && !ring_buffer_empty(buffer)) ||
> +	    (cpu != RING_BUFFER_ALL_CPUS && !ring_buffer_empty_cpu(buffer, cpu)))
> +		return POLLIN | POLLRDNORM;
> +	return 0;
> +}
> +
>  /* buffer may be either ring_buffer or ring_buffer_per_cpu */
>  #define RB_WARN_ON(b, cond)						\
>  	({								\
> @@ -1061,6 +1186,8 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
>  	cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
>  	INIT_WORK(&cpu_buffer->update_pages_work, update_pages_handler);
>  	init_completion(&cpu_buffer->update_done);
> +	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
> +	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
>  
>  	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>  			    GFP_KERNEL, cpu_to_node(cpu));
> @@ -1156,6 +1283,9 @@ struct ring_buffer *__ring_buffer_alloc(unsigned long size, unsigned flags,
>  	buffer->clock = trace_clock_local;
>  	buffer->reader_lock_key = key;
>  
> +	init_irq_work(&buffer->irq_work.work, rb_wake_up_waiters);
> +	init_waitqueue_head(&buffer->irq_work.waiters);
> +
>  	/* need at least two pages */
>  	if (nr_pages < 2)
>  		nr_pages = 2;
> @@ -1551,11 +1681,22 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
>  			if (!cpu_buffer->nr_pages_to_update)
>  				continue;
>  
> -			if (cpu_online(cpu))
> +			/* The update must run on the CPU that is being updated. */
> +			preempt_disable();
> +			if (cpu == smp_processor_id() || !cpu_online(cpu)) {
> +				rb_update_pages(cpu_buffer);
> +				cpu_buffer->nr_pages_to_update = 0;
> +			} else {
> +				/*
> +				 * Can not disable preemption for schedule_work_on()
> +				 * on PREEMPT_RT.
> +				 */
> +				preempt_enable();
>  				schedule_work_on(cpu,
>  						&cpu_buffer->update_pages_work);
> -			else
> -				rb_update_pages(cpu_buffer);
> +				preempt_disable();
> +			}
> +			preempt_enable();
>  		}
>  
>  		/* wait for all the updates to complete */
> @@ -1593,12 +1734,22 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size,
>  
>  		get_online_cpus();
>  
> -		if (cpu_online(cpu_id)) {
> +		preempt_disable();
> +		/* The update must run on the CPU that is being updated. */
> +		if (cpu_id == smp_processor_id() || !cpu_online(cpu_id))
> +			rb_update_pages(cpu_buffer);
> +		else {
> +			/*
> +			 * Can not disable preemption for schedule_work_on()
> +			 * on PREEMPT_RT.
> +			 */
> +			preempt_enable();
>  			schedule_work_on(cpu_id,
>  					 &cpu_buffer->update_pages_work);
>  			wait_for_completion(&cpu_buffer->update_done);
> -		} else
> -			rb_update_pages(cpu_buffer);
> +			preempt_disable();
> +		}
> +		preempt_enable();
>  
>  		cpu_buffer->nr_pages_to_update = 0;
>  		put_online_cpus();
> @@ -2610,6 +2761,22 @@ static void rb_commit(struct ring_buffer_per_cpu *cpu_buffer,
>  	rb_end_commit(cpu_buffer);
>  }
>  
> +static __always_inline void
> +rb_wakeups(struct ring_buffer *buffer, struct ring_buffer_per_cpu *cpu_buffer)
> +{
> +	if (buffer->irq_work.waiters_pending) {
> +		buffer->irq_work.waiters_pending = false;
> +		/* irq_work_queue() supplies its own memory barriers */
> +		irq_work_queue(&buffer->irq_work.work);
> +	}
> +
> +	if (cpu_buffer->irq_work.waiters_pending) {
> +		cpu_buffer->irq_work.waiters_pending = false;
> +		/* irq_work_queue() supplies its own memory barriers */
> +		irq_work_queue(&cpu_buffer->irq_work.work);
> +	}
> +}
> +
>  /**
>   * ring_buffer_unlock_commit - commit a reserved
>   * @buffer: The buffer to commit to
> @@ -2629,6 +2796,8 @@ int ring_buffer_unlock_commit(struct ring_buffer *buffer,
>  
>  	rb_commit(cpu_buffer, event);
>  
> +	rb_wakeups(buffer, cpu_buffer);
> +
>  	trace_recursive_unlock();
>  
>  	preempt_enable_notrace();
> @@ -2801,6 +2970,8 @@ int ring_buffer_write(struct ring_buffer *buffer,
>  
>  	rb_commit(cpu_buffer, event);
>  
> +	rb_wakeups(buffer, cpu_buffer);
> +
>  	ret = 0;
>   out:
>  	preempt_enable_notrace();
> @@ -4465,3 +4636,320 @@ static int rb_cpu_notify(struct notifier_block *self,
>  	return NOTIFY_OK;
>  }
>  #endif
> +
> +#ifdef CONFIG_RING_BUFFER_STARTUP_TEST
> +/*
> + * This is a basic integrity check of the ring buffer.
> + * Late in the boot cycle this test will run when configured in.
> + * It will kick off a thread per CPU that will go into a loop
> + * writing to the per cpu ring buffer various sizes of data.
> + * Some of the data will be large items, some small.
> + *
> + * Another thread is created that goes into a spin, sending out
> + * IPIs to the other CPUs to also write into the ring buffer.
> + * This is to test the nesting ability of the buffer.
> + *
> + * Basic stats are recorded and reported. If something in the
> + * ring buffer should happen that's not expected, a big warning
> + * is displayed and all ring buffers are disabled.
> + */
> +static struct task_struct *rb_threads[NR_CPUS] __initdata;
> +
> +struct rb_test_data {
> +	struct ring_buffer	*buffer;
> +	unsigned long		events;
> +	unsigned long		bytes_written;
> +	unsigned long		bytes_alloc;
> +	unsigned long		bytes_dropped;
> +	unsigned long		events_nested;
> +	unsigned long		bytes_written_nested;
> +	unsigned long		bytes_alloc_nested;
> +	unsigned long		bytes_dropped_nested;
> +	int			min_size_nested;
> +	int			max_size_nested;
> +	int			max_size;
> +	int			min_size;
> +	int			cpu;
> +	int			cnt;
> +};
> +
> +static struct rb_test_data rb_data[NR_CPUS] __initdata;
> +
> +/* 1 meg per cpu */
> +#define RB_TEST_BUFFER_SIZE	1048576
> +
> +static char rb_string[] __initdata =
> +	"abcdefghijklmnopqrstuvwxyz1234567890!@...^&*()?+\\"
> +	"?+|:';\",.<>/?abcdefghijklmnopqrstuvwxyz1234567890"
> +	"!@...^&*()?+\\?+|:';\",.<>/?abcdefghijklmnopqrstuv";
> +
> +static bool rb_test_started __initdata;
> +
> +struct rb_item {
> +	int size;
> +	char str[];
> +};
> +
> +static __init int rb_write_something(struct rb_test_data *data, bool nested)
> +{
> +	struct ring_buffer_event *event;
> +	struct rb_item *item;
> +	bool started;
> +	int event_len;
> +	int size;
> +	int len;
> +	int cnt;
> +
> +	/* Have nested writes different than what is written */
> +	cnt = data->cnt + (nested ? 27 : 0);
> +
> +	/* Multiply cnt by ~e, to make some unique increment */
> +	size = (cnt * 68 / 25) % (sizeof(rb_string) - 1);
> +
> +	len = size + sizeof(struct rb_item);
> +
> +	started = rb_test_started;
> +	/* read rb_test_started before checking buffer enabled */
> +	smp_rmb();
> +
> +	event = ring_buffer_lock_reserve(data->buffer, len);
> +	if (!event) {
> +		/* Ignore dropped events before test starts. */
> +		if (started) {
> +			if (nested)
> +				data->bytes_dropped_nested += len;
> +			else
> +				data->bytes_dropped += len;
> +		}
> +		return len;
> +	}
> +
> +	event_len = ring_buffer_event_length(event);
> +
> +	if (RB_WARN_ON(data->buffer, event_len < len))
> +		goto out;
> +
> +	item = ring_buffer_event_data(event);
> +	item->size = size;
> +	memcpy(item->str, rb_string, size);
> +
> +	if (nested) {
> +		data->bytes_alloc_nested += event_len;
> +		data->bytes_written_nested += len;
> +		data->events_nested++;
> +		if (!data->min_size_nested || len < data->min_size_nested)
> +			data->min_size_nested = len;
> +		if (len > data->max_size_nested)
> +			data->max_size_nested = len;
> +	} else {
> +		data->bytes_alloc += event_len;
> +		data->bytes_written += len;
> +		data->events++;
> +		if (!data->min_size || len < data->min_size)
> +			data->min_size = len;
> +		if (len > data->max_size)
> +			data->max_size = len;
> +	}
> +
> + out:
> +	ring_buffer_unlock_commit(data->buffer, event);
> +
> +	return 0;
> +}
> +
> +static __init int rb_test(void *arg)
> +{
> +	struct rb_test_data *data = arg;
> +
> +	while (!kthread_should_stop()) {
> +		rb_write_something(data, false);
> +		data->cnt++;
> +
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		/* Now sleep between a min of 100-300us and a max of 1ms */
> +		usleep_range(((data->cnt % 3) + 1) * 100, 1000);
> +	}
> +
> +	return 0;
> +}
> +
> +static __init void rb_ipi(void *ignore)
> +{
> +	struct rb_test_data *data;
> +	int cpu = smp_processor_id();
> +
> +	data = &rb_data[cpu];
> +	rb_write_something(data, true);
> +}
> +
> +static __init int rb_hammer_test(void *arg)
> +{
> +	while (!kthread_should_stop()) {
> +
> +		/* Send an IPI to all cpus to write data! */
> +		smp_call_function(rb_ipi, NULL, 1);
> +		/* No sleep, but for non preempt, let others run */
> +		schedule();
> +	}
> +
> +	return 0;
> +}
> +
> +static __init int test_ringbuffer(void)
> +{
> +	struct task_struct *rb_hammer;
> +	struct ring_buffer *buffer;
> +	int cpu;
> +	int ret = 0;
> +
> +	pr_info("Running ring buffer tests...\n");
> +
> +	buffer = ring_buffer_alloc(RB_TEST_BUFFER_SIZE, RB_FL_OVERWRITE);
> +	if (WARN_ON(!buffer))
> +		return 0;
> +
> +	/* Disable buffer so that threads can't write to it yet */
> +	ring_buffer_record_off(buffer);
> +
> +	for_each_online_cpu(cpu) {
> +		rb_data[cpu].buffer = buffer;
> +		rb_data[cpu].cpu = cpu;
> +		rb_data[cpu].cnt = cpu;
> +		rb_threads[cpu] = kthread_create(rb_test, &rb_data[cpu],
> +						 "rbtester/%d", cpu);
> +		if (WARN_ON(!rb_threads[cpu])) {
> +			pr_cont("FAILED\n");
> +			ret = -1;
> +			goto out_free;
> +		}
> +
> +		kthread_bind(rb_threads[cpu], cpu);
> + 		wake_up_process(rb_threads[cpu]);
> +	}
> +
> +	/* Now create the rb hammer! */
> +	rb_hammer = kthread_run(rb_hammer_test, NULL, "rbhammer");
> +	if (WARN_ON(!rb_hammer)) {
> +		pr_cont("FAILED\n");
> +		ret = -1;
> +		goto out_free;
> +	}
> +
> +	ring_buffer_record_on(buffer);
> +	/*
> +	 * Show buffer is enabled before setting rb_test_started.
> +	 * Yes there's a small race window where events could be
> +	 * dropped and the thread won't catch it. But when a ring
> +	 * buffer gets enabled, there will always be some kind of
> +	 * delay before other CPUs see it. Thus, we don't care about
> +	 * those dropped events. We care about events dropped after
> +	 * the threads see that the buffer is active.
> +	 */
> +	smp_wmb();
> +	rb_test_started = true;
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	/* Just run for 10 seconds */
> +	schedule_timeout(10 * HZ);
> +
> +	kthread_stop(rb_hammer);
> +
> + out_free:
> +	for_each_online_cpu(cpu) {
> +		if (!rb_threads[cpu])
> +			break;
> +		kthread_stop(rb_threads[cpu]);
> +	}
> +	if (ret) {
> +		ring_buffer_free(buffer);
> +		return ret;
> +	}
> +
> +	/* Report! */
> +	pr_info("finished\n");
> +	for_each_online_cpu(cpu) {
> +		struct ring_buffer_event *event;
> +		struct rb_test_data *data = &rb_data[cpu];
> +		struct rb_item *item;
> +		unsigned long total_events;
> +		unsigned long total_dropped;
> +		unsigned long total_written;
> +		unsigned long total_alloc;
> +		unsigned long total_read = 0;
> +		unsigned long total_size = 0;
> +		unsigned long total_len = 0;
> +		unsigned long total_lost = 0;
> +		unsigned long lost;
> +		int big_event_size;
> +		int small_event_size;
> +
> +		ret = -1;
> +
> +		total_events = data->events + data->events_nested;
> +		total_written = data->bytes_written + data->bytes_written_nested;
> +		total_alloc = data->bytes_alloc + data->bytes_alloc_nested;
> +		total_dropped = data->bytes_dropped + data->bytes_dropped_nested;
> +
> +		big_event_size = data->max_size + data->max_size_nested;
> +		small_event_size = data->min_size + data->min_size_nested;
> +
> +		pr_info("CPU %d:\n", cpu);
> +		pr_info("              events:    %ld\n", total_events);
> +		pr_info("       dropped bytes:    %ld\n", total_dropped);
> +		pr_info("       alloced bytes:    %ld\n", total_alloc);
> +		pr_info("       written bytes:    %ld\n", total_written);
> +		pr_info("       biggest event:    %d\n", big_event_size);
> +		pr_info("      smallest event:    %d\n", small_event_size);
> +
> +		if (RB_WARN_ON(buffer, total_dropped))
> +			break;
> +
> +		ret = 0;
> +
> +		while ((event = ring_buffer_consume(buffer, cpu, NULL, &lost))) {
> +			total_lost += lost;
> +			item = ring_buffer_event_data(event);
> +			total_len += ring_buffer_event_length(event);
> +			total_size += item->size + sizeof(struct rb_item);
> +			if (memcmp(&item->str[0], rb_string, item->size) != 0) {
> +				pr_info("FAILED!\n");
> +				pr_info("buffer had: %.*s\n", item->size, item->str);
> +				pr_info("expected:   %.*s\n", item->size, rb_string);
> +				RB_WARN_ON(buffer, 1);
> +				ret = -1;
> +				break;
> +			}
> +			total_read++;
> +		}
> +		if (ret)
> +			break;
> +
> +		ret = -1;
> +
> +		pr_info("         read events:   %ld\n", total_read);
> +		pr_info("         lost events:   %ld\n", total_lost);
> +		pr_info("        total events:   %ld\n", total_lost + total_read);
> +		pr_info("  recorded len bytes:   %ld\n", total_len);
> +		pr_info(" recorded size bytes:   %ld\n", total_size);
> +		if (total_lost)
> +			pr_info(" With dropped events, record len and size may not match\n"
> +				" alloced and written from above\n");
> +		if (!total_lost) {
> +			if (RB_WARN_ON(buffer, total_len != total_alloc ||
> +				       total_size != total_written))
> +				break;
> +		}
> +		if (RB_WARN_ON(buffer, total_lost + total_read != total_events))
> +			break;
> +
> +		ret = 0;
> +	}
> +	if (!ret)
> +		pr_info("Ring buffer PASSED!\n");
> +
> +	ring_buffer_free(buffer);
> +	return 0;
> +}
> +
> +late_initcall(test_ringbuffer);
> +#endif /* CONFIG_RING_BUFFER_STARTUP_TEST */
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 4f1dade..829b2be 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1,7 +1,7 @@
>  /*
>   * ring buffer based function tracer
>   *
> - * Copyright (C) 2007-2008 Steven Rostedt <srostedt@...hat.com>
> + * Copyright (C) 2007-2012 Steven Rostedt <srostedt@...hat.com>
>   * Copyright (C) 2008 Ingo Molnar <mingo@...hat.com>
>   *
>   * Originally taken from the RT patch by:
> @@ -19,7 +19,6 @@
>  #include <linux/seq_file.h>
>  #include <linux/notifier.h>
>  #include <linux/irqflags.h>
> -#include <linux/irq_work.h>
>  #include <linux/debugfs.h>
>  #include <linux/pagemap.h>
>  #include <linux/hardirq.h>
> @@ -48,7 +47,7 @@
>   * On boot up, the ring buffer is set to the minimum size, so that
>   * we do not waste memory on systems that are not using tracing.
>   */
> -int ring_buffer_expanded;
> +bool ring_buffer_expanded;
>  
>  /*
>   * We need to change this state when a selftest is running.
> @@ -87,14 +86,6 @@ static int dummy_set_flag(u32 old_flags, u32 bit, int set)
>  static DEFINE_PER_CPU(bool, trace_cmdline_save);
>  
>  /*
> - * When a reader is waiting for data, then this variable is
> - * set to true.
> - */
> -static bool trace_wakeup_needed;
> -
> -static struct irq_work trace_work_wakeup;
> -
> -/*
>   * Kill all tracing for good (never come back).
>   * It is initialized to 1 but will turn to zero if the initialization
>   * of the tracer is successful. But that is the only place that sets
> @@ -130,12 +121,14 @@ static int tracing_set_tracer(const char *buf);
>  static char bootup_tracer_buf[MAX_TRACER_SIZE] __initdata;
>  static char *default_bootup_tracer;
>  
> +static bool allocate_snapshot;
> +
>  static int __init set_cmdline_ftrace(char *str)
>  {
>  	strncpy(bootup_tracer_buf, str, MAX_TRACER_SIZE);
>  	default_bootup_tracer = bootup_tracer_buf;
>  	/* We are using ftrace early, expand it */
> -	ring_buffer_expanded = 1;
> +	ring_buffer_expanded = true;
>  	return 1;
>  }
>  __setup("ftrace=", set_cmdline_ftrace);
> @@ -156,6 +149,15 @@ static int __init set_ftrace_dump_on_oops(char *str)
>  }
>  __setup("ftrace_dump_on_oops", set_ftrace_dump_on_oops);
>  
> +static int __init boot_alloc_snapshot(char *str)
> +{
> +	allocate_snapshot = true;
> +	/* We also need the main ring buffer expanded */
> +	ring_buffer_expanded = true;
> +	return 1;
> +}
> +__setup("alloc_snapshot", boot_alloc_snapshot);
> +
>  
>  static char trace_boot_options_buf[MAX_TRACER_SIZE] __initdata;
>  static char *trace_boot_options __initdata;
> @@ -189,7 +191,7 @@ unsigned long long ns2usecs(cycle_t nsec)
>   */
>  static struct trace_array	global_trace;
>  
> -static DEFINE_PER_CPU(struct trace_array_cpu, global_trace_cpu);
> +LIST_HEAD(ftrace_trace_arrays);
>  
>  int filter_current_check_discard(struct ring_buffer *buffer,
>  				 struct ftrace_event_call *call, void *rec,
> @@ -204,29 +206,15 @@ cycle_t ftrace_now(int cpu)
>  	u64 ts;
>  
>  	/* Early boot up does not have a buffer yet */
> -	if (!global_trace.buffer)
> +	if (!global_trace.trace_buffer.buffer)
>  		return trace_clock_local();
>  
> -	ts = ring_buffer_time_stamp(global_trace.buffer, cpu);
> -	ring_buffer_normalize_time_stamp(global_trace.buffer, cpu, &ts);
> +	ts = ring_buffer_time_stamp(global_trace.trace_buffer.buffer, cpu);
> +	ring_buffer_normalize_time_stamp(global_trace.trace_buffer.buffer, cpu, &ts);
>  
>  	return ts;
>  }
>  
> -/*
> - * The max_tr is used to snapshot the global_trace when a maximum
> - * latency is reached. Some tracers will use this to store a maximum
> - * trace while it continues examining live traces.
> - *
> - * The buffers for the max_tr are set up the same as the global_trace.
> - * When a snapshot is taken, the link list of the max_tr is swapped
> - * with the link list of the global_trace and the buffers are reset for
> - * the global_trace so the tracing can continue.
> - */
> -static struct trace_array	max_tr;
> -
> -static DEFINE_PER_CPU(struct trace_array_cpu, max_tr_data);
> -
>  int tracing_is_enabled(void)
>  {
>  	return tracing_is_on();
> @@ -249,9 +237,6 @@ static unsigned long		trace_buf_size = TRACE_BUF_SIZE_DEFAULT;
>  /* trace_types holds a link list of available tracers. */
>  static struct tracer		*trace_types __read_mostly;
>  
> -/* current_trace points to the tracer that is currently active */
> -static struct tracer		*current_trace __read_mostly = &nop_trace;
> -
>  /*
>   * trace_types_lock is used to protect the trace_types list.
>   */
> @@ -285,13 +270,13 @@ static DEFINE_PER_CPU(struct mutex, cpu_access_lock);
>  
>  static inline void trace_access_lock(int cpu)
>  {
> -	if (cpu == TRACE_PIPE_ALL_CPU) {
> +	if (cpu == RING_BUFFER_ALL_CPUS) {
>  		/* gain it for accessing the whole ring buffer. */
>  		down_write(&all_cpu_access_lock);
>  	} else {
>  		/* gain it for accessing a cpu ring buffer. */
>  
> -		/* Firstly block other trace_access_lock(TRACE_PIPE_ALL_CPU). */
> +		/* Firstly block other trace_access_lock(RING_BUFFER_ALL_CPUS). */
>  		down_read(&all_cpu_access_lock);
>  
>  		/* Secondly block other access to this @cpu ring buffer. */
> @@ -301,7 +286,7 @@ static inline void trace_access_lock(int cpu)
>  
>  static inline void trace_access_unlock(int cpu)
>  {
> -	if (cpu == TRACE_PIPE_ALL_CPU) {
> +	if (cpu == RING_BUFFER_ALL_CPUS) {
>  		up_write(&all_cpu_access_lock);
>  	} else {
>  		mutex_unlock(&per_cpu(cpu_access_lock, cpu));
> @@ -339,30 +324,11 @@ static inline void trace_access_lock_init(void)
>  
>  #endif
>  
> -/* trace_wait is a waitqueue for tasks blocked on trace_poll */
> -static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> -
>  /* trace_flags holds trace_options default values */
>  unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>  	TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
>  	TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE |
> -	TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS;
> -
> -static int trace_stop_count;
> -static DEFINE_RAW_SPINLOCK(tracing_start_lock);
> -
> -/**
> - * trace_wake_up - wake up tasks waiting for trace input
> - *
> - * Schedules a delayed work to wake up any task that is blocked on the
> - * trace_wait queue. These is used with trace_poll for tasks polling the
> - * trace.
> - */
> -static void trace_wake_up(struct irq_work *work)
> -{
> -	wake_up_all(&trace_wait);
> -
> -}
> +	TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS | TRACE_ITER_FUNCTION;
>  
>  /**
>   * tracing_on - enable tracing buffers
> @@ -372,8 +338,8 @@ static void trace_wake_up(struct irq_work *work)
>   */
>  void tracing_on(void)
>  {
> -	if (global_trace.buffer)
> -		ring_buffer_record_on(global_trace.buffer);
> +	if (global_trace.trace_buffer.buffer)
> +		ring_buffer_record_on(global_trace.trace_buffer.buffer);
>  	/*
>  	 * This flag is only looked at when buffers haven't been
>  	 * allocated yet. We don't really care about the race
> @@ -385,6 +351,196 @@ void tracing_on(void)
>  EXPORT_SYMBOL_GPL(tracing_on);
>  
>  /**
> + * __trace_puts - write a constant string into the trace buffer.
> + * @ip:	   The address of the caller
> + * @str:   The constant string to write
> + * @size:  The size of the string.
> + */
> +int __trace_puts(unsigned long ip, const char *str, int size)
> +{
> +	struct ring_buffer_event *event;
> +	struct ring_buffer *buffer;
> +	struct print_entry *entry;
> +	unsigned long irq_flags;
> +	int alloc;
> +
> +	alloc = sizeof(*entry) + size + 2; /* possible \n added */
> +
> +	local_save_flags(irq_flags);
> +	buffer = global_trace.trace_buffer.buffer;
> +	event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, alloc,
> +					  irq_flags, preempt_count());
> +	if (!event)
> +		return 0;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->ip = ip;
> +
> +	memcpy(&entry->buf, str, size);
> +
> +	/* Add a newline if necessary */
> +	if (entry->buf[size - 1] != '\n') {
> +		entry->buf[size] = '\n';
> +		entry->buf[size + 1] = '\0';
> +	} else
> +		entry->buf[size] = '\0';
> +
> +	__buffer_unlock_commit(buffer, event);
> +
> +	return size;
> +}
> +EXPORT_SYMBOL_GPL(__trace_puts);
> +
> +/**
> + * __trace_bputs - write the pointer to a constant string into the trace buffer
> + * @ip:	   The address of the caller
> + * @str:   The constant string to write into the buffer
> + */
> +int __trace_bputs(unsigned long ip, const char *str)
> +{
> +	struct ring_buffer_event *event;
> +	struct ring_buffer *buffer;
> +	struct bputs_entry *entry;
> +	unsigned long irq_flags;
> +	int size = sizeof(struct bputs_entry);
> +
> +	local_save_flags(irq_flags);
> +	buffer = global_trace.trace_buffer.buffer;
> +	event = trace_buffer_lock_reserve(buffer, TRACE_BPUTS, size,
> +					  irq_flags, preempt_count());
> +	if (!event)
> +		return 0;
> +
> +	entry = ring_buffer_event_data(event);
> +	entry->ip			= ip;
> +	entry->str			= str;
> +
> +	__buffer_unlock_commit(buffer, event);
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(__trace_bputs);
> +
> +#ifdef CONFIG_TRACER_SNAPSHOT
> +/**
> + * tracing_snapshot - take a snapshot of the current buffer.
> + *
> + * This causes a swap between the snapshot buffer and the current live
> + * tracing buffer. You can use this to take snapshots of the live
> + * trace when some condition is triggered, but continue to trace.
> + *
> + * Note, make sure to allocate the snapshot with either
> + * tracing_snapshot_alloc(), or by doing it manually
> + * with: echo 1 > /sys/kernel/debug/tracing/snapshot
> + *
> + * If the snapshot buffer is not allocated, this will stop tracing,
> + * basically making a permanent snapshot.
> + */
> +void tracing_snapshot(void)
> +{
> +	struct trace_array *tr = &global_trace;
> +	struct tracer *tracer = tr->current_trace;
> +	unsigned long flags;
> +
> +	if (in_nmi()) {
> +		internal_trace_puts("*** SNAPSHOT CALLED FROM NMI CONTEXT ***\n");
> +		internal_trace_puts("*** snapshot is being ignored        ***\n");
> +		return;
> +	}
> +
> +	if (!tr->allocated_snapshot) {
> +		internal_trace_puts("*** SNAPSHOT NOT ALLOCATED ***\n");
> +		internal_trace_puts("*** stopping trace here!   ***\n");
> +		tracing_off();
> +		return;
> +	}
> +
> +	/* Note, snapshot can not be used when the tracer uses it */
> +	if (tracer->use_max_tr) {
> +		internal_trace_puts("*** LATENCY TRACER ACTIVE ***\n");
> +		internal_trace_puts("*** Can not use snapshot (sorry) ***\n");
> +		return;
> +	}
> +
> +	local_irq_save(flags);
> +	update_max_tr(tr, current, smp_processor_id());
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(tracing_snapshot);
> +
> +static int resize_buffer_duplicate_size(struct trace_buffer *trace_buf,
> +					struct trace_buffer *size_buf, int cpu_id);
> +static void set_buffer_entries(struct trace_buffer *buf, unsigned long val);
> +
> +static int alloc_snapshot(struct trace_array *tr)
> +{
> +	int ret;
> +
> +	if (!tr->allocated_snapshot) {
> +
> +		/* allocate spare buffer */
> +		ret = resize_buffer_duplicate_size(&tr->max_buffer,
> +				   &tr->trace_buffer, RING_BUFFER_ALL_CPUS);
> +		if (ret < 0)
> +			return ret;
> +
> +		tr->allocated_snapshot = true;
> +	}
> +
> +	return 0;
> +}
> +
> +void free_snapshot(struct trace_array *tr)
> +{
> +	/*
> +	 * We don't free the ring buffer; instead, we resize it because
> +	 * the max_tr ring buffer has some state (e.g. ring->clock) and
> +	 * we want to preserve it.
> +	 */
> +	ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
> +	set_buffer_entries(&tr->max_buffer, 1);
> +	tracing_reset_online_cpus(&tr->max_buffer);
> +	tr->allocated_snapshot = false;
> +}
> +
> +/**
> + * tracing_snapshot_alloc - allocate and take a snapshot of the current buffer.
> + *
> + * This is similar to tracing_snapshot(), but it will allocate the
> + * snapshot buffer if it isn't already allocated. Use this only
> + * where it is safe to sleep, as the allocation may sleep.
> + *
> + * This causes a swap between the snapshot buffer and the current live
> + * tracing buffer. You can use this to take snapshots of the live
> + * trace when some condition is triggered, but continue to trace.
> + */
> +void tracing_snapshot_alloc(void)
> +{
> +	struct trace_array *tr = &global_trace;
> +	int ret;
> +
> +	ret = alloc_snapshot(tr);
> +	if (WARN_ON(ret < 0))
> +		return;
> +
> +	tracing_snapshot();
> +}
> +EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
> +#else
> +void tracing_snapshot(void)
> +{
> +	WARN_ONCE(1, "Snapshot feature not enabled, but internal snapshot used");
> +}
> +EXPORT_SYMBOL_GPL(tracing_snapshot);
> +void tracing_snapshot_alloc(void)
> +{
> +	/* Give warning */
> +	tracing_snapshot();
> +}
> +EXPORT_SYMBOL_GPL(tracing_snapshot_alloc);
> +#endif /* CONFIG_TRACER_SNAPSHOT */
> +
> +/**
>   * tracing_off - turn off tracing buffers
>   *
>   * This function stops the tracing buffers from recording data.
> @@ -394,8 +550,8 @@ EXPORT_SYMBOL_GPL(tracing_on);
>   */
>  void tracing_off(void)
>  {
> -	if (global_trace.buffer)
> -		ring_buffer_record_off(global_trace.buffer);
> +	if (global_trace.trace_buffer.buffer)
> +		ring_buffer_record_off(global_trace.trace_buffer.buffer);
>  	/*
>  	 * This flag is only looked at when buffers haven't been
>  	 * allocated yet. We don't really care about the race
> @@ -411,8 +567,8 @@ EXPORT_SYMBOL_GPL(tracing_off);
>   */
>  int tracing_is_on(void)
>  {
> -	if (global_trace.buffer)
> -		return ring_buffer_record_is_on(global_trace.buffer);
> +	if (global_trace.trace_buffer.buffer)
> +		return ring_buffer_record_is_on(global_trace.trace_buffer.buffer);
>  	return !global_trace.buffer_disabled;
>  }
>  EXPORT_SYMBOL_GPL(tracing_is_on);
> @@ -479,6 +635,7 @@ static const char *trace_options[] = {
>  	"disable_on_free",
>  	"irq-info",
>  	"markers",
> +	"function-trace",
>  	NULL
>  };
>  
> @@ -490,6 +647,8 @@ static struct {
>  	{ trace_clock_local,	"local",	1 },
>  	{ trace_clock_global,	"global",	1 },
>  	{ trace_clock_counter,	"counter",	0 },
> +	{ trace_clock_jiffies,	"uptime",	1 },
> +	{ trace_clock,		"perf",		1 },
>  	ARCH_TRACE_CLOCKS
>  };
>  
> @@ -670,13 +829,14 @@ unsigned long __read_mostly	tracing_max_latency;
>  static void
>  __update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  {
> -	struct trace_array_cpu *data = tr->data[cpu];
> -	struct trace_array_cpu *max_data;
> +	struct trace_buffer *trace_buf = &tr->trace_buffer;
> +	struct trace_buffer *max_buf = &tr->max_buffer;
> +	struct trace_array_cpu *data = per_cpu_ptr(trace_buf->data, cpu);
> +	struct trace_array_cpu *max_data = per_cpu_ptr(max_buf->data, cpu);
>  
> -	max_tr.cpu = cpu;
> -	max_tr.time_start = data->preempt_timestamp;
> +	max_buf->cpu = cpu;
> +	max_buf->time_start = data->preempt_timestamp;
>  
> -	max_data = max_tr.data[cpu];
>  	max_data->saved_latency = tracing_max_latency;
>  	max_data->critical_start = data->critical_start;
>  	max_data->critical_end = data->critical_end;
> @@ -706,22 +866,22 @@ update_max_tr(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  {
>  	struct ring_buffer *buf;
>  
> -	if (trace_stop_count)
> +	if (tr->stop_count)
>  		return;
>  
>  	WARN_ON_ONCE(!irqs_disabled());
>  
> -	if (!current_trace->allocated_snapshot) {
> +	if (!tr->allocated_snapshot) {
>  		/* Only the nop tracer should hit this when disabling */
> -		WARN_ON_ONCE(current_trace != &nop_trace);
> +		WARN_ON_ONCE(tr->current_trace != &nop_trace);
>  		return;
>  	}
>  
>  	arch_spin_lock(&ftrace_max_lock);
>  
> -	buf = tr->buffer;
> -	tr->buffer = max_tr.buffer;
> -	max_tr.buffer = buf;
> +	buf = tr->trace_buffer.buffer;
> +	tr->trace_buffer.buffer = tr->max_buffer.buffer;
> +	tr->max_buffer.buffer = buf;
>  
>  	__update_max_tr(tr, tsk, cpu);
>  	arch_spin_unlock(&ftrace_max_lock);
> @@ -740,16 +900,16 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  {
>  	int ret;
>  
> -	if (trace_stop_count)
> +	if (tr->stop_count)
>  		return;
>  
>  	WARN_ON_ONCE(!irqs_disabled());
> -	if (WARN_ON_ONCE(!current_trace->allocated_snapshot))
> +	if (WARN_ON_ONCE(!tr->allocated_snapshot))
>  		return;
>  
>  	arch_spin_lock(&ftrace_max_lock);
>  
> -	ret = ring_buffer_swap_cpu(max_tr.buffer, tr->buffer, cpu);
> +	ret = ring_buffer_swap_cpu(tr->max_buffer.buffer, tr->trace_buffer.buffer, cpu);
>  
>  	if (ret == -EBUSY) {
>  		/*
> @@ -758,7 +918,7 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  		 * the max trace buffer (no one writes directly to it)
>  		 * and flag that it failed.
>  		 */
> -		trace_array_printk(&max_tr, _THIS_IP_,
> +		trace_array_printk_buf(tr->max_buffer.buffer, _THIS_IP_,
>  			"Failed to swap buffers due to commit in progress\n");
>  	}
>  
> @@ -771,37 +931,78 @@ update_max_tr_single(struct trace_array *tr, struct task_struct *tsk, int cpu)
>  
>  static void default_wait_pipe(struct trace_iterator *iter)
>  {
> -	DEFINE_WAIT(wait);
> +	/* Iterators are static, they should be filled or empty */
> +	if (trace_buffer_iter(iter, iter->cpu_file))
> +		return;
> +
> +	ring_buffer_wait(iter->trace_buffer->buffer, iter->cpu_file);
> +}
> +
> +#ifdef CONFIG_FTRACE_STARTUP_TEST
> +static int run_tracer_selftest(struct tracer *type)
> +{
> +	struct trace_array *tr = &global_trace;
> +	struct tracer *saved_tracer = tr->current_trace;
> +	int ret;
>  
> -	prepare_to_wait(&trace_wait, &wait, TASK_INTERRUPTIBLE);
> +	if (!type->selftest || tracing_selftest_disabled)
> +		return 0;
>  
>  	/*
> -	 * The events can happen in critical sections where
> -	 * checking a work queue can cause deadlocks.
> -	 * After adding a task to the queue, this flag is set
> -	 * only to notify events to try to wake up the queue
> -	 * using irq_work.
> -	 *
> -	 * We don't clear it even if the buffer is no longer
> -	 * empty. The flag only causes the next event to run
> -	 * irq_work to do the work queue wake up. The worse
> -	 * that can happen if we race with !trace_empty() is that
> -	 * an event will cause an irq_work to try to wake up
> -	 * an empty queue.
> -	 *
> -	 * There's no reason to protect this flag either, as
> -	 * the work queue and irq_work logic will do the necessary
> -	 * synchronization for the wake ups. The only thing
> -	 * that is necessary is that the wake up happens after
> -	 * a task has been queued. It's OK for spurious wake ups.
> +	 * Run a selftest on this tracer.
> +	 * Here we reset the trace buffer, and set the current
> +	 * tracer to be this tracer. The tracer can then run some
> +	 * internal tracing to verify that everything is in order.
> +	 * If we fail, we do not register this tracer.
>  	 */
> -	trace_wakeup_needed = true;
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  
> -	if (trace_empty(iter))
> -		schedule();
> +	tr->current_trace = type;
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	if (type->use_max_tr) {
> +		/* If we expanded the buffers, make sure the max is expanded too */
> +		if (ring_buffer_expanded)
> +			ring_buffer_resize(tr->max_buffer.buffer, trace_buf_size,
> +					   RING_BUFFER_ALL_CPUS);
> +		tr->allocated_snapshot = true;
> +	}
> +#endif
> +
> +	/* the test is responsible for initializing and enabling */
> +	pr_info("Testing tracer %s: ", type->name);
> +	ret = type->selftest(type, tr);
> +	/* the test is responsible for resetting too */
> +	tr->current_trace = saved_tracer;
> +	if (ret) {
> +		printk(KERN_CONT "FAILED!\n");
> +		/* Add the warning after printing 'FAILED' */
> +		WARN_ON(1);
> +		return -1;
> +	}
> +	/* Only reset on passing, to avoid touching corrupted buffers */
> +	tracing_reset_online_cpus(&tr->trace_buffer);
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	if (type->use_max_tr) {
> +		tr->allocated_snapshot = false;
>  
> -	finish_wait(&trace_wait, &wait);
> +		/* Shrink the max buffer again */
> +		if (ring_buffer_expanded)
> +			ring_buffer_resize(tr->max_buffer.buffer, 1,
> +					   RING_BUFFER_ALL_CPUS);
> +	}
> +#endif
> +
> +	printk(KERN_CONT "PASSED\n");
> +	return 0;
> +}
> +#else
> +static inline int run_tracer_selftest(struct tracer *type)
> +{
> +	return 0;
>  }
> +#endif /* CONFIG_FTRACE_STARTUP_TEST */
>  
>  /**
>   * register_tracer - register a tracer with the ftrace system.
> @@ -848,57 +1049,9 @@ int register_tracer(struct tracer *type)
>  	if (!type->wait_pipe)
>  		type->wait_pipe = default_wait_pipe;
>  
> -
> -#ifdef CONFIG_FTRACE_STARTUP_TEST
> -	if (type->selftest && !tracing_selftest_disabled) {
> -		struct tracer *saved_tracer = current_trace;
> -		struct trace_array *tr = &global_trace;
> -
> -		/*
> -		 * Run a selftest on this tracer.
> -		 * Here we reset the trace buffer, and set the current
> -		 * tracer to be this tracer. The tracer can then run some
> -		 * internal tracing to verify that everything is in order.
> -		 * If we fail, we do not register this tracer.
> -		 */
> -		tracing_reset_online_cpus(tr);
> -
> -		current_trace = type;
> -
> -		if (type->use_max_tr) {
> -			/* If we expanded the buffers, make sure the max is expanded too */
> -			if (ring_buffer_expanded)
> -				ring_buffer_resize(max_tr.buffer, trace_buf_size,
> -						   RING_BUFFER_ALL_CPUS);
> -			type->allocated_snapshot = true;
> -		}
> -
> -		/* the test is responsible for initializing and enabling */
> -		pr_info("Testing tracer %s: ", type->name);
> -		ret = type->selftest(type, tr);
> -		/* the test is responsible for resetting too */
> -		current_trace = saved_tracer;
> -		if (ret) {
> -			printk(KERN_CONT "FAILED!\n");
> -			/* Add the warning after printing 'FAILED' */
> -			WARN_ON(1);
> -			goto out;
> -		}
> -		/* Only reset on passing, to avoid touching corrupted buffers */
> -		tracing_reset_online_cpus(tr);
> -
> -		if (type->use_max_tr) {
> -			type->allocated_snapshot = false;
> -
> -			/* Shrink the max buffer again */
> -			if (ring_buffer_expanded)
> -				ring_buffer_resize(max_tr.buffer, 1,
> -						   RING_BUFFER_ALL_CPUS);
> -		}
> -
> -		printk(KERN_CONT "PASSED\n");
> -	}
> -#endif
> +	ret = run_tracer_selftest(type);
> +	if (ret < 0)
> +		goto out;
>  
>  	type->next = trace_types;
>  	trace_types = type;
> @@ -918,7 +1071,7 @@ int register_tracer(struct tracer *type)
>  	tracing_set_tracer(type->name);
>  	default_bootup_tracer = NULL;
>  	/* disable other selftests, since this will break it. */
> -	tracing_selftest_disabled = 1;
> +	tracing_selftest_disabled = true;
>  #ifdef CONFIG_FTRACE_STARTUP_TEST
>  	printk(KERN_INFO "Disabling FTRACE selftests due to running tracer '%s'\n",
>  	       type->name);
> @@ -928,9 +1081,9 @@ int register_tracer(struct tracer *type)
>  	return ret;
>  }
>  
> -void tracing_reset(struct trace_array *tr, int cpu)
> +void tracing_reset(struct trace_buffer *buf, int cpu)
>  {
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = buf->buffer;
>  
>  	if (!buffer)
>  		return;
> @@ -944,9 +1097,9 @@ void tracing_reset(struct trace_array *tr, int cpu)
>  	ring_buffer_record_enable(buffer);
>  }
>  
> -void tracing_reset_online_cpus(struct trace_array *tr)
> +void tracing_reset_online_cpus(struct trace_buffer *buf)
>  {
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = buf->buffer;
>  	int cpu;
>  
>  	if (!buffer)
> @@ -957,7 +1110,7 @@ void tracing_reset_online_cpus(struct trace_array *tr)
>  	/* Make sure all commits have finished */
>  	synchronize_sched();
>  
> -	tr->time_start = ftrace_now(tr->cpu);
> +	buf->time_start = ftrace_now(buf->cpu);
>  
>  	for_each_online_cpu(cpu)
>  		ring_buffer_reset_cpu(buffer, cpu);
> @@ -967,12 +1120,21 @@ void tracing_reset_online_cpus(struct trace_array *tr)
>  
>  void tracing_reset_current(int cpu)
>  {
> -	tracing_reset(&global_trace, cpu);
> +	tracing_reset(&global_trace.trace_buffer, cpu);
>  }
>  
> -void tracing_reset_current_online_cpus(void)
> +void tracing_reset_all_online_cpus(void)
>  {
> -	tracing_reset_online_cpus(&global_trace);
> +	struct trace_array *tr;
> +
> +	mutex_lock(&trace_types_lock);
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
> +		tracing_reset_online_cpus(&tr->trace_buffer);
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +		tracing_reset_online_cpus(&tr->max_buffer);
> +#endif
> +	}
> +	mutex_unlock(&trace_types_lock);
>  }
>  
>  #define SAVED_CMDLINES 128
> @@ -995,7 +1157,7 @@ static void trace_init_cmdlines(void)
>  
>  int is_tracing_stopped(void)
>  {
> -	return trace_stop_count;
> +	return global_trace.stop_count;
>  }
>  
>  /**
> @@ -1027,12 +1189,12 @@ void tracing_start(void)
>  	if (tracing_disabled)
>  		return;
>  
> -	raw_spin_lock_irqsave(&tracing_start_lock, flags);
> -	if (--trace_stop_count) {
> -		if (trace_stop_count < 0) {
> +	raw_spin_lock_irqsave(&global_trace.start_lock, flags);
> +	if (--global_trace.stop_count) {
> +		if (global_trace.stop_count < 0) {
>  			/* Someone screwed up their debugging */
>  			WARN_ON_ONCE(1);
> -			trace_stop_count = 0;
> +			global_trace.stop_count = 0;
>  		}
>  		goto out;
>  	}
> @@ -1040,19 +1202,52 @@ void tracing_start(void)
>  	/* Prevent the buffers from switching */
>  	arch_spin_lock(&ftrace_max_lock);
>  
> -	buffer = global_trace.buffer;
> +	buffer = global_trace.trace_buffer.buffer;
>  	if (buffer)
>  		ring_buffer_record_enable(buffer);
>  
> -	buffer = max_tr.buffer;
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	buffer = global_trace.max_buffer.buffer;
>  	if (buffer)
>  		ring_buffer_record_enable(buffer);
> +#endif
>  
>  	arch_spin_unlock(&ftrace_max_lock);
>  
>  	ftrace_start();
>   out:
> -	raw_spin_unlock_irqrestore(&tracing_start_lock, flags);
> +	raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
> +}
> +
> +static void tracing_start_tr(struct trace_array *tr)
> +{
> +	struct ring_buffer *buffer;
> +	unsigned long flags;
> +
> +	if (tracing_disabled)
> +		return;
> +
> +	/* If global, we need to also start the max tracer */
> +	if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> +		return tracing_start();
> +
> +	raw_spin_lock_irqsave(&tr->start_lock, flags);
> +
> +	if (--tr->stop_count) {
> +		if (tr->stop_count < 0) {
> +			/* Someone screwed up their debugging */
> +			WARN_ON_ONCE(1);
> +			tr->stop_count = 0;
> +		}
> +		goto out;
> +	}
> +
> +	buffer = tr->trace_buffer.buffer;
> +	if (buffer)
> +		ring_buffer_record_enable(buffer);
> +
> + out:
> +	raw_spin_unlock_irqrestore(&tr->start_lock, flags);
>  }
>  
>  /**
> @@ -1067,25 +1262,48 @@ void tracing_stop(void)
>  	unsigned long flags;
>  
>  	ftrace_stop();
> -	raw_spin_lock_irqsave(&tracing_start_lock, flags);
> -	if (trace_stop_count++)
> +	raw_spin_lock_irqsave(&global_trace.start_lock, flags);
> +	if (global_trace.stop_count++)
>  		goto out;
>  
>  	/* Prevent the buffers from switching */
>  	arch_spin_lock(&ftrace_max_lock);
>  
> -	buffer = global_trace.buffer;
> +	buffer = global_trace.trace_buffer.buffer;
>  	if (buffer)
>  		ring_buffer_record_disable(buffer);
>  
> -	buffer = max_tr.buffer;
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	buffer = global_trace.max_buffer.buffer;
>  	if (buffer)
>  		ring_buffer_record_disable(buffer);
> +#endif
>  
>  	arch_spin_unlock(&ftrace_max_lock);
>  
>   out:
> -	raw_spin_unlock_irqrestore(&tracing_start_lock, flags);
> +	raw_spin_unlock_irqrestore(&global_trace.start_lock, flags);
> +}
> +
> +static void tracing_stop_tr(struct trace_array *tr)
> +{
> +	struct ring_buffer *buffer;
> +	unsigned long flags;
> +
> +	/* If global, we need to also stop the max tracer */
> +	if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> +		return tracing_stop();
> +
> +	raw_spin_lock_irqsave(&tr->start_lock, flags);
> +	if (tr->stop_count++)
> +		goto out;
> +
> +	buffer = tr->trace_buffer.buffer;
> +	if (buffer)
> +		ring_buffer_record_disable(buffer);
> +
> + out:
> +	raw_spin_unlock_irqrestore(&tr->start_lock, flags);
>  }
>  
>  void trace_stop_cmdline_recording(void);
> @@ -1218,11 +1436,6 @@ void
>  __buffer_unlock_commit(struct ring_buffer *buffer, struct ring_buffer_event *event)
>  {
>  	__this_cpu_write(trace_cmdline_save, true);
> -	if (trace_wakeup_needed) {
> -		trace_wakeup_needed = false;
> -		/* irq_work_queue() supplies it's own memory barriers */
> -		irq_work_queue(&trace_work_wakeup);
> -	}
>  	ring_buffer_unlock_commit(buffer, event);
>  }
>  
> @@ -1246,11 +1459,23 @@ void trace_buffer_unlock_commit(struct ring_buffer *buffer,
>  EXPORT_SYMBOL_GPL(trace_buffer_unlock_commit);
>  
>  struct ring_buffer_event *
> +trace_event_buffer_lock_reserve(struct ring_buffer **current_rb,
> +			  struct ftrace_event_file *ftrace_file,
> +			  int type, unsigned long len,
> +			  unsigned long flags, int pc)
> +{
> +	*current_rb = ftrace_file->tr->trace_buffer.buffer;
> +	return trace_buffer_lock_reserve(*current_rb,
> +					 type, len, flags, pc);
> +}
> +EXPORT_SYMBOL_GPL(trace_event_buffer_lock_reserve);
> +
> +struct ring_buffer_event *
>  trace_current_buffer_lock_reserve(struct ring_buffer **current_rb,
>  				  int type, unsigned long len,
>  				  unsigned long flags, int pc)
>  {
> -	*current_rb = global_trace.buffer;
> +	*current_rb = global_trace.trace_buffer.buffer;
>  	return trace_buffer_lock_reserve(*current_rb,
>  					 type, len, flags, pc);
>  }
> @@ -1289,7 +1514,7 @@ trace_function(struct trace_array *tr,
>  	       int pc)
>  {
>  	struct ftrace_event_call *call = &event_function;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	struct ring_buffer_event *event;
>  	struct ftrace_entry *entry;
>  
> @@ -1430,13 +1655,14 @@ void ftrace_trace_stack(struct ring_buffer *buffer, unsigned long flags,
>  void __trace_stack(struct trace_array *tr, unsigned long flags, int skip,
>  		   int pc)
>  {
> -	__ftrace_trace_stack(tr->buffer, flags, skip, pc, NULL);
> +	__ftrace_trace_stack(tr->trace_buffer.buffer, flags, skip, pc, NULL);
>  }
>  
>  /**
>   * trace_dump_stack - record a stack back trace in the trace buffer
> + * @skip: Number of functions to skip (helper handlers)
>   */
> -void trace_dump_stack(void)
> +void trace_dump_stack(int skip)
>  {
>  	unsigned long flags;
>  
> @@ -1445,8 +1671,13 @@ void trace_dump_stack(void)
>  
>  	local_save_flags(flags);
>  
> -	/* skipping 3 traces, seems to get us at the caller of this function */
> -	__ftrace_trace_stack(global_trace.buffer, flags, 3, preempt_count(), NULL);
> +	/*
> +	 * Skip 3 more; that seems to get us to the caller of
> +	 * this function.
> +	 */
> +	skip += 3;
> +	__ftrace_trace_stack(global_trace.trace_buffer.buffer,
> +			     flags, skip, preempt_count(), NULL);
>  }
>  
>  static DEFINE_PER_CPU(int, user_stack_count);
> @@ -1616,7 +1847,7 @@ void trace_printk_init_buffers(void)
>  	 * directly here. If the global_trace.buffer is already
>  	 * allocated here, then this was called by module code.
>  	 */
> -	if (global_trace.buffer)
> +	if (global_trace.trace_buffer.buffer)
>  		tracing_start_cmdline_record();
>  }
>  
> @@ -1676,7 +1907,7 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
>  
>  	local_save_flags(flags);
>  	size = sizeof(*entry) + sizeof(u32) * len;
> -	buffer = tr->buffer;
> +	buffer = tr->trace_buffer.buffer;
>  	event = trace_buffer_lock_reserve(buffer, TRACE_BPRINT, size,
>  					  flags, pc);
>  	if (!event)
> @@ -1699,27 +1930,12 @@ out:
>  }
>  EXPORT_SYMBOL_GPL(trace_vbprintk);
>  
> -int trace_array_printk(struct trace_array *tr,
> -		       unsigned long ip, const char *fmt, ...)
> -{
> -	int ret;
> -	va_list ap;
> -
> -	if (!(trace_flags & TRACE_ITER_PRINTK))
> -		return 0;
> -
> -	va_start(ap, fmt);
> -	ret = trace_array_vprintk(tr, ip, fmt, ap);
> -	va_end(ap);
> -	return ret;
> -}
> -
> -int trace_array_vprintk(struct trace_array *tr,
> -			unsigned long ip, const char *fmt, va_list args)
> +static int
> +__trace_array_vprintk(struct ring_buffer *buffer,
> +		      unsigned long ip, const char *fmt, va_list args)
>  {
>  	struct ftrace_event_call *call = &event_print;
>  	struct ring_buffer_event *event;
> -	struct ring_buffer *buffer;
>  	int len = 0, size, pc;
>  	struct print_entry *entry;
>  	unsigned long flags;
> @@ -1747,7 +1963,6 @@ int trace_array_vprintk(struct trace_array *tr,
>  
>  	local_save_flags(flags);
>  	size = sizeof(*entry) + len + 1;
> -	buffer = tr->buffer;
>  	event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
>  					  flags, pc);
>  	if (!event)
> @@ -1768,8 +1983,44 @@ int trace_array_vprintk(struct trace_array *tr,
>  	return len;
>  }
>  
> -int trace_vprintk(unsigned long ip, const char *fmt, va_list args)
> -{
> +int trace_array_vprintk(struct trace_array *tr,
> +			unsigned long ip, const char *fmt, va_list args)
> +{
> +	return __trace_array_vprintk(tr->trace_buffer.buffer, ip, fmt, args);
> +}
> +
> +int trace_array_printk(struct trace_array *tr,
> +		       unsigned long ip, const char *fmt, ...)
> +{
> +	int ret;
> +	va_list ap;
> +
> +	if (!(trace_flags & TRACE_ITER_PRINTK))
> +		return 0;
> +
> +	va_start(ap, fmt);
> +	ret = trace_array_vprintk(tr, ip, fmt, ap);
> +	va_end(ap);
> +	return ret;
> +}
> +
> +int trace_array_printk_buf(struct ring_buffer *buffer,
> +			   unsigned long ip, const char *fmt, ...)
> +{
> +	int ret;
> +	va_list ap;
> +
> +	if (!(trace_flags & TRACE_ITER_PRINTK))
> +		return 0;
> +
> +	va_start(ap, fmt);
> +	ret = __trace_array_vprintk(buffer, ip, fmt, ap);
> +	va_end(ap);
> +	return ret;
> +}
> +
> +int trace_vprintk(unsigned long ip, const char *fmt, va_list args)
> +{
>  	return trace_array_vprintk(&global_trace, ip, fmt, args);
>  }
>  EXPORT_SYMBOL_GPL(trace_vprintk);
> @@ -1793,7 +2044,7 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts,
>  	if (buf_iter)
>  		event = ring_buffer_iter_peek(buf_iter, ts);
>  	else
> -		event = ring_buffer_peek(iter->tr->buffer, cpu, ts,
> +		event = ring_buffer_peek(iter->trace_buffer->buffer, cpu, ts,
>  					 lost_events);
>  
>  	if (event) {
> @@ -1808,7 +2059,7 @@ static struct trace_entry *
>  __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
>  		  unsigned long *missing_events, u64 *ent_ts)
>  {
> -	struct ring_buffer *buffer = iter->tr->buffer;
> +	struct ring_buffer *buffer = iter->trace_buffer->buffer;
>  	struct trace_entry *ent, *next = NULL;
>  	unsigned long lost_events = 0, next_lost = 0;
>  	int cpu_file = iter->cpu_file;
> @@ -1821,7 +2072,7 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu,
>  	 * If we are in a per_cpu trace file, don't bother by iterating over
>  	 * all cpu and peek directly.
>  	 */
> -	if (cpu_file > TRACE_PIPE_ALL_CPU) {
> +	if (cpu_file > RING_BUFFER_ALL_CPUS) {
>  		if (ring_buffer_empty_cpu(buffer, cpu_file))
>  			return NULL;
>  		ent = peek_next_entry(iter, cpu_file, ent_ts, missing_events);
> @@ -1885,7 +2136,7 @@ void *trace_find_next_entry_inc(struct trace_iterator *iter)
>  
>  static void trace_consume(struct trace_iterator *iter)
>  {
> -	ring_buffer_consume(iter->tr->buffer, iter->cpu, &iter->ts,
> +	ring_buffer_consume(iter->trace_buffer->buffer, iter->cpu, &iter->ts,
>  			    &iter->lost_events);
>  }
>  
> @@ -1918,13 +2169,12 @@ static void *s_next(struct seq_file *m, void *v, loff_t *pos)
>  
>  void tracing_iter_reset(struct trace_iterator *iter, int cpu)
>  {
> -	struct trace_array *tr = iter->tr;
>  	struct ring_buffer_event *event;
>  	struct ring_buffer_iter *buf_iter;
>  	unsigned long entries = 0;
>  	u64 ts;
>  
> -	tr->data[cpu]->skipped_entries = 0;
> +	per_cpu_ptr(iter->trace_buffer->data, cpu)->skipped_entries = 0;
>  
>  	buf_iter = trace_buffer_iter(iter, cpu);
>  	if (!buf_iter)
> @@ -1938,13 +2188,13 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu)
>  	 * by the timestamp being before the start of the buffer.
>  	 */
>  	while ((event = ring_buffer_iter_peek(buf_iter, &ts))) {
> -		if (ts >= iter->tr->time_start)
> +		if (ts >= iter->trace_buffer->time_start)
>  			break;
>  		entries++;
>  		ring_buffer_read(buf_iter, NULL);
>  	}
>  
> -	tr->data[cpu]->skipped_entries = entries;
> +	per_cpu_ptr(iter->trace_buffer->data, cpu)->skipped_entries = entries;
>  }
>  
>  /*
> @@ -1954,6 +2204,7 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu)
>  static void *s_start(struct seq_file *m, loff_t *pos)
>  {
>  	struct trace_iterator *iter = m->private;
> +	struct trace_array *tr = iter->tr;
>  	int cpu_file = iter->cpu_file;
>  	void *p = NULL;
>  	loff_t l = 0;
> @@ -1966,12 +2217,14 @@ static void *s_start(struct seq_file *m, loff_t *pos)
>  	 * will point to the same string as current_trace->name.
>  	 */
>  	mutex_lock(&trace_types_lock);
> -	if (unlikely(current_trace && iter->trace->name != current_trace->name))
> -		*iter->trace = *current_trace;
> +	if (unlikely(tr->current_trace && iter->trace->name != tr->current_trace->name))
> +		*iter->trace = *tr->current_trace;
>  	mutex_unlock(&trace_types_lock);
>  
> +#ifdef CONFIG_TRACER_MAX_TRACE
>  	if (iter->snapshot && iter->trace->use_max_tr)
>  		return ERR_PTR(-EBUSY);
> +#endif
>  
>  	if (!iter->snapshot)
>  		atomic_inc(&trace_record_cmdline_disabled);
> @@ -1981,7 +2234,7 @@ static void *s_start(struct seq_file *m, loff_t *pos)
>  		iter->cpu = 0;
>  		iter->idx = -1;
>  
> -		if (cpu_file == TRACE_PIPE_ALL_CPU) {
> +		if (cpu_file == RING_BUFFER_ALL_CPUS) {
>  			for_each_tracing_cpu(cpu)
>  				tracing_iter_reset(iter, cpu);
>  		} else
> @@ -2013,17 +2266,21 @@ static void s_stop(struct seq_file *m, void *p)
>  {
>  	struct trace_iterator *iter = m->private;
>  
> +#ifdef CONFIG_TRACER_MAX_TRACE
>  	if (iter->snapshot && iter->trace->use_max_tr)
>  		return;
> +#endif
>  
>  	if (!iter->snapshot)
>  		atomic_dec(&trace_record_cmdline_disabled);
> +
>  	trace_access_unlock(iter->cpu_file);
>  	trace_event_read_unlock();
>  }
>  
>  static void
> -get_total_entries(struct trace_array *tr, unsigned long *total, unsigned long *entries)
> +get_total_entries(struct trace_buffer *buf,
> +		  unsigned long *total, unsigned long *entries)
>  {
>  	unsigned long count;
>  	int cpu;
> @@ -2032,19 +2289,19 @@ get_total_entries(struct trace_array *tr, unsigned long *total, unsigned long *e
>  	*entries = 0;
>  
>  	for_each_tracing_cpu(cpu) {
> -		count = ring_buffer_entries_cpu(tr->buffer, cpu);
> +		count = ring_buffer_entries_cpu(buf->buffer, cpu);
>  		/*
>  		 * If this buffer has skipped entries, then we hold all
>  		 * entries for the trace and we need to ignore the
>  		 * ones before the time stamp.
>  		 */
> -		if (tr->data[cpu]->skipped_entries) {
> -			count -= tr->data[cpu]->skipped_entries;
> +		if (per_cpu_ptr(buf->data, cpu)->skipped_entries) {
> +			count -= per_cpu_ptr(buf->data, cpu)->skipped_entries;
>  			/* total is the same as the entries */
>  			*total += count;
>  		} else
>  			*total += count +
> -				ring_buffer_overrun_cpu(tr->buffer, cpu);
> +				ring_buffer_overrun_cpu(buf->buffer, cpu);
>  		*entries += count;
>  	}
>  }
> @@ -2061,27 +2318,27 @@ static void print_lat_help_header(struct seq_file *m)
>  	seq_puts(m, "#     \\   /      |||||  \\    |   /           \n");
>  }
>  
> -static void print_event_info(struct trace_array *tr, struct seq_file *m)
> +static void print_event_info(struct trace_buffer *buf, struct seq_file *m)
>  {
>  	unsigned long total;
>  	unsigned long entries;
>  
> -	get_total_entries(tr, &total, &entries);
> +	get_total_entries(buf, &total, &entries);
>  	seq_printf(m, "# entries-in-buffer/entries-written: %lu/%lu   #P:%d\n",
>  		   entries, total, num_online_cpus());
>  	seq_puts(m, "#\n");
>  }
>  
> -static void print_func_help_header(struct trace_array *tr, struct seq_file *m)
> +static void print_func_help_header(struct trace_buffer *buf, struct seq_file *m)
>  {
> -	print_event_info(tr, m);
> +	print_event_info(buf, m);
>  	seq_puts(m, "#           TASK-PID   CPU#      TIMESTAMP  FUNCTION\n");
>  	seq_puts(m, "#              | |       |          |         |\n");
>  }
>  
> -static void print_func_help_header_irq(struct trace_array *tr, struct seq_file *m)
> +static void print_func_help_header_irq(struct trace_buffer *buf, struct seq_file *m)
>  {
> -	print_event_info(tr, m);
> +	print_event_info(buf, m);
>  	seq_puts(m, "#                              _-----=> irqs-off\n");
>  	seq_puts(m, "#                             / _----=> need-resched\n");
>  	seq_puts(m, "#                            | / _---=> hardirq/softirq\n");
> @@ -2095,16 +2352,16 @@ void
>  print_trace_header(struct seq_file *m, struct trace_iterator *iter)
>  {
>  	unsigned long sym_flags = (trace_flags & TRACE_ITER_SYM_MASK);
> -	struct trace_array *tr = iter->tr;
> -	struct trace_array_cpu *data = tr->data[tr->cpu];
> -	struct tracer *type = current_trace;
> +	struct trace_buffer *buf = iter->trace_buffer;
> +	struct trace_array_cpu *data = per_cpu_ptr(buf->data, buf->cpu);
> +	struct tracer *type = iter->trace;
>  	unsigned long entries;
>  	unsigned long total;
>  	const char *name = "preemption";
>  
>  	name = type->name;
>  
> -	get_total_entries(tr, &total, &entries);
> +	get_total_entries(buf, &total, &entries);
>  
>  	seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
>  		   name, UTS_RELEASE);
> @@ -2115,7 +2372,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
>  		   nsecs_to_usecs(data->saved_latency),
>  		   entries,
>  		   total,
> -		   tr->cpu,
> +		   buf->cpu,
>  #if defined(CONFIG_PREEMPT_NONE)
>  		   "server",
>  #elif defined(CONFIG_PREEMPT_VOLUNTARY)
> @@ -2166,7 +2423,7 @@ static void test_cpu_buff_start(struct trace_iterator *iter)
>  	if (cpumask_test_cpu(iter->cpu, iter->started))
>  		return;
>  
> -	if (iter->tr->data[iter->cpu]->skipped_entries)
> +	if (per_cpu_ptr(iter->trace_buffer->data, iter->cpu)->skipped_entries)
>  		return;
>  
>  	cpumask_set_cpu(iter->cpu, iter->started);
> @@ -2289,14 +2546,14 @@ int trace_empty(struct trace_iterator *iter)
>  	int cpu;
>  
>  	/* If we are looking at one CPU buffer, only check that one */
> -	if (iter->cpu_file != TRACE_PIPE_ALL_CPU) {
> +	if (iter->cpu_file != RING_BUFFER_ALL_CPUS) {
>  		cpu = iter->cpu_file;
>  		buf_iter = trace_buffer_iter(iter, cpu);
>  		if (buf_iter) {
>  			if (!ring_buffer_iter_empty(buf_iter))
>  				return 0;
>  		} else {
> -			if (!ring_buffer_empty_cpu(iter->tr->buffer, cpu))
> +			if (!ring_buffer_empty_cpu(iter->trace_buffer->buffer, cpu))
>  				return 0;
>  		}
>  		return 1;
> @@ -2308,7 +2565,7 @@ int trace_empty(struct trace_iterator *iter)
>  			if (!ring_buffer_iter_empty(buf_iter))
>  				return 0;
>  		} else {
> -			if (!ring_buffer_empty_cpu(iter->tr->buffer, cpu))
> +			if (!ring_buffer_empty_cpu(iter->trace_buffer->buffer, cpu))
>  				return 0;
>  		}
>  	}
> @@ -2332,6 +2589,11 @@ enum print_line_t print_trace_line(struct trace_iterator *iter)
>  			return ret;
>  	}
>  
> +	if (iter->ent->type == TRACE_BPUTS &&
> +			trace_flags & TRACE_ITER_PRINTK &&
> +			trace_flags & TRACE_ITER_PRINTK_MSGONLY)
> +		return trace_print_bputs_msg_only(iter);
> +
>  	if (iter->ent->type == TRACE_BPRINT &&
>  			trace_flags & TRACE_ITER_PRINTK &&
>  			trace_flags & TRACE_ITER_PRINTK_MSGONLY)
> @@ -2386,9 +2648,9 @@ void trace_default_header(struct seq_file *m)
>  	} else {
>  		if (!(trace_flags & TRACE_ITER_VERBOSE)) {
>  			if (trace_flags & TRACE_ITER_IRQ_INFO)
> -				print_func_help_header_irq(iter->tr, m);
> +				print_func_help_header_irq(iter->trace_buffer, m);
>  			else
> -				print_func_help_header(iter->tr, m);
> +				print_func_help_header(iter->trace_buffer, m);
>  		}
>  	}
>  }
> @@ -2402,14 +2664,8 @@ static void test_ftrace_alive(struct seq_file *m)
>  }
>  
>  #ifdef CONFIG_TRACER_MAX_TRACE
> -static void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
> +static void show_snapshot_main_help(struct seq_file *m)
>  {
> -	if (iter->trace->allocated_snapshot)
> -		seq_printf(m, "#\n# * Snapshot is allocated *\n#\n");
> -	else
> -		seq_printf(m, "#\n# * Snapshot is freed *\n#\n");
> -
> -	seq_printf(m, "# Snapshot commands:\n");
>  	seq_printf(m, "# echo 0 > snapshot : Clears and frees snapshot buffer\n");
>  	seq_printf(m, "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n");
>  	seq_printf(m, "#                      Takes a snapshot of the main buffer.\n");
> @@ -2417,6 +2673,35 @@ static void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
>  	seq_printf(m, "#                      (Doesn't have to be '2' works with any number that\n");
>  	seq_printf(m, "#                       is not a '0' or '1')\n");
>  }
> +
> +static void show_snapshot_percpu_help(struct seq_file *m)
> +{
> +	seq_printf(m, "# echo 0 > snapshot : Invalid for per_cpu snapshot file.\n");
> +#ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
> +	seq_printf(m, "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n");
> +	seq_printf(m, "#                      Takes a snapshot of the main buffer for this cpu.\n");
> +#else
> +	seq_printf(m, "# echo 1 > snapshot : Not supported with this kernel.\n");
> +	seq_printf(m, "#                     Must use main snapshot file to allocate.\n");
> +#endif
> +	seq_printf(m, "# echo 2 > snapshot : Clears this cpu's snapshot buffer (but does not allocate)\n");
> +	seq_printf(m, "#                      (Doesn't have to be '2' works with any number that\n");
> +	seq_printf(m, "#                       is not a '0' or '1')\n");
> +}
> +
> +static void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
> +{
> +	if (iter->tr->allocated_snapshot)
> +		seq_printf(m, "#\n# * Snapshot is allocated *\n#\n");
> +	else
> +		seq_printf(m, "#\n# * Snapshot is freed *\n#\n");
> +
> +	seq_printf(m, "# Snapshot commands:\n");
> +	if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
> +		show_snapshot_main_help(m);
> +	else
> +		show_snapshot_percpu_help(m);
> +}
>  #else
>  /* Should never be called */
>  static inline void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter) { }
> @@ -2476,7 +2761,8 @@ static const struct seq_operations tracer_seq_ops = {
>  static struct trace_iterator *
>  __tracing_open(struct inode *inode, struct file *file, bool snapshot)
>  {
> -	long cpu_file = (long) inode->i_private;
> +	struct trace_cpu *tc = inode->i_private;
> +	struct trace_array *tr = tc->tr;
>  	struct trace_iterator *iter;
>  	int cpu;
>  
> @@ -2501,26 +2787,31 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
>  	if (!iter->trace)
>  		goto fail;
>  
> -	*iter->trace = *current_trace;
> +	*iter->trace = *tr->current_trace;
>  
>  	if (!zalloc_cpumask_var(&iter->started, GFP_KERNEL))
>  		goto fail;
>  
> -	if (current_trace->print_max || snapshot)
> -		iter->tr = &max_tr;
> +	iter->tr = tr;
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	/* Currently only the top directory has a snapshot */
> +	if (tr->current_trace->print_max || snapshot)
> +		iter->trace_buffer = &tr->max_buffer;
>  	else
> -		iter->tr = &global_trace;
> +#endif
> +		iter->trace_buffer = &tr->trace_buffer;
>  	iter->snapshot = snapshot;
>  	iter->pos = -1;
>  	mutex_init(&iter->mutex);
> -	iter->cpu_file = cpu_file;
> +	iter->cpu_file = tc->cpu;
>  
>  	/* Notify the tracer early; before we stop tracing. */
>  	if (iter->trace && iter->trace->open)
>  		iter->trace->open(iter);
>  
>  	/* Annotate start of buffers if we had overruns */
> -	if (ring_buffer_overruns(iter->tr->buffer))
> +	if (ring_buffer_overruns(iter->trace_buffer->buffer))
>  		iter->iter_flags |= TRACE_FILE_ANNOTATE;
>  
>  	/* Output in nanoseconds only if we are using a clock in nanoseconds. */
> @@ -2529,12 +2820,12 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
>  
>  	/* stop the trace while dumping if we are not opening "snapshot" */
>  	if (!iter->snapshot)
> -		tracing_stop();
> +		tracing_stop_tr(tr);
>  
> -	if (iter->cpu_file == TRACE_PIPE_ALL_CPU) {
> +	if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
>  		for_each_tracing_cpu(cpu) {
>  			iter->buffer_iter[cpu] =
> -				ring_buffer_read_prepare(iter->tr->buffer, cpu);
> +				ring_buffer_read_prepare(iter->trace_buffer->buffer, cpu);
>  		}
>  		ring_buffer_read_prepare_sync();
>  		for_each_tracing_cpu(cpu) {
> @@ -2544,12 +2835,14 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot)
>  	} else {
>  		cpu = iter->cpu_file;
>  		iter->buffer_iter[cpu] =
> -			ring_buffer_read_prepare(iter->tr->buffer, cpu);
> +			ring_buffer_read_prepare(iter->trace_buffer->buffer, cpu);
>  		ring_buffer_read_prepare_sync();
>  		ring_buffer_read_start(iter->buffer_iter[cpu]);
>  		tracing_iter_reset(iter, cpu);
>  	}
>  
> +	tr->ref++;
> +
>  	mutex_unlock(&trace_types_lock);
>  
>  	return iter;
> @@ -2576,14 +2869,20 @@ static int tracing_release(struct inode *inode, struct file *file)
>  {
>  	struct seq_file *m = file->private_data;
>  	struct trace_iterator *iter;
> +	struct trace_array *tr;
>  	int cpu;
>  
>  	if (!(file->f_mode & FMODE_READ))
>  		return 0;
>  
>  	iter = m->private;
> +	tr = iter->tr;
>  
>  	mutex_lock(&trace_types_lock);
> +
> +	WARN_ON(!tr->ref);
> +	tr->ref--;
> +
>  	for_each_tracing_cpu(cpu) {
>  		if (iter->buffer_iter[cpu])
>  			ring_buffer_read_finish(iter->buffer_iter[cpu]);
> @@ -2594,7 +2893,7 @@ static int tracing_release(struct inode *inode, struct file *file)
>  
>  	if (!iter->snapshot)
>  		/* reenable tracing if it was previously enabled */
> -		tracing_start();
> +		tracing_start_tr(tr);
>  	mutex_unlock(&trace_types_lock);
>  
>  	mutex_destroy(&iter->mutex);
> @@ -2613,12 +2912,13 @@ static int tracing_open(struct inode *inode, struct file *file)
>  	/* If this file was open for write, then erase contents */
>  	if ((file->f_mode & FMODE_WRITE) &&
>  	    (file->f_flags & O_TRUNC)) {
> -		long cpu = (long) inode->i_private;
> +		struct trace_cpu *tc = inode->i_private;
> +		struct trace_array *tr = tc->tr;
>  
> -		if (cpu == TRACE_PIPE_ALL_CPU)
> -			tracing_reset_online_cpus(&global_trace);
> +		if (tc->cpu == RING_BUFFER_ALL_CPUS)
> +			tracing_reset_online_cpus(&tr->trace_buffer);
>  		else
> -			tracing_reset(&global_trace, cpu);
> +			tracing_reset(&tr->trace_buffer, tc->cpu);
>  	}
>  
>  	if (file->f_mode & FMODE_READ) {
> @@ -2765,8 +3065,9 @@ static ssize_t
>  tracing_cpumask_write(struct file *filp, const char __user *ubuf,
>  		      size_t count, loff_t *ppos)
>  {
> -	int err, cpu;
> +	struct trace_array *tr = filp->private_data;
>  	cpumask_var_t tracing_cpumask_new;
> +	int err, cpu;
>  
>  	if (!alloc_cpumask_var(&tracing_cpumask_new, GFP_KERNEL))
>  		return -ENOMEM;
> @@ -2786,13 +3087,13 @@ tracing_cpumask_write(struct file *filp, const char __user *ubuf,
>  		 */
>  		if (cpumask_test_cpu(cpu, tracing_cpumask) &&
>  				!cpumask_test_cpu(cpu, tracing_cpumask_new)) {
> -			atomic_inc(&global_trace.data[cpu]->disabled);
> -			ring_buffer_record_disable_cpu(global_trace.buffer, cpu);
> +			atomic_inc(&per_cpu_ptr(tr->trace_buffer.data, cpu)->disabled);
> +			ring_buffer_record_disable_cpu(tr->trace_buffer.buffer, cpu);
>  		}
>  		if (!cpumask_test_cpu(cpu, tracing_cpumask) &&
>  				cpumask_test_cpu(cpu, tracing_cpumask_new)) {
> -			atomic_dec(&global_trace.data[cpu]->disabled);
> -			ring_buffer_record_enable_cpu(global_trace.buffer, cpu);
> +			atomic_dec(&per_cpu_ptr(tr->trace_buffer.data, cpu)->disabled);
> +			ring_buffer_record_enable_cpu(tr->trace_buffer.buffer, cpu);
>  		}
>  	}
>  	arch_spin_unlock(&ftrace_max_lock);
> @@ -2821,12 +3122,13 @@ static const struct file_operations tracing_cpumask_fops = {
>  static int tracing_trace_options_show(struct seq_file *m, void *v)
>  {
>  	struct tracer_opt *trace_opts;
> +	struct trace_array *tr = m->private;
>  	u32 tracer_flags;
>  	int i;
>  
>  	mutex_lock(&trace_types_lock);
> -	tracer_flags = current_trace->flags->val;
> -	trace_opts = current_trace->flags->opts;
> +	tracer_flags = tr->current_trace->flags->val;
> +	trace_opts = tr->current_trace->flags->opts;
>  
>  	for (i = 0; trace_options[i]; i++) {
>  		if (trace_flags & (1 << i))
> @@ -2890,15 +3192,15 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
>  	return 0;
>  }
>  
> -int set_tracer_flag(unsigned int mask, int enabled)
> +int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
>  {
>  	/* do nothing if flag is already set */
>  	if (!!(trace_flags & mask) == !!enabled)
>  		return 0;
>  
>  	/* Give the tracer a chance to approve the change */
> -	if (current_trace->flag_changed)
> -		if (current_trace->flag_changed(current_trace, mask, !!enabled))
> +	if (tr->current_trace->flag_changed)
> +		if (tr->current_trace->flag_changed(tr->current_trace, mask, !!enabled))
>  			return -EINVAL;
>  
>  	if (enabled)
> @@ -2910,9 +3212,9 @@ int set_tracer_flag(unsigned int mask, int enabled)
>  		trace_event_enable_cmd_record(enabled);
>  
>  	if (mask == TRACE_ITER_OVERWRITE) {
> -		ring_buffer_change_overwrite(global_trace.buffer, enabled);
> +		ring_buffer_change_overwrite(tr->trace_buffer.buffer, enabled);
>  #ifdef CONFIG_TRACER_MAX_TRACE
> -		ring_buffer_change_overwrite(max_tr.buffer, enabled);
> +		ring_buffer_change_overwrite(tr->max_buffer.buffer, enabled);
>  #endif
>  	}
>  
> @@ -2922,7 +3224,7 @@ int set_tracer_flag(unsigned int mask, int enabled)
>  	return 0;
>  }
>  
> -static int trace_set_options(char *option)
> +static int trace_set_options(struct trace_array *tr, char *option)
>  {
>  	char *cmp;
>  	int neg = 0;
> @@ -2940,14 +3242,14 @@ static int trace_set_options(char *option)
>  
>  	for (i = 0; trace_options[i]; i++) {
>  		if (strcmp(cmp, trace_options[i]) == 0) {
> -			ret = set_tracer_flag(1 << i, !neg);
> +			ret = set_tracer_flag(tr, 1 << i, !neg);
>  			break;
>  		}
>  	}
>  
>  	/* If no option could be set, test the specific tracer options */
>  	if (!trace_options[i])
> -		ret = set_tracer_option(current_trace, cmp, neg);
> +		ret = set_tracer_option(tr->current_trace, cmp, neg);
>  
>  	mutex_unlock(&trace_types_lock);
>  
> @@ -2958,6 +3260,8 @@ static ssize_t
>  tracing_trace_options_write(struct file *filp, const char __user *ubuf,
>  			size_t cnt, loff_t *ppos)
>  {
> +	struct seq_file *m = filp->private_data;
> +	struct trace_array *tr = m->private;
>  	char buf[64];
>  	int ret;
>  
> @@ -2969,7 +3273,7 @@ tracing_trace_options_write(struct file *filp, const char __user *ubuf,
>  
>  	buf[cnt] = 0;
>  
> -	ret = trace_set_options(buf);
> +	ret = trace_set_options(tr, buf);
>  	if (ret < 0)
>  		return ret;
>  
> @@ -2982,7 +3286,8 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file)
>  {
>  	if (tracing_disabled)
>  		return -ENODEV;
> -	return single_open(file, tracing_trace_options_show, NULL);
> +
> +	return single_open(file, tracing_trace_options_show, inode->i_private);
>  }
>  
>  static const struct file_operations tracing_iter_fops = {
> @@ -2995,20 +3300,84 @@ static const struct file_operations tracing_iter_fops = {
>  
>  static const char readme_msg[] =
>  	"tracing mini-HOWTO:\n\n"
> -	"# mount -t debugfs nodev /sys/kernel/debug\n\n"
> -	"# cat /sys/kernel/debug/tracing/available_tracers\n"
> -	"wakeup wakeup_rt preemptirqsoff preemptoff irqsoff function nop\n\n"
> -	"# cat /sys/kernel/debug/tracing/current_tracer\n"
> -	"nop\n"
> -	"# echo wakeup > /sys/kernel/debug/tracing/current_tracer\n"
> -	"# cat /sys/kernel/debug/tracing/current_tracer\n"
> -	"wakeup\n"
> -	"# cat /sys/kernel/debug/tracing/trace_options\n"
> -	"noprint-parent nosym-offset nosym-addr noverbose\n"
> -	"# echo print-parent > /sys/kernel/debug/tracing/trace_options\n"
> -	"# echo 1 > /sys/kernel/debug/tracing/tracing_on\n"
> -	"# cat /sys/kernel/debug/tracing/trace > /tmp/trace.txt\n"
> -	"# echo 0 > /sys/kernel/debug/tracing/tracing_on\n"
> +	"# echo 0 > tracing_on : quick way to disable tracing\n"
> +	"# echo 1 > tracing_on : quick way to re-enable tracing\n\n"
> +	" Important files:\n"
> +	"  trace\t\t\t- The static contents of the buffer\n"
> +	"\t\t\t  To clear the buffer, write into this file: echo > trace\n"
> +	"  trace_pipe\t\t- A consuming read to see the contents of the buffer\n"
> +	"  current_tracer\t- function and latency tracers\n"
> +	"  available_tracers\t- list of configured tracers for current_tracer\n"
> +	"  buffer_size_kb\t- view and modify size of per cpu buffer\n"
> +	"  buffer_total_size_kb  - view total size of all cpu buffers\n\n"
> +	"  trace_clock\t\t- change the clock used to order events\n"
> +	"       local:   Per cpu clock but may not be synced across CPUs\n"
> +	"      global:   Synced across CPUs but slows tracing down.\n"
> +	"     counter:   Not a clock, but just an increment\n"
> +	"      uptime:   Jiffy counter from time of boot\n"
> +	"        perf:   Same clock that perf events use\n"
> +#ifdef CONFIG_X86_64
> +	"     x86-tsc:   TSC cycle counter\n"
> +#endif
> +	"\n  trace_marker\t\t- Writes into this file write into the kernel buffer\n"
> +	"  tracing_cpumask\t- Limit which CPUs to trace\n"
> +	"  instances\t\t- Make sub-buffers with: mkdir instances/foo\n"
> +	"\t\t\t  Remove sub-buffer with rmdir\n"
> +	"  trace_options\t\t- Set format or modify how tracing happens\n"
> +	"\t\t\t  Disable an option by prefixing 'no' to the option name\n"
> +#ifdef CONFIG_DYNAMIC_FTRACE
> +	"\n  available_filter_functions - list of functions that can be filtered on\n"
> +	"  set_ftrace_filter\t- echo function name in here to only trace these functions\n"
> +	"            accepts: func_full_name, *func_end, func_begin*, *func_middle*\n"
> +	"            modules: Can select a group via module\n"
> +	"             Format: :mod:<module-name>\n"
> +	"             example: echo :mod:ext3 > set_ftrace_filter\n"
> +	"            triggers: a command to perform when function is hit\n"
> +	"              Format: <function>:<trigger>[:count]\n"
> +	"             trigger: traceon, traceoff\n"
> +	"                      enable_event:<system>:<event>\n"
> +	"                      disable_event:<system>:<event>\n"
> +#ifdef CONFIG_STACKTRACE
> +	"                      stacktrace\n"
> +#endif
> +#ifdef CONFIG_TRACER_SNAPSHOT
> +	"                      snapshot\n"
> +#endif
> +	"             example: echo do_fault:traceoff > set_ftrace_filter\n"
> +	"                      echo do_trap:traceoff:3 > set_ftrace_filter\n"
> +	"             The first one will disable tracing every time do_fault is hit\n"
> +	"             The second will disable tracing at most 3 times when do_trap is hit\n"
> +	"               The first time do_trap is hit and it disables tracing, the counter\n"
> +	"               will decrement to 2. If tracing is already disabled, the counter\n"
> +	"               will not decrement. It only decrements when the trigger did work\n"
> +	"             To remove trigger without count:\n"
> +	"               echo '!<function>:<trigger>' > set_ftrace_filter\n"
> +	"             To remove trigger with a count:\n"
> +	"               echo '!<function>:<trigger>:0' > set_ftrace_filter\n"
> +	"  set_ftrace_notrace\t- echo function name in here to never trace.\n"
> +	"            accepts: func_full_name, *func_end, func_begin*, *func_middle*\n"
> +	"            modules: Can select a group via module command :mod:\n"
> +	"            Does not accept triggers\n"
> +#endif /* CONFIG_DYNAMIC_FTRACE */
> +#ifdef CONFIG_FUNCTION_TRACER
> +	"  set_ftrace_pid\t- Write pid(s) to only function trace those pids (function)\n"
> +#endif
> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> +	"  set_graph_function\t- Trace the nested calls of a function (function_graph)\n"
> +	"  max_graph_depth\t- Trace a limited depth of nested calls (0 is unlimited)\n"
> +#endif
> +#ifdef CONFIG_TRACER_SNAPSHOT
> +	"\n  snapshot\t\t- Like 'trace' but shows the content of the static snapshot buffer\n"
> +	"\t\t\t  Read the contents for more information\n"
> +#endif
> +#ifdef CONFIG_STACKTRACE
> +	"  stack_trace\t\t- Shows the max stack trace when active\n"
> +	"  stack_max_size\t- Shows current max stack size that was traced\n"
> +	"\t\t\t  Write into this file to reset the max size (trigger a new trace)\n"
> +#ifdef CONFIG_DYNAMIC_FTRACE
> +	"  stack_trace_filter\t- Like set_ftrace_filter but limits what stack_trace traces\n"
> +#endif
> +#endif /* CONFIG_STACKTRACE */
>  ;
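
To make the trigger syntax documented above concrete, here is a minimal
user-space sketch (not part of the patch) that installs and later removes
one of the counted function triggers. It assumes debugfs is mounted at
/sys/kernel/debug and CONFIG_DYNAMIC_FTRACE is set:

    /* sketch: install a counted traceoff trigger, then remove it */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *s)
    {
        int fd = open(path, O_WRONLY);

        if (fd < 0)
            return -1;
        if (write(fd, s, strlen(s)) != (ssize_t)strlen(s)) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int main(void)
    {
        const char *filter = "/sys/kernel/debug/tracing/set_ftrace_filter";

        /* stop tracing at most 3 times when do_trap is hit */
        if (write_str(filter, "do_trap:traceoff:3") < 0)
            perror("install trigger");

        sleep(5);   /* ... run the workload of interest ... */

        /* '!' removes it again; counted triggers take a :0 suffix */
        if (write_str(filter, "!do_trap:traceoff:0") < 0)
            perror("remove trigger");

        return 0;
    }

The same write interface takes the enable_event/disable_event, stacktrace
and snapshot triggers listed in the README text.
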
>  
>  static ssize_t
> @@ -3080,11 +3449,12 @@ static ssize_t
>  tracing_set_trace_read(struct file *filp, char __user *ubuf,
>  		       size_t cnt, loff_t *ppos)
>  {
> +	struct trace_array *tr = filp->private_data;
>  	char buf[MAX_TRACER_SIZE+2];
>  	int r;
>  
>  	mutex_lock(&trace_types_lock);
> -	r = sprintf(buf, "%s\n", current_trace->name);
> +	r = sprintf(buf, "%s\n", tr->current_trace->name);
>  	mutex_unlock(&trace_types_lock);
>  
>  	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
> @@ -3092,43 +3462,48 @@ tracing_set_trace_read(struct file *filp, char __user *ubuf,
>  
>  int tracer_init(struct tracer *t, struct trace_array *tr)
>  {
> -	tracing_reset_online_cpus(tr);
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  	return t->init(tr);
>  }
>  
> -static void set_buffer_entries(struct trace_array *tr, unsigned long val)
> +static void set_buffer_entries(struct trace_buffer *buf, unsigned long val)
>  {
>  	int cpu;
> +
>  	for_each_tracing_cpu(cpu)
> -		tr->data[cpu]->entries = val;
> +		per_cpu_ptr(buf->data, cpu)->entries = val;
>  }
>  
> +#ifdef CONFIG_TRACER_MAX_TRACE
>  /* resize @tr's buffer to the size of @size_tr's entries */
> -static int resize_buffer_duplicate_size(struct trace_array *tr,
> -					struct trace_array *size_tr, int cpu_id)
> +static int resize_buffer_duplicate_size(struct trace_buffer *trace_buf,
> +					struct trace_buffer *size_buf, int cpu_id)
>  {
>  	int cpu, ret = 0;
>  
>  	if (cpu_id == RING_BUFFER_ALL_CPUS) {
>  		for_each_tracing_cpu(cpu) {
> -			ret = ring_buffer_resize(tr->buffer,
> -					size_tr->data[cpu]->entries, cpu);
> +			ret = ring_buffer_resize(trace_buf->buffer,
> +				 per_cpu_ptr(size_buf->data, cpu)->entries, cpu);
>  			if (ret < 0)
>  				break;
> -			tr->data[cpu]->entries = size_tr->data[cpu]->entries;
> +			per_cpu_ptr(trace_buf->data, cpu)->entries =
> +				per_cpu_ptr(size_buf->data, cpu)->entries;
>  		}
>  	} else {
> -		ret = ring_buffer_resize(tr->buffer,
> -					size_tr->data[cpu_id]->entries, cpu_id);
> +		ret = ring_buffer_resize(trace_buf->buffer,
> +				 per_cpu_ptr(size_buf->data, cpu_id)->entries, cpu_id);
>  		if (ret == 0)
> -			tr->data[cpu_id]->entries =
> -				size_tr->data[cpu_id]->entries;
> +			per_cpu_ptr(trace_buf->data, cpu_id)->entries =
> +				per_cpu_ptr(size_buf->data, cpu_id)->entries;
>  	}
>  
>  	return ret;
>  }
> +#endif /* CONFIG_TRACER_MAX_TRACE */
>  
> -static int __tracing_resize_ring_buffer(unsigned long size, int cpu)
> +static int __tracing_resize_ring_buffer(struct trace_array *tr,
> +					unsigned long size, int cpu)
>  {
>  	int ret;
>  
> @@ -3137,23 +3512,25 @@ static int __tracing_resize_ring_buffer(unsigned long size, int cpu)
>  	 * we use the size that was given, and we can forget about
>  	 * expanding it later.
>  	 */
> -	ring_buffer_expanded = 1;
> +	ring_buffer_expanded = true;
>  
>  	/* May be called before buffers are initialized */
> -	if (!global_trace.buffer)
> +	if (!tr->trace_buffer.buffer)
>  		return 0;
>  
> -	ret = ring_buffer_resize(global_trace.buffer, size, cpu);
> +	ret = ring_buffer_resize(tr->trace_buffer.buffer, size, cpu);
>  	if (ret < 0)
>  		return ret;
>  
> -	if (!current_trace->use_max_tr)
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	if (!(tr->flags & TRACE_ARRAY_FL_GLOBAL) ||
> +	    !tr->current_trace->use_max_tr)
>  		goto out;
>  
> -	ret = ring_buffer_resize(max_tr.buffer, size, cpu);
> +	ret = ring_buffer_resize(tr->max_buffer.buffer, size, cpu);
>  	if (ret < 0) {
> -		int r = resize_buffer_duplicate_size(&global_trace,
> -						     &global_trace, cpu);
> +		int r = resize_buffer_duplicate_size(&tr->trace_buffer,
> +						     &tr->trace_buffer, cpu);
>  		if (r < 0) {
>  			/*
>  			 * AARGH! We are left with different
> @@ -3176,20 +3553,23 @@ static int __tracing_resize_ring_buffer(unsigned long size, int cpu)
>  	}
>  
>  	if (cpu == RING_BUFFER_ALL_CPUS)
> -		set_buffer_entries(&max_tr, size);
> +		set_buffer_entries(&tr->max_buffer, size);
>  	else
> -		max_tr.data[cpu]->entries = size;
> +		per_cpu_ptr(tr->max_buffer.data, cpu)->entries = size;
>  
>   out:
> +#endif /* CONFIG_TRACER_MAX_TRACE */
> +
>  	if (cpu == RING_BUFFER_ALL_CPUS)
> -		set_buffer_entries(&global_trace, size);
> +		set_buffer_entries(&tr->trace_buffer, size);
>  	else
> -		global_trace.data[cpu]->entries = size;
> +		per_cpu_ptr(tr->trace_buffer.data, cpu)->entries = size;
>  
>  	return ret;
>  }
>  
> -static ssize_t tracing_resize_ring_buffer(unsigned long size, int cpu_id)
> +static ssize_t tracing_resize_ring_buffer(struct trace_array *tr,
> +					  unsigned long size, int cpu_id)
>  {
>  	int ret = size;
>  
> @@ -3203,7 +3583,7 @@ static ssize_t tracing_resize_ring_buffer(unsigned long size, int cpu_id)
>  		}
>  	}
>  
> -	ret = __tracing_resize_ring_buffer(size, cpu_id);
> +	ret = __tracing_resize_ring_buffer(tr, size, cpu_id);
>  	if (ret < 0)
>  		ret = -ENOMEM;
>  
> @@ -3230,7 +3610,7 @@ int tracing_update_buffers(void)
>  
>  	mutex_lock(&trace_types_lock);
>  	if (!ring_buffer_expanded)
> -		ret = __tracing_resize_ring_buffer(trace_buf_size,
> +		ret = __tracing_resize_ring_buffer(&global_trace, trace_buf_size,
>  						RING_BUFFER_ALL_CPUS);
>  	mutex_unlock(&trace_types_lock);
>  
> @@ -3240,7 +3620,7 @@ int tracing_update_buffers(void)
>  struct trace_option_dentry;
>  
>  static struct trace_option_dentry *
> -create_trace_option_files(struct tracer *tracer);
> +create_trace_option_files(struct trace_array *tr, struct tracer *tracer);
>  
>  static void
>  destroy_trace_option_files(struct trace_option_dentry *topts);
> @@ -3250,13 +3630,15 @@ static int tracing_set_tracer(const char *buf)
>  	static struct trace_option_dentry *topts;
>  	struct trace_array *tr = &global_trace;
>  	struct tracer *t;
> +#ifdef CONFIG_TRACER_MAX_TRACE
>  	bool had_max_tr;
> +#endif
>  	int ret = 0;
>  
>  	mutex_lock(&trace_types_lock);
>  
>  	if (!ring_buffer_expanded) {
> -		ret = __tracing_resize_ring_buffer(trace_buf_size,
> +		ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
>  						RING_BUFFER_ALL_CPUS);
>  		if (ret < 0)
>  			goto out;
> @@ -3271,18 +3653,21 @@ static int tracing_set_tracer(const char *buf)
>  		ret = -EINVAL;
>  		goto out;
>  	}
> -	if (t == current_trace)
> +	if (t == tr->current_trace)
>  		goto out;
>  
>  	trace_branch_disable();
>  
> -	current_trace->enabled = false;
> +	tr->current_trace->enabled = false;
>  
> -	if (current_trace->reset)
> -		current_trace->reset(tr);
> +	if (tr->current_trace->reset)
> +		tr->current_trace->reset(tr);
>  
> -	had_max_tr = current_trace->allocated_snapshot;
> -	current_trace = &nop_trace;
> +	/* Current trace needs to be nop_trace before synchronize_sched */
> +	tr->current_trace = &nop_trace;
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	had_max_tr = tr->allocated_snapshot;
>  
>  	if (had_max_tr && !t->use_max_tr) {
>  		/*
> @@ -3293,27 +3678,20 @@ static int tracing_set_tracer(const char *buf)
>  		 * so a synchronized_sched() is sufficient.
>  		 */
>  		synchronize_sched();
> -		/*
> -		 * We don't free the ring buffer. instead, resize it because
> -		 * The max_tr ring buffer has some state (e.g. ring->clock) and
> -		 * we want preserve it.
> -		 */
> -		ring_buffer_resize(max_tr.buffer, 1, RING_BUFFER_ALL_CPUS);
> -		set_buffer_entries(&max_tr, 1);
> -		tracing_reset_online_cpus(&max_tr);
> -		current_trace->allocated_snapshot = false;
> +		free_snapshot(tr);
>  	}
> +#endif
>  	destroy_trace_option_files(topts);
>  
> -	topts = create_trace_option_files(t);
> +	topts = create_trace_option_files(tr, t);
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
>  	if (t->use_max_tr && !had_max_tr) {
> -		/* we need to make per cpu buffer sizes equivalent */
> -		ret = resize_buffer_duplicate_size(&max_tr, &global_trace,
> -						   RING_BUFFER_ALL_CPUS);
> +		ret = alloc_snapshot(tr);
>  		if (ret < 0)
>  			goto out;
> -		t->allocated_snapshot = true;
>  	}
> +#endif
>  
>  	if (t->init) {
>  		ret = tracer_init(t, tr);
> @@ -3321,8 +3699,8 @@ static int tracing_set_tracer(const char *buf)
>  			goto out;
>  	}
>  
> -	current_trace = t;
> -	current_trace->enabled = true;
> +	tr->current_trace = t;
> +	tr->current_trace->enabled = true;
>  	trace_branch_enable(tr);
>   out:
>  	mutex_unlock(&trace_types_lock);
> @@ -3396,7 +3774,8 @@ tracing_max_lat_write(struct file *filp, const char __user *ubuf,
>  
>  static int tracing_open_pipe(struct inode *inode, struct file *filp)
>  {
> -	long cpu_file = (long) inode->i_private;
> +	struct trace_cpu *tc = inode->i_private;
> +	struct trace_array *tr = tc->tr;
>  	struct trace_iterator *iter;
>  	int ret = 0;
>  
> @@ -3421,7 +3800,7 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
>  		ret = -ENOMEM;
>  		goto fail;
>  	}
> -	*iter->trace = *current_trace;
> +	*iter->trace = *tr->current_trace;
>  
>  	if (!alloc_cpumask_var(&iter->started, GFP_KERNEL)) {
>  		ret = -ENOMEM;
> @@ -3438,8 +3817,9 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
>  	if (trace_clocks[trace_clock_id].in_ns)
>  		iter->iter_flags |= TRACE_FILE_TIME_IN_NS;
>  
> -	iter->cpu_file = cpu_file;
> -	iter->tr = &global_trace;
> +	iter->cpu_file = tc->cpu;
> +	iter->tr = tc->tr;
> +	iter->trace_buffer = &tc->tr->trace_buffer;
>  	mutex_init(&iter->mutex);
>  	filp->private_data = iter;
>  
> @@ -3478,24 +3858,28 @@ static int tracing_release_pipe(struct inode *inode, struct file *file)
>  }
>  
>  static unsigned int
> -tracing_poll_pipe(struct file *filp, poll_table *poll_table)
> +trace_poll(struct trace_iterator *iter, struct file *filp, poll_table *poll_table)
>  {
> -	struct trace_iterator *iter = filp->private_data;
> +	/* Iterators are static, they should be filled or empty */
> +	if (trace_buffer_iter(iter, iter->cpu_file))
> +		return POLLIN | POLLRDNORM;
>  
> -	if (trace_flags & TRACE_ITER_BLOCK) {
> +	if (trace_flags & TRACE_ITER_BLOCK)
>  		/*
>  		 * Always select as readable when in blocking mode
>  		 */
>  		return POLLIN | POLLRDNORM;
> -	} else {
> -		if (!trace_empty(iter))
> -			return POLLIN | POLLRDNORM;
> -		poll_wait(filp, &trace_wait, poll_table);
> -		if (!trace_empty(iter))
> -			return POLLIN | POLLRDNORM;
> +	else
> +		return ring_buffer_poll_wait(iter->trace_buffer->buffer, iter->cpu_file,
> +					     filp, poll_table);
> +}
>  
> -		return 0;
> -	}
> +static unsigned int
> +tracing_poll_pipe(struct file *filp, poll_table *poll_table)
> +{
> +	struct trace_iterator *iter = filp->private_data;
> +
> +	return trace_poll(iter, filp, poll_table);
>  }
>  
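
Now that the pipe readers go through ring_buffer_poll_wait(), a reader can
sleep in poll(2) until data shows up instead of busy-looping. A rough
user-space illustration (the path assumes the usual debugfs mount):

    /* sketch: block in poll(2) on trace_pipe and drain whatever arrives */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct pollfd pfd;
        char buf[4096];
        ssize_t n;

        pfd.fd = open("/sys/kernel/debug/tracing/trace_pipe",
                  O_RDONLY | O_NONBLOCK);
        if (pfd.fd < 0) {
            perror("trace_pipe");
            return 1;
        }
        pfd.events = POLLIN;

        while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            n = read(pfd.fd, buf, sizeof(buf));
            if (n <= 0)
                break;
            write(STDOUT_FILENO, buf, n);
        }

        close(pfd.fd);
        return 0;
    }
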
>  /*
> @@ -3561,6 +3945,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf,
>  		  size_t cnt, loff_t *ppos)
>  {
>  	struct trace_iterator *iter = filp->private_data;
> +	struct trace_array *tr = iter->tr;
>  	ssize_t sret;
>  
>  	/* return any leftover data */
> @@ -3572,8 +3957,8 @@ tracing_read_pipe(struct file *filp, char __user *ubuf,
>  
>  	/* copy the tracer to avoid using a global lock all around */
>  	mutex_lock(&trace_types_lock);
> -	if (unlikely(iter->trace->name != current_trace->name))
> -		*iter->trace = *current_trace;
> +	if (unlikely(iter->trace->name != tr->current_trace->name))
> +		*iter->trace = *tr->current_trace;
>  	mutex_unlock(&trace_types_lock);
>  
>  	/*
> @@ -3729,6 +4114,7 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>  		.ops		= &tracing_pipe_buf_ops,
>  		.spd_release	= tracing_spd_release_pipe,
>  	};
> +	struct trace_array *tr = iter->tr;
>  	ssize_t ret;
>  	size_t rem;
>  	unsigned int i;
> @@ -3738,8 +4124,8 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>  
>  	/* copy the tracer to avoid using a global lock all around */
>  	mutex_lock(&trace_types_lock);
> -	if (unlikely(iter->trace->name != current_trace->name))
> -		*iter->trace = *current_trace;
> +	if (unlikely(iter->trace->name != tr->current_trace->name))
> +		*iter->trace = *tr->current_trace;
>  	mutex_unlock(&trace_types_lock);
>  
>  	mutex_lock(&iter->mutex);
> @@ -3801,43 +4187,19 @@ out_err:
>  	goto out;
>  }
>  
> -struct ftrace_entries_info {
> -	struct trace_array	*tr;
> -	int			cpu;
> -};
> -
> -static int tracing_entries_open(struct inode *inode, struct file *filp)
> -{
> -	struct ftrace_entries_info *info;
> -
> -	if (tracing_disabled)
> -		return -ENODEV;
> -
> -	info = kzalloc(sizeof(*info), GFP_KERNEL);
> -	if (!info)
> -		return -ENOMEM;
> -
> -	info->tr = &global_trace;
> -	info->cpu = (unsigned long)inode->i_private;
> -
> -	filp->private_data = info;
> -
> -	return 0;
> -}
> -
>  static ssize_t
>  tracing_entries_read(struct file *filp, char __user *ubuf,
>  		     size_t cnt, loff_t *ppos)
>  {
> -	struct ftrace_entries_info *info = filp->private_data;
> -	struct trace_array *tr = info->tr;
> +	struct trace_cpu *tc = filp->private_data;
> +	struct trace_array *tr = tc->tr;
>  	char buf[64];
>  	int r = 0;
>  	ssize_t ret;
>  
>  	mutex_lock(&trace_types_lock);
>  
> -	if (info->cpu == RING_BUFFER_ALL_CPUS) {
> +	if (tc->cpu == RING_BUFFER_ALL_CPUS) {
>  		int cpu, buf_size_same;
>  		unsigned long size;
>  
> @@ -3847,8 +4209,8 @@ tracing_entries_read(struct file *filp, char __user *ubuf,
>  		for_each_tracing_cpu(cpu) {
>  			/* fill in the size from first enabled cpu */
>  			if (size == 0)
> -				size = tr->data[cpu]->entries;
> -			if (size != tr->data[cpu]->entries) {
> +				size = per_cpu_ptr(tr->trace_buffer.data, cpu)->entries;
> +			if (size != per_cpu_ptr(tr->trace_buffer.data, cpu)->entries) {
>  				buf_size_same = 0;
>  				break;
>  			}
> @@ -3864,7 +4226,7 @@ tracing_entries_read(struct file *filp, char __user *ubuf,
>  		} else
>  			r = sprintf(buf, "X\n");
>  	} else
> -		r = sprintf(buf, "%lu\n", tr->data[info->cpu]->entries >> 10);
> +		r = sprintf(buf, "%lu\n", per_cpu_ptr(tr->trace_buffer.data, tc->cpu)->entries >> 10);
>  
>  	mutex_unlock(&trace_types_lock);
>  
> @@ -3876,7 +4238,7 @@ static ssize_t
>  tracing_entries_write(struct file *filp, const char __user *ubuf,
>  		      size_t cnt, loff_t *ppos)
>  {
> -	struct ftrace_entries_info *info = filp->private_data;
> +	struct trace_cpu *tc = filp->private_data;
>  	unsigned long val;
>  	int ret;
>  
> @@ -3891,7 +4253,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
>  	/* value is in KB */
>  	val <<= 10;
>  
> -	ret = tracing_resize_ring_buffer(val, info->cpu);
> +	ret = tracing_resize_ring_buffer(tc->tr, val, tc->cpu);
>  	if (ret < 0)
>  		return ret;
>  
> @@ -3900,16 +4262,6 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
>  	return cnt;
>  }
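
With buffer_size_kb now backed by struct trace_cpu, the same write path
resizes either every CPU (the top-level file) or one CPU (its per_cpu
copy). From user space that is just a write of the size in KB; a sketch,
assuming the standard debugfs layout and that CPU 0's directory is
per_cpu/cpu0:

    /* sketch: resize CPU 0's ring buffer to 2048 KB */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path =
            "/sys/kernel/debug/tracing/per_cpu/cpu0/buffer_size_kb";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
            perror(path);
            return 1;
        }
        if (write(fd, "2048", 4) < 0)
            perror("resize");
        close(fd);
        return 0;
    }
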
>  
> -static int
> -tracing_entries_release(struct inode *inode, struct file *filp)
> -{
> -	struct ftrace_entries_info *info = filp->private_data;
> -
> -	kfree(info);
> -
> -	return 0;
> -}
> -
>  static ssize_t
>  tracing_total_entries_read(struct file *filp, char __user *ubuf,
>  				size_t cnt, loff_t *ppos)
> @@ -3921,7 +4273,7 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf,
>  
>  	mutex_lock(&trace_types_lock);
>  	for_each_tracing_cpu(cpu) {
> -		size += tr->data[cpu]->entries >> 10;
> +		size += per_cpu_ptr(tr->trace_buffer.data, cpu)->entries >> 10;
>  		if (!ring_buffer_expanded)
>  			expanded_size += trace_buf_size >> 10;
>  	}
> @@ -3951,11 +4303,13 @@ tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
>  static int
>  tracing_free_buffer_release(struct inode *inode, struct file *filp)
>  {
> +	struct trace_array *tr = inode->i_private;
> +
>  	/* disable tracing ? */
>  	if (trace_flags & TRACE_ITER_STOP_ON_FREE)
>  		tracing_off();
>  	/* resize the ring buffer to 0 */
> -	tracing_resize_ring_buffer(0, RING_BUFFER_ALL_CPUS);
> +	tracing_resize_ring_buffer(tr, 0, RING_BUFFER_ALL_CPUS);
>  
>  	return 0;
>  }
> @@ -4024,7 +4378,7 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
>  
>  	local_save_flags(irq_flags);
>  	size = sizeof(*entry) + cnt + 2; /* possible \n added */
> -	buffer = global_trace.buffer;
> +	buffer = global_trace.trace_buffer.buffer;
>  	event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
>  					  irq_flags, preempt_count());
>  	if (!event) {
> @@ -4066,13 +4420,14 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
>  
>  static int tracing_clock_show(struct seq_file *m, void *v)
>  {
> +	struct trace_array *tr = m->private;
>  	int i;
>  
>  	for (i = 0; i < ARRAY_SIZE(trace_clocks); i++)
>  		seq_printf(m,
>  			"%s%s%s%s", i ? " " : "",
> -			i == trace_clock_id ? "[" : "", trace_clocks[i].name,
> -			i == trace_clock_id ? "]" : "");
> +			i == tr->clock_id ? "[" : "", trace_clocks[i].name,
> +			i == tr->clock_id ? "]" : "");
>  	seq_putc(m, '\n');
>  
>  	return 0;
> @@ -4081,6 +4436,8 @@ static int tracing_clock_show(struct seq_file *m, void *v)
>  static ssize_t tracing_clock_write(struct file *filp, const char __user *ubuf,
>  				   size_t cnt, loff_t *fpos)
>  {
> +	struct seq_file *m = filp->private_data;
> +	struct trace_array *tr = m->private;
>  	char buf[64];
>  	const char *clockstr;
>  	int i;
> @@ -4102,20 +4459,23 @@ static ssize_t tracing_clock_write(struct file *filp, const char __user *ubuf,
>  	if (i == ARRAY_SIZE(trace_clocks))
>  		return -EINVAL;
>  
> -	trace_clock_id = i;
> -
>  	mutex_lock(&trace_types_lock);
>  
> -	ring_buffer_set_clock(global_trace.buffer, trace_clocks[i].func);
> -	if (max_tr.buffer)
> -		ring_buffer_set_clock(max_tr.buffer, trace_clocks[i].func);
> +	tr->clock_id = i;
> +
> +	ring_buffer_set_clock(tr->trace_buffer.buffer, trace_clocks[i].func);
>  
>  	/*
>  	 * New clock may not be consistent with the previous clock.
>  	 * Reset the buffer so that it doesn't have incomparable timestamps.
>  	 */
> -	tracing_reset_online_cpus(&global_trace);
> -	tracing_reset_online_cpus(&max_tr);
> +	tracing_reset_online_cpus(&global_trace.trace_buffer);
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	if (tr->flags & TRACE_ARRAY_FL_GLOBAL && tr->max_buffer.buffer)
> +		ring_buffer_set_clock(tr->max_buffer.buffer, trace_clocks[i].func);
> +	tracing_reset_online_cpus(&global_trace.max_buffer);
> +#endif
>  
>  	mutex_unlock(&trace_types_lock);
>  
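
The clock is now a per-trace_array property (tr->clock_id), so each
instance can pick its own. Selecting one is unchanged from user space; a
quick sketch, assuming the standard mount point, where the bracketed entry
in the read-back is the active clock:

    /* sketch: switch to the "global" clock and show the selection */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/kernel/debug/tracing/trace_clock";
        char buf[256];
        ssize_t n;
        int fd;

        fd = open(path, O_WRONLY);
        if (fd < 0) {
            perror(path);
            return 1;
        }
        if (write(fd, "global", 6) < 0)
            perror("set clock");
        close(fd);

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return 1;
        n = read(fd, buf, sizeof(buf) - 1);
        if (n > 0) {
            buf[n] = '\0';
            fputs(buf, stdout);  /* e.g. "local [global] counter uptime perf" */
        }
        close(fd);
        return 0;
    }
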
> @@ -4128,20 +4488,45 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
>  {
>  	if (tracing_disabled)
>  		return -ENODEV;
> -	return single_open(file, tracing_clock_show, NULL);
> +
> +	return single_open(file, tracing_clock_show, inode->i_private);
>  }
>  
> +struct ftrace_buffer_info {
> +	struct trace_iterator	iter;
> +	void			*spare;
> +	unsigned int		read;
> +};
> +
>  #ifdef CONFIG_TRACER_SNAPSHOT
>  static int tracing_snapshot_open(struct inode *inode, struct file *file)
>  {
> +	struct trace_cpu *tc = inode->i_private;
>  	struct trace_iterator *iter;
> +	struct seq_file *m;
>  	int ret = 0;
>  
>  	if (file->f_mode & FMODE_READ) {
>  		iter = __tracing_open(inode, file, true);
>  		if (IS_ERR(iter))
>  			ret = PTR_ERR(iter);
> +	} else {
> +		/* Writes still need the seq_file to hold the private data */
> +		m = kzalloc(sizeof(*m), GFP_KERNEL);
> +		if (!m)
> +			return -ENOMEM;
> +		iter = kzalloc(sizeof(*iter), GFP_KERNEL);
> +		if (!iter) {
> +			kfree(m);
> +			return -ENOMEM;
> +		}
> +		iter->tr = tc->tr;
> +		iter->trace_buffer = &tc->tr->max_buffer;
> +		iter->cpu_file = tc->cpu;
> +		m->private = iter;
> +		file->private_data = m;
>  	}
> +
>  	return ret;
>  }
>  
> @@ -4149,6 +4534,9 @@ static ssize_t
>  tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  		       loff_t *ppos)
>  {
> +	struct seq_file *m = filp->private_data;
> +	struct trace_iterator *iter = m->private;
> +	struct trace_array *tr = iter->tr;
>  	unsigned long val;
>  	int ret;
>  
> @@ -4162,40 +4550,48 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  
>  	mutex_lock(&trace_types_lock);
>  
> -	if (current_trace->use_max_tr) {
> +	if (tr->current_trace->use_max_tr) {
>  		ret = -EBUSY;
>  		goto out;
>  	}
>  
>  	switch (val) {
>  	case 0:
> -		if (current_trace->allocated_snapshot) {
> -			/* free spare buffer */
> -			ring_buffer_resize(max_tr.buffer, 1,
> -					   RING_BUFFER_ALL_CPUS);
> -			set_buffer_entries(&max_tr, 1);
> -			tracing_reset_online_cpus(&max_tr);
> -			current_trace->allocated_snapshot = false;
> +		if (iter->cpu_file != RING_BUFFER_ALL_CPUS) {
> +			ret = -EINVAL;
> +			break;
>  		}
> +		if (tr->allocated_snapshot)
> +			free_snapshot(tr);
>  		break;
>  	case 1:
> -		if (!current_trace->allocated_snapshot) {
> -			/* allocate spare buffer */
> -			ret = resize_buffer_duplicate_size(&max_tr,
> -					&global_trace, RING_BUFFER_ALL_CPUS);
> +/* Only allow per-cpu swap if the ring buffer supports it */
> +#ifndef CONFIG_RING_BUFFER_ALLOW_SWAP
> +		if (iter->cpu_file != RING_BUFFER_ALL_CPUS) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +#endif
> +		if (!tr->allocated_snapshot) {
> +			ret = alloc_snapshot(tr);
>  			if (ret < 0)
>  				break;
> -			current_trace->allocated_snapshot = true;
>  		}
> -
>  		local_irq_disable();
>  		/* Now, we're going to swap */
> -		update_max_tr(&global_trace, current, smp_processor_id());
> +		if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
> +			update_max_tr(tr, current, smp_processor_id());
> +		else
> +			update_max_tr_single(tr, current, iter->cpu_file);
>  		local_irq_enable();
>  		break;
>  	default:
> -		if (current_trace->allocated_snapshot)
> -			tracing_reset_online_cpus(&max_tr);
> +		if (tr->allocated_snapshot) {
> +			if (iter->cpu_file == RING_BUFFER_ALL_CPUS)
> +				tracing_reset_online_cpus(&tr->max_buffer);
> +			else
> +				tracing_reset(&tr->max_buffer, iter->cpu_file);
> +		}
>  		break;
>  	}
>  
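
With the per-cpu snapshot files this creates, a single CPU can be swapped
and read back without touching the others (per-cpu swap additionally needs
CONFIG_RING_BUFFER_ALLOW_SWAP, as the check above shows). A user-space
sketch, assuming CONFIG_TRACER_SNAPSHOT and that CPU 0's directory is
per_cpu/cpu0 under the usual debugfs mount:

    /* sketch: snapshot CPU 0 only, then dump the snapshot buffer */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *snap = "/sys/kernel/debug/tracing/per_cpu/cpu0/snapshot";
        char buf[4096];
        ssize_t n;
        int fd;

        fd = open(snap, O_WRONLY);
        if (fd < 0) {
            perror(snap);
            return 1;
        }
        /* writing 1 swaps the live buffer with the snapshot buffer */
        if (write(fd, "1", 1) < 0)
            perror("swap");
        close(fd);

        fd = open(snap, O_RDONLY);
        if (fd < 0) {
            perror(snap);
            return 1;
        }
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            write(STDOUT_FILENO, buf, n);
        close(fd);
        return 0;
    }
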
> @@ -4207,6 +4603,51 @@ out:
>  	mutex_unlock(&trace_types_lock);
>  	return ret;
>  }
> +
> +static int tracing_snapshot_release(struct inode *inode, struct file *file)
> +{
> +	struct seq_file *m = file->private_data;
> +
> +	if (file->f_mode & FMODE_READ)
> +		return tracing_release(inode, file);
> +
> +	/* If write only, the seq_file is just a stub */
> +	if (m)
> +		kfree(m->private);
> +	kfree(m);
> +
> +	return 0;
> +}
> +
> +static int tracing_buffers_open(struct inode *inode, struct file *filp);
> +static ssize_t tracing_buffers_read(struct file *filp, char __user *ubuf,
> +				    size_t count, loff_t *ppos);
> +static int tracing_buffers_release(struct inode *inode, struct file *file);
> +static ssize_t tracing_buffers_splice_read(struct file *file, loff_t *ppos,
> +		   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
> +
> +static int snapshot_raw_open(struct inode *inode, struct file *filp)
> +{
> +	struct ftrace_buffer_info *info;
> +	int ret;
> +
> +	ret = tracing_buffers_open(inode, filp);
> +	if (ret < 0)
> +		return ret;
> +
> +	info = filp->private_data;
> +
> +	if (info->iter.trace->use_max_tr) {
> +		tracing_buffers_release(inode, filp);
> +		return -EBUSY;
> +	}
> +
> +	info->iter.snapshot = true;
> +	info->iter.trace_buffer = &info->iter.tr->max_buffer;
> +
> +	return ret;
> +}
> +
>  #endif /* CONFIG_TRACER_SNAPSHOT */
>  
> 
> @@ -4234,10 +4675,9 @@ static const struct file_operations tracing_pipe_fops = {
>  };
>  
>  static const struct file_operations tracing_entries_fops = {
> -	.open		= tracing_entries_open,
> +	.open		= tracing_open_generic,
>  	.read		= tracing_entries_read,
>  	.write		= tracing_entries_write,
> -	.release	= tracing_entries_release,
>  	.llseek		= generic_file_llseek,
>  };
>  
> @@ -4272,20 +4712,23 @@ static const struct file_operations snapshot_fops = {
>  	.read		= seq_read,
>  	.write		= tracing_snapshot_write,
>  	.llseek		= tracing_seek,
> -	.release	= tracing_release,
> +	.release	= tracing_snapshot_release,
>  };
> -#endif /* CONFIG_TRACER_SNAPSHOT */
>  
> -struct ftrace_buffer_info {
> -	struct trace_array	*tr;
> -	void			*spare;
> -	int			cpu;
> -	unsigned int		read;
> +static const struct file_operations snapshot_raw_fops = {
> +	.open		= snapshot_raw_open,
> +	.read		= tracing_buffers_read,
> +	.release	= tracing_buffers_release,
> +	.splice_read	= tracing_buffers_splice_read,
> +	.llseek		= no_llseek,
>  };
>  
> +#endif /* CONFIG_TRACER_SNAPSHOT */
> +
>  static int tracing_buffers_open(struct inode *inode, struct file *filp)
>  {
> -	int cpu = (int)(long)inode->i_private;
> +	struct trace_cpu *tc = inode->i_private;
> +	struct trace_array *tr = tc->tr;
>  	struct ftrace_buffer_info *info;
>  
>  	if (tracing_disabled)
> @@ -4295,72 +4738,131 @@ static int tracing_buffers_open(struct inode *inode, struct file *filp)
>  	if (!info)
>  		return -ENOMEM;
>  
> -	info->tr	= &global_trace;
> -	info->cpu	= cpu;
> -	info->spare	= NULL;
> +	mutex_lock(&trace_types_lock);
> +
> +	tr->ref++;
> +
> +	info->iter.tr		= tr;
> +	info->iter.cpu_file	= tc->cpu;
> +	info->iter.trace	= tr->current_trace;
> +	info->iter.trace_buffer = &tr->trace_buffer;
> +	info->spare		= NULL;
>  	/* Force reading ring buffer for first read */
> -	info->read	= (unsigned int)-1;
> +	info->read		= (unsigned int)-1;
>  
>  	filp->private_data = info;
>  
> +	mutex_unlock(&trace_types_lock);
> +
>  	return nonseekable_open(inode, filp);
>  }
>  
> +static unsigned int
> +tracing_buffers_poll(struct file *filp, poll_table *poll_table)
> +{
> +	struct ftrace_buffer_info *info = filp->private_data;
> +	struct trace_iterator *iter = &info->iter;
> +
> +	return trace_poll(iter, filp, poll_table);
> +}
> +
>  static ssize_t
>  tracing_buffers_read(struct file *filp, char __user *ubuf,
>  		     size_t count, loff_t *ppos)
>  {
>  	struct ftrace_buffer_info *info = filp->private_data;
> +	struct trace_iterator *iter = &info->iter;
>  	ssize_t ret;
> -	size_t size;
> +	ssize_t size;
>  
>  	if (!count)
>  		return 0;
>  
> +	mutex_lock(&trace_types_lock);
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	if (iter->snapshot && iter->tr->current_trace->use_max_tr) {
> +		size = -EBUSY;
> +		goto out_unlock;
> +	}
> +#endif
> +
>  	if (!info->spare)
> -		info->spare = ring_buffer_alloc_read_page(info->tr->buffer, info->cpu);
> +		info->spare = ring_buffer_alloc_read_page(iter->trace_buffer->buffer,
> +							  iter->cpu_file);
> +	size = -ENOMEM;
>  	if (!info->spare)
> -		return -ENOMEM;
> +		goto out_unlock;
>  
>  	/* Do we have previous read data to read? */
>  	if (info->read < PAGE_SIZE)
>  		goto read;
>  
> -	trace_access_lock(info->cpu);
> -	ret = ring_buffer_read_page(info->tr->buffer,
> + again:
> +	trace_access_lock(iter->cpu_file);
> +	ret = ring_buffer_read_page(iter->trace_buffer->buffer,
>  				    &info->spare,
>  				    count,
> -				    info->cpu, 0);
> -	trace_access_unlock(info->cpu);
> -	if (ret < 0)
> -		return 0;
> +				    iter->cpu_file, 0);
> +	trace_access_unlock(iter->cpu_file);
>  
> -	info->read = 0;
> +	if (ret < 0) {
> +		if (trace_empty(iter)) {
> +			if ((filp->f_flags & O_NONBLOCK)) {
> +				size = -EAGAIN;
> +				goto out_unlock;
> +			}
> +			mutex_unlock(&trace_types_lock);
> +			iter->trace->wait_pipe(iter);
> +			mutex_lock(&trace_types_lock);
> +			if (signal_pending(current)) {
> +				size = -EINTR;
> +				goto out_unlock;
> +			}
> +			goto again;
> +		}
> +		size = 0;
> +		goto out_unlock;
> +	}
>  
> -read:
> +	info->read = 0;
> + read:
>  	size = PAGE_SIZE - info->read;
>  	if (size > count)
>  		size = count;
>  
>  	ret = copy_to_user(ubuf, info->spare + info->read, size);
> -	if (ret == size)
> -		return -EFAULT;
> +	if (ret == size) {
> +		size = -EFAULT;
> +		goto out_unlock;
> +	}
>  	size -= ret;
>  
>  	*ppos += size;
>  	info->read += size;
>  
> + out_unlock:
> +	mutex_unlock(&trace_types_lock);
> +
>  	return size;
>  }
>  
>  static int tracing_buffers_release(struct inode *inode, struct file *file)
>  {
>  	struct ftrace_buffer_info *info = file->private_data;
> +	struct trace_iterator *iter = &info->iter;
> +
> +	mutex_lock(&trace_types_lock);
> +
> +	WARN_ON(!iter->tr->ref);
> +	iter->tr->ref--;
>  
>  	if (info->spare)
> -		ring_buffer_free_read_page(info->tr->buffer, info->spare);
> +		ring_buffer_free_read_page(iter->trace_buffer->buffer, info->spare);
>  	kfree(info);
>  
> +	mutex_unlock(&trace_types_lock);
> +
>  	return 0;
>  }
>  
> @@ -4425,6 +4927,7 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
>  			    unsigned int flags)
>  {
>  	struct ftrace_buffer_info *info = file->private_data;
> +	struct trace_iterator *iter = &info->iter;
>  	struct partial_page partial_def[PIPE_DEF_BUFFERS];
>  	struct page *pages_def[PIPE_DEF_BUFFERS];
>  	struct splice_pipe_desc spd = {
> @@ -4437,10 +4940,21 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
>  	};
>  	struct buffer_ref *ref;
>  	int entries, size, i;
> -	size_t ret;
> +	ssize_t ret;
>  
> -	if (splice_grow_spd(pipe, &spd))
> -		return -ENOMEM;
> +	mutex_lock(&trace_types_lock);
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	if (iter->snapshot && iter->tr->current_trace->use_max_tr) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +#endif
> +
> +	if (splice_grow_spd(pipe, &spd)) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
>  
>  	if (*ppos & (PAGE_SIZE - 1)) {
>  		ret = -EINVAL;
> @@ -4455,8 +4969,9 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
>  		len &= PAGE_MASK;
>  	}
>  
> -	trace_access_lock(info->cpu);
> -	entries = ring_buffer_entries_cpu(info->tr->buffer, info->cpu);
> + again:
> +	trace_access_lock(iter->cpu_file);
> +	entries = ring_buffer_entries_cpu(iter->trace_buffer->buffer, iter->cpu_file);
>  
>  	for (i = 0; i < pipe->buffers && len && entries; i++, len -= PAGE_SIZE) {
>  		struct page *page;
> @@ -4467,15 +4982,15 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
>  			break;
>  
>  		ref->ref = 1;
> -		ref->buffer = info->tr->buffer;
> -		ref->page = ring_buffer_alloc_read_page(ref->buffer, info->cpu);
> +		ref->buffer = iter->trace_buffer->buffer;
> +		ref->page = ring_buffer_alloc_read_page(ref->buffer, iter->cpu_file);
>  		if (!ref->page) {
>  			kfree(ref);
>  			break;
>  		}
>  
>  		r = ring_buffer_read_page(ref->buffer, &ref->page,
> -					  len, info->cpu, 1);
> +					  len, iter->cpu_file, 1);
>  		if (r < 0) {
>  			ring_buffer_free_read_page(ref->buffer, ref->page);
>  			kfree(ref);
> @@ -4499,31 +5014,40 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
>  		spd.nr_pages++;
>  		*ppos += PAGE_SIZE;
>  
> -		entries = ring_buffer_entries_cpu(info->tr->buffer, info->cpu);
> +		entries = ring_buffer_entries_cpu(iter->trace_buffer->buffer, iter->cpu_file);
>  	}
>  
> -	trace_access_unlock(info->cpu);
> +	trace_access_unlock(iter->cpu_file);
>  	spd.nr_pages = i;
>  
>  	/* did we read anything? */
>  	if (!spd.nr_pages) {
> -		if (flags & SPLICE_F_NONBLOCK)
> +		if ((file->f_flags & O_NONBLOCK) || (flags & SPLICE_F_NONBLOCK)) {
>  			ret = -EAGAIN;
> -		else
> -			ret = 0;
> -		/* TODO: block */
> -		goto out;
> +			goto out;
> +		}
> +		mutex_unlock(&trace_types_lock);
> +		iter->trace->wait_pipe(iter);
> +		mutex_lock(&trace_types_lock);
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			goto out;
> +		}
> +		goto again;
>  	}
>  
>  	ret = splice_to_pipe(pipe, &spd);
>  	splice_shrink_spd(&spd);
>  out:
> +	mutex_unlock(&trace_types_lock);
> +
>  	return ret;
>  }
>  
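
And since trace_pipe_raw now blocks (honouring O_NONBLOCK and
SPLICE_F_NONBLOCK), a binary reader can sit in splice(2) and move whole
ring-buffer pages without copying them through user space. A sketch,
assuming the standard debugfs path; the output file name is only an
example:

    /* sketch: splice raw ring-buffer pages from CPU 0 into a file */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *src =
            "/sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw";
        int in, out, p[2];
        ssize_t n;

        in = open(src, O_RDONLY);
        out = open("trace_cpu0.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0 || pipe(p) < 0) {
            perror("setup");
            return 1;
        }

        for (;;) {
            /* blocks until at least one full page of events is ready */
            n = splice(in, NULL, p[1], NULL, 4096, SPLICE_F_MOVE);
            if (n <= 0)
                break;
            splice(p[0], NULL, out, NULL, n, SPLICE_F_MOVE);
        }

        close(in);
        close(out);
        return 0;
    }
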
>  static const struct file_operations tracing_buffers_fops = {
>  	.open		= tracing_buffers_open,
>  	.read		= tracing_buffers_read,
> +	.poll		= tracing_buffers_poll,
>  	.release	= tracing_buffers_release,
>  	.splice_read	= tracing_buffers_splice_read,
>  	.llseek		= no_llseek,
> @@ -4533,12 +5057,14 @@ static ssize_t
>  tracing_stats_read(struct file *filp, char __user *ubuf,
>  		   size_t count, loff_t *ppos)
>  {
> -	unsigned long cpu = (unsigned long)filp->private_data;
> -	struct trace_array *tr = &global_trace;
> +	struct trace_cpu *tc = filp->private_data;
> +	struct trace_array *tr = tc->tr;
> +	struct trace_buffer *trace_buf = &tr->trace_buffer;
>  	struct trace_seq *s;
>  	unsigned long cnt;
>  	unsigned long long t;
>  	unsigned long usec_rem;
> +	int cpu = tc->cpu;
>  
>  	s = kmalloc(sizeof(*s), GFP_KERNEL);
>  	if (!s)
> @@ -4546,41 +5072,41 @@ tracing_stats_read(struct file *filp, char __user *ubuf,
>  
>  	trace_seq_init(s);
>  
> -	cnt = ring_buffer_entries_cpu(tr->buffer, cpu);
> +	cnt = ring_buffer_entries_cpu(trace_buf->buffer, cpu);
>  	trace_seq_printf(s, "entries: %ld\n", cnt);
>  
> -	cnt = ring_buffer_overrun_cpu(tr->buffer, cpu);
> +	cnt = ring_buffer_overrun_cpu(trace_buf->buffer, cpu);
>  	trace_seq_printf(s, "overrun: %ld\n", cnt);
>  
> -	cnt = ring_buffer_commit_overrun_cpu(tr->buffer, cpu);
> +	cnt = ring_buffer_commit_overrun_cpu(trace_buf->buffer, cpu);
>  	trace_seq_printf(s, "commit overrun: %ld\n", cnt);
>  
> -	cnt = ring_buffer_bytes_cpu(tr->buffer, cpu);
> +	cnt = ring_buffer_bytes_cpu(trace_buf->buffer, cpu);
>  	trace_seq_printf(s, "bytes: %ld\n", cnt);
>  
>  	if (trace_clocks[trace_clock_id].in_ns) {
>  		/* local or global for trace_clock */
> -		t = ns2usecs(ring_buffer_oldest_event_ts(tr->buffer, cpu));
> +		t = ns2usecs(ring_buffer_oldest_event_ts(trace_buf->buffer, cpu));
>  		usec_rem = do_div(t, USEC_PER_SEC);
>  		trace_seq_printf(s, "oldest event ts: %5llu.%06lu\n",
>  								t, usec_rem);
>  
> -		t = ns2usecs(ring_buffer_time_stamp(tr->buffer, cpu));
> +		t = ns2usecs(ring_buffer_time_stamp(trace_buf->buffer, cpu));
>  		usec_rem = do_div(t, USEC_PER_SEC);
>  		trace_seq_printf(s, "now ts: %5llu.%06lu\n", t, usec_rem);
>  	} else {
>  		/* counter or tsc mode for trace_clock */
>  		trace_seq_printf(s, "oldest event ts: %llu\n",
> -				ring_buffer_oldest_event_ts(tr->buffer, cpu));
> +				ring_buffer_oldest_event_ts(trace_buf->buffer, cpu));
>  
>  		trace_seq_printf(s, "now ts: %llu\n",
> -				ring_buffer_time_stamp(tr->buffer, cpu));
> +				ring_buffer_time_stamp(trace_buf->buffer, cpu));
>  	}
>  
> -	cnt = ring_buffer_dropped_events_cpu(tr->buffer, cpu);
> +	cnt = ring_buffer_dropped_events_cpu(trace_buf->buffer, cpu);
>  	trace_seq_printf(s, "dropped events: %ld\n", cnt);
>  
> -	cnt = ring_buffer_read_events_cpu(tr->buffer, cpu);
> +	cnt = ring_buffer_read_events_cpu(trace_buf->buffer, cpu);
>  	trace_seq_printf(s, "read events: %ld\n", cnt);
>  
>  	count = simple_read_from_buffer(ubuf, count, ppos, s->buffer, s->len);
> @@ -4632,60 +5158,161 @@ static const struct file_operations tracing_dyn_info_fops = {
>  	.read		= tracing_read_dyn_info,
>  	.llseek		= generic_file_llseek,
>  };
> -#endif
> +#endif /* CONFIG_DYNAMIC_FTRACE */
> +
> +#if defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE)
> +static void
> +ftrace_snapshot(unsigned long ip, unsigned long parent_ip, void **data)
> +{
> +	tracing_snapshot();
> +}
>  
> -static struct dentry *d_tracer;
> +static void
> +ftrace_count_snapshot(unsigned long ip, unsigned long parent_ip, void **data)
> +{
> +	unsigned long *count = (long *)data;
>  
> -struct dentry *tracing_init_dentry(void)
> +	if (!*count)
> +		return;
> +
> +	if (*count != -1)
> +		(*count)--;
> +
> +	tracing_snapshot();
> +}
> +
> +static int
> +ftrace_snapshot_print(struct seq_file *m, unsigned long ip,
> +		      struct ftrace_probe_ops *ops, void *data)
> +{
> +	long count = (long)data;
> +
> +	seq_printf(m, "%ps:", (void *)ip);
> +
> +	seq_printf(m, "snapshot");
> +
> +	if (count == -1)
> +		seq_printf(m, ":unlimited\n");
> +	else
> +		seq_printf(m, ":count=%ld\n", count);
> +
> +	return 0;
> +}
> +
> +static struct ftrace_probe_ops snapshot_probe_ops = {
> +	.func			= ftrace_snapshot,
> +	.print			= ftrace_snapshot_print,
> +};
> +
> +static struct ftrace_probe_ops snapshot_count_probe_ops = {
> +	.func			= ftrace_count_snapshot,
> +	.print			= ftrace_snapshot_print,
> +};
> +
> +static int
> +ftrace_trace_snapshot_callback(struct ftrace_hash *hash,
> +			       char *glob, char *cmd, char *param, int enable)
>  {
> -	static int once;
> +	struct ftrace_probe_ops *ops;
> +	void *count = (void *)-1;
> +	char *number;
> +	int ret;
> +
> +	/* hash funcs only work with set_ftrace_filter */
> +	if (!enable)
> +		return -EINVAL;
> +
> +	ops = param ? &snapshot_count_probe_ops :  &snapshot_probe_ops;
> +
> +	if (glob[0] == '!') {
> +		unregister_ftrace_function_probe_func(glob+1, ops);
> +		return 0;
> +	}
> +
> +	if (!param)
> +		goto out_reg;
> +
> +	number = strsep(&param, ":");
> +
> +	if (!strlen(number))
> +		goto out_reg;
> +
> +	/*
> +	 * We use the callback data field (which is a pointer)
> +	 * as our counter.
> +	 */
> +	ret = kstrtoul(number, 0, (unsigned long *)&count);
> +	if (ret)
> +		return ret;
> +
> + out_reg:
> +	ret = register_ftrace_function_probe(glob, ops, count);
> +
> +	if (ret >= 0)
> +		alloc_snapshot(&global_trace);
> +
> +	return ret < 0 ? ret : 0;
> +}
> +
> +static struct ftrace_func_command ftrace_snapshot_cmd = {
> +	.name			= "snapshot",
> +	.func			= ftrace_trace_snapshot_callback,
> +};
> +
> +static int register_snapshot_cmd(void)
> +{
> +	return register_ftrace_command(&ftrace_snapshot_cmd);
> +}
> +#else
> +static inline int register_snapshot_cmd(void) { return 0; }
> +#endif /* defined(CONFIG_TRACER_SNAPSHOT) && defined(CONFIG_DYNAMIC_FTRACE) */
>  
> -	if (d_tracer)
> -		return d_tracer;
> +struct dentry *tracing_init_dentry_tr(struct trace_array *tr)
> +{
> +	if (tr->dir)
> +		return tr->dir;
>  
>  	if (!debugfs_initialized())
>  		return NULL;
>  
> -	d_tracer = debugfs_create_dir("tracing", NULL);
> +	if (tr->flags & TRACE_ARRAY_FL_GLOBAL)
> +		tr->dir = debugfs_create_dir("tracing", NULL);
>  
> -	if (!d_tracer && !once) {
> -		once = 1;
> -		pr_warning("Could not create debugfs directory 'tracing'\n");
> -		return NULL;
> -	}
> +	if (!tr->dir)
> +		pr_warn_once("Could not create debugfs directory 'tracing'\n");
>  
> -	return d_tracer;
> +	return tr->dir;
>  }
>  
> -static struct dentry *d_percpu;
> +struct dentry *tracing_init_dentry(void)
> +{
> +	return tracing_init_dentry_tr(&global_trace);
> +}
>  
> -static struct dentry *tracing_dentry_percpu(void)
> +static struct dentry *tracing_dentry_percpu(struct trace_array *tr, int cpu)
>  {
> -	static int once;
>  	struct dentry *d_tracer;
>  
> -	if (d_percpu)
> -		return d_percpu;
> -
> -	d_tracer = tracing_init_dentry();
> +	if (tr->percpu_dir)
> +		return tr->percpu_dir;
>  
> +	d_tracer = tracing_init_dentry_tr(tr);
>  	if (!d_tracer)
>  		return NULL;
>  
> -	d_percpu = debugfs_create_dir("per_cpu", d_tracer);
> +	tr->percpu_dir = debugfs_create_dir("per_cpu", d_tracer);
>  
> -	if (!d_percpu && !once) {
> -		once = 1;
> -		pr_warning("Could not create debugfs directory 'per_cpu'\n");
> -		return NULL;
> -	}
> +	WARN_ONCE(!tr->percpu_dir,
> +		  "Could not create debugfs directory 'per_cpu/%d'\n", cpu);
>  
> -	return d_percpu;
> +	return tr->percpu_dir;
>  }
>  
> -static void tracing_init_debugfs_percpu(long cpu)
> +static void
> +tracing_init_debugfs_percpu(struct trace_array *tr, long cpu)
>  {
> -	struct dentry *d_percpu = tracing_dentry_percpu();
> +	struct trace_array_cpu *data = per_cpu_ptr(tr->trace_buffer.data, cpu);
> +	struct dentry *d_percpu = tracing_dentry_percpu(tr, cpu);
>  	struct dentry *d_cpu;
>  	char cpu_dir[30]; /* 30 characters should be more than enough */
>  
> @@ -4701,20 +5328,28 @@ static void tracing_init_debugfs_percpu(long cpu)
>  
>  	/* per cpu trace_pipe */
>  	trace_create_file("trace_pipe", 0444, d_cpu,
> -			(void *) cpu, &tracing_pipe_fops);
> +			(void *)&data->trace_cpu, &tracing_pipe_fops);
>  
>  	/* per cpu trace */
>  	trace_create_file("trace", 0644, d_cpu,
> -			(void *) cpu, &tracing_fops);
> +			(void *)&data->trace_cpu, &tracing_fops);
>  
>  	trace_create_file("trace_pipe_raw", 0444, d_cpu,
> -			(void *) cpu, &tracing_buffers_fops);
> +			(void *)&data->trace_cpu, &tracing_buffers_fops);
>  
>  	trace_create_file("stats", 0444, d_cpu,
> -			(void *) cpu, &tracing_stats_fops);
> +			(void *)&data->trace_cpu, &tracing_stats_fops);
>  
>  	trace_create_file("buffer_size_kb", 0444, d_cpu,
> -			(void *) cpu, &tracing_entries_fops);
> +			(void *)&data->trace_cpu, &tracing_entries_fops);
> +
> +#ifdef CONFIG_TRACER_SNAPSHOT
> +	trace_create_file("snapshot", 0644, d_cpu,
> +			  (void *)&data->trace_cpu, &snapshot_fops);
> +
> +	trace_create_file("snapshot_raw", 0444, d_cpu,
> +			(void *)&data->trace_cpu, &snapshot_raw_fops);
> +#endif
>  }
>  
>  #ifdef CONFIG_FTRACE_SELFTEST
> @@ -4725,6 +5360,7 @@ static void tracing_init_debugfs_percpu(long cpu)
>  struct trace_option_dentry {
>  	struct tracer_opt		*opt;
>  	struct tracer_flags		*flags;
> +	struct trace_array		*tr;
>  	struct dentry			*entry;
>  };
>  
> @@ -4760,7 +5396,7 @@ trace_options_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  
>  	if (!!(topt->flags->val & topt->opt->bit) != val) {
>  		mutex_lock(&trace_types_lock);
> -		ret = __set_tracer_option(current_trace, topt->flags,
> +		ret = __set_tracer_option(topt->tr->current_trace, topt->flags,
>  					  topt->opt, !val);
>  		mutex_unlock(&trace_types_lock);
>  		if (ret)
> @@ -4799,6 +5435,7 @@ static ssize_t
>  trace_options_core_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  			 loff_t *ppos)
>  {
> +	struct trace_array *tr = &global_trace;
>  	long index = (long)filp->private_data;
>  	unsigned long val;
>  	int ret;
> @@ -4811,7 +5448,7 @@ trace_options_core_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  		return -EINVAL;
>  
>  	mutex_lock(&trace_types_lock);
> -	ret = set_tracer_flag(1 << index, val);
> +	ret = set_tracer_flag(tr, 1 << index, val);
>  	mutex_unlock(&trace_types_lock);
>  
>  	if (ret < 0)
> @@ -4845,40 +5482,41 @@ struct dentry *trace_create_file(const char *name,
>  }
>  
> 
> -static struct dentry *trace_options_init_dentry(void)
> +static struct dentry *trace_options_init_dentry(struct trace_array *tr)
>  {
>  	struct dentry *d_tracer;
> -	static struct dentry *t_options;
>  
> -	if (t_options)
> -		return t_options;
> +	if (tr->options)
> +		return tr->options;
>  
> -	d_tracer = tracing_init_dentry();
> +	d_tracer = tracing_init_dentry_tr(tr);
>  	if (!d_tracer)
>  		return NULL;
>  
> -	t_options = debugfs_create_dir("options", d_tracer);
> -	if (!t_options) {
> +	tr->options = debugfs_create_dir("options", d_tracer);
> +	if (!tr->options) {
>  		pr_warning("Could not create debugfs directory 'options'\n");
>  		return NULL;
>  	}
>  
> -	return t_options;
> +	return tr->options;
>  }
>  
>  static void
> -create_trace_option_file(struct trace_option_dentry *topt,
> +create_trace_option_file(struct trace_array *tr,
> +			 struct trace_option_dentry *topt,
>  			 struct tracer_flags *flags,
>  			 struct tracer_opt *opt)
>  {
>  	struct dentry *t_options;
>  
> -	t_options = trace_options_init_dentry();
> +	t_options = trace_options_init_dentry(tr);
>  	if (!t_options)
>  		return;
>  
>  	topt->flags = flags;
>  	topt->opt = opt;
> +	topt->tr = tr;
>  
>  	topt->entry = trace_create_file(opt->name, 0644, t_options, topt,
>  				    &trace_options_fops);
> @@ -4886,7 +5524,7 @@ create_trace_option_file(struct trace_option_dentry *topt,
>  }
>  
>  static struct trace_option_dentry *
> -create_trace_option_files(struct tracer *tracer)
> +create_trace_option_files(struct trace_array *tr, struct tracer *tracer)
>  {
>  	struct trace_option_dentry *topts;
>  	struct tracer_flags *flags;
> @@ -4911,7 +5549,7 @@ create_trace_option_files(struct tracer *tracer)
>  		return NULL;
>  
>  	for (cnt = 0; opts[cnt].name; cnt++)
> -		create_trace_option_file(&topts[cnt], flags,
> +		create_trace_option_file(tr, &topts[cnt], flags,
>  					 &opts[cnt]);
>  
>  	return topts;
> @@ -4934,11 +5572,12 @@ destroy_trace_option_files(struct trace_option_dentry *topts)
>  }
>  
>  static struct dentry *
> -create_trace_option_core_file(const char *option, long index)
> +create_trace_option_core_file(struct trace_array *tr,
> +			      const char *option, long index)
>  {
>  	struct dentry *t_options;
>  
> -	t_options = trace_options_init_dentry();
> +	t_options = trace_options_init_dentry(tr);
>  	if (!t_options)
>  		return NULL;
>  
> @@ -4946,17 +5585,17 @@ create_trace_option_core_file(const char *option, long index)
>  				    &trace_options_core_fops);
>  }
>  
> -static __init void create_trace_options_dir(void)
> +static __init void create_trace_options_dir(struct trace_array *tr)
>  {
>  	struct dentry *t_options;
>  	int i;
>  
> -	t_options = trace_options_init_dentry();
> +	t_options = trace_options_init_dentry(tr);
>  	if (!t_options)
>  		return;
>  
>  	for (i = 0; trace_options[i]; i++)
> -		create_trace_option_core_file(trace_options[i], i);
> +		create_trace_option_core_file(tr, trace_options[i], i);
>  }
>  
>  static ssize_t
> @@ -4964,7 +5603,7 @@ rb_simple_read(struct file *filp, char __user *ubuf,
>  	       size_t cnt, loff_t *ppos)
>  {
>  	struct trace_array *tr = filp->private_data;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	char buf[64];
>  	int r;
>  
> @@ -4983,7 +5622,7 @@ rb_simple_write(struct file *filp, const char __user *ubuf,
>  		size_t cnt, loff_t *ppos)
>  {
>  	struct trace_array *tr = filp->private_data;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	unsigned long val;
>  	int ret;
>  
> @@ -4995,12 +5634,12 @@ rb_simple_write(struct file *filp, const char __user *ubuf,
>  		mutex_lock(&trace_types_lock);
>  		if (val) {
>  			ring_buffer_record_on(buffer);
> -			if (current_trace->start)
> -				current_trace->start(tr);
> +			if (tr->current_trace->start)
> +				tr->current_trace->start(tr);
>  		} else {
>  			ring_buffer_record_off(buffer);
> -			if (current_trace->stop)
> -				current_trace->stop(tr);
> +			if (tr->current_trace->stop)
> +				tr->current_trace->stop(tr);
>  		}
>  		mutex_unlock(&trace_types_lock);
>  	}
> @@ -5017,23 +5656,308 @@ static const struct file_operations rb_simple_fops = {
>  	.llseek		= default_llseek,
>  };
>  
> +struct dentry *trace_instance_dir;
> +
> +static void
> +init_tracer_debugfs(struct trace_array *tr, struct dentry *d_tracer);
> +
> +static void init_trace_buffers(struct trace_array *tr, struct trace_buffer *buf)
> +{
> +	int cpu;
> +
> +	for_each_tracing_cpu(cpu) {
> +		memset(per_cpu_ptr(buf->data, cpu), 0, sizeof(struct trace_array_cpu));
> +		per_cpu_ptr(buf->data, cpu)->trace_cpu.cpu = cpu;
> +		per_cpu_ptr(buf->data, cpu)->trace_cpu.tr = tr;
> +	}
> +}
> +
> +static int
> +allocate_trace_buffer(struct trace_array *tr, struct trace_buffer *buf, int size)
> +{
> +	enum ring_buffer_flags rb_flags;
> +
> +	rb_flags = trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;
> +
> +	buf->buffer = ring_buffer_alloc(size, rb_flags);
> +	if (!buf->buffer)
> +		return -ENOMEM;
> +
> +	buf->data = alloc_percpu(struct trace_array_cpu);
> +	if (!buf->data) {
> +		ring_buffer_free(buf->buffer);
> +		return -ENOMEM;
> +	}
> +
> +	init_trace_buffers(tr, buf);
> +
> +	/* Allocate the first page for all buffers */
> +	set_buffer_entries(&tr->trace_buffer,
> +			   ring_buffer_size(tr->trace_buffer.buffer, 0));
> +
> +	return 0;
> +}
> +
> +static int allocate_trace_buffers(struct trace_array *tr, int size)
> +{
> +	int ret;
> +
> +	ret = allocate_trace_buffer(tr, &tr->trace_buffer, size);
> +	if (ret)
> +		return ret;
> +
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	ret = allocate_trace_buffer(tr, &tr->max_buffer,
> +				    allocate_snapshot ? size : 1);
> +	if (WARN_ON(ret)) {
> +		ring_buffer_free(tr->trace_buffer.buffer);
> +		free_percpu(tr->trace_buffer.data);
> +		return -ENOMEM;
> +	}
> +	tr->allocated_snapshot = allocate_snapshot;
> +
> +	/*
> +	 * Only the top level trace array gets its snapshot allocated
> +	 * from the kernel command line.
> +	 */
> +	allocate_snapshot = false;
> +#endif
> +	return 0;
> +}
> +
> +static int new_instance_create(const char *name)
> +{
> +	struct trace_array *tr;
> +	int ret;
> +
> +	mutex_lock(&trace_types_lock);
> +
> +	ret = -EEXIST;
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
> +		if (tr->name && strcmp(tr->name, name) == 0)
> +			goto out_unlock;
> +	}
> +
> +	ret = -ENOMEM;
> +	tr = kzalloc(sizeof(*tr), GFP_KERNEL);
> +	if (!tr)
> +		goto out_unlock;
> +
> +	tr->name = kstrdup(name, GFP_KERNEL);
> +	if (!tr->name)
> +		goto out_free_tr;
> +
> +	raw_spin_lock_init(&tr->start_lock);
> +
> +	tr->current_trace = &nop_trace;
> +
> +	INIT_LIST_HEAD(&tr->systems);
> +	INIT_LIST_HEAD(&tr->events);
> +
> +	if (allocate_trace_buffers(tr, trace_buf_size) < 0)
> +		goto out_free_tr;
> +
> +	/* Holder for file callbacks */
> +	tr->trace_cpu.cpu = RING_BUFFER_ALL_CPUS;
> +	tr->trace_cpu.tr = tr;
> +
> +	tr->dir = debugfs_create_dir(name, trace_instance_dir);
> +	if (!tr->dir)
> +		goto out_free_tr;
> +
> +	ret = event_trace_add_tracer(tr->dir, tr);
> +	if (ret)
> +		goto out_free_tr;
> +
> +	init_tracer_debugfs(tr, tr->dir);
> +
> +	list_add(&tr->list, &ftrace_trace_arrays);
> +
> +	mutex_unlock(&trace_types_lock);
> +
> +	return 0;
> +
> + out_free_tr:
> +	if (tr->trace_buffer.buffer)
> +		ring_buffer_free(tr->trace_buffer.buffer);
> +	kfree(tr->name);
> +	kfree(tr);
> +
> + out_unlock:
> +	mutex_unlock(&trace_types_lock);
> +
> +	return ret;
> +
> +}
> +
> +static int instance_delete(const char *name)
> +{
> +	struct trace_array *tr;
> +	int found = 0;
> +	int ret;
> +
> +	mutex_lock(&trace_types_lock);
> +
> +	ret = -ENODEV;
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
> +		if (tr->name && strcmp(tr->name, name) == 0) {
> +			found = 1;
> +			break;
> +		}
> +	}
> +	if (!found)
> +		goto out_unlock;
> +
> +	ret = -EBUSY;
> +	if (tr->ref)
> +		goto out_unlock;
> +
> +	list_del(&tr->list);
> +
> +	event_trace_del_tracer(tr);
> +	debugfs_remove_recursive(tr->dir);
> +	free_percpu(tr->trace_buffer.data);
> +	ring_buffer_free(tr->trace_buffer.buffer);
> +
> +	kfree(tr->name);
> +	kfree(tr);
> +
> +	ret = 0;
> +
> + out_unlock:
> +	mutex_unlock(&trace_types_lock);
> +
> +	return ret;
> +}
> +
> +static int instance_mkdir (struct inode *inode, struct dentry *dentry, umode_t mode)
> +{
> +	struct dentry *parent;
> +	int ret;
> +
> +	/* Paranoid: Make sure the parent is the "instances" directory */
> +	parent = hlist_entry(inode->i_dentry.first, struct dentry, d_alias);
> +	if (WARN_ON_ONCE(parent != trace_instance_dir))
> +		return -ENOENT;
> +
> +	/*
> +	 * The inode mutex is locked, but debugfs_create_dir() will also
> +	 * take the mutex. As the instances directory can not be destroyed
> +	 * or changed in any other way, it is safe to unlock it, and
> +	 * let the dentry try. If two users try to make the same dir at
> +	 * the same time, then the new_instance_create() will determine the
> +	 * winner.
> +	 */
> +	mutex_unlock(&inode->i_mutex);
> +
> +	ret = new_instance_create(dentry->d_iname);
> +
> +	mutex_lock(&inode->i_mutex);
> +
> +	return ret;
> +}
> +
> +static int instance_rmdir(struct inode *inode, struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +	int ret;
> +
> +	/* Paranoid: Make sure the parent is the "instances" directory */
> +	parent = hlist_entry(inode->i_dentry.first, struct dentry, d_alias);
> +	if (WARN_ON_ONCE(parent != trace_instance_dir))
> +		return -ENOENT;
> +
> +	/* The caller did a dget() on dentry */
> +	mutex_unlock(&dentry->d_inode->i_mutex);
> +
> +	/*
> +	 * The inode mutex is locked, but debugfs_create_dir() will also
> +	 * take the mutex. As the instances directory can not be destroyed
> +	 * or changed in any other way, it is safe to unlock it, and
> +	 * let the dentry try. If two users try to make the same dir at
> +	 * the same time, then the instance_delete() will determine the
> +	 * winner.
> +	 */
> +	mutex_unlock(&inode->i_mutex);
> +
> +	ret = instance_delete(dentry->d_iname);
> +
> +	mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
> +	mutex_lock(&dentry->d_inode->i_mutex);
> +
> +	return ret;
> +}
> +
> +static const struct inode_operations instance_dir_inode_operations = {
> +	.lookup		= simple_lookup,
> +	.mkdir		= instance_mkdir,
> +	.rmdir		= instance_rmdir,
> +};
> +
> +static __init void create_trace_instances(struct dentry *d_tracer)
> +{
> +	trace_instance_dir = debugfs_create_dir("instances", d_tracer);
> +	if (WARN_ON(!trace_instance_dir))
> +		return;
> +
> +	/* Hijack the dir inode operations, to allow mkdir */
> +	trace_instance_dir->d_inode->i_op = &instance_dir_inode_operations;
> +}
> +
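
Because the instances directory's inode ops are overridden here, creating
and destroying a whole trace_array is just mkdir(2)/rmdir(2) from user
space. A sketch (paths assume the standard debugfs mount; sched_switch is
only an example event that happens to exist on most configs):

    /* sketch: create an instance, trace one event into it, tear it down */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define INST "/sys/kernel/debug/tracing/instances/foo"

    static void echo(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);

        if (fd >= 0) {
            write(fd, val, 1);
            close(fd);
        }
    }

    int main(void)
    {
        if (mkdir(INST, 0755) < 0) {        /* -> new_instance_create("foo") */
            perror("mkdir");
            return 1;
        }

        echo(INST "/events/sched/sched_switch/enable", "1");
        sleep(1);                           /* events land in foo's own buffer */
        echo(INST "/events/sched/sched_switch/enable", "0");

        if (rmdir(INST) < 0)                /* -> instance_delete("foo") */
            perror("rmdir");
        return 0;
    }

Note that rmdir returns -EBUSY for as long as something still holds a
reference to the instance's buffers (tr->ref above).
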
> +static void
> +init_tracer_debugfs(struct trace_array *tr, struct dentry *d_tracer)
> +{
> +	int cpu;
> +
> +	trace_create_file("trace_options", 0644, d_tracer,
> +			  tr, &tracing_iter_fops);
> +
> +	trace_create_file("trace", 0644, d_tracer,
> +			(void *)&tr->trace_cpu, &tracing_fops);
> +
> +	trace_create_file("trace_pipe", 0444, d_tracer,
> +			(void *)&tr->trace_cpu, &tracing_pipe_fops);
> +
> +	trace_create_file("buffer_size_kb", 0644, d_tracer,
> +			(void *)&tr->trace_cpu, &tracing_entries_fops);
> +
> +	trace_create_file("buffer_total_size_kb", 0444, d_tracer,
> +			  tr, &tracing_total_entries_fops);
> +
> +	trace_create_file("free_buffer", 0644, d_tracer,
> +			  tr, &tracing_free_buffer_fops);
> +
> +	trace_create_file("trace_marker", 0220, d_tracer,
> +			  tr, &tracing_mark_fops);
> +
> +	trace_create_file("trace_clock", 0644, d_tracer, tr,
> +			  &trace_clock_fops);
> +
> +	trace_create_file("tracing_on", 0644, d_tracer,
> +			    tr, &rb_simple_fops);
> +
> +#ifdef CONFIG_TRACER_SNAPSHOT
> +	trace_create_file("snapshot", 0644, d_tracer,
> +			  (void *)&tr->trace_cpu, &snapshot_fops);
> +#endif
> +
> +	for_each_tracing_cpu(cpu)
> +		tracing_init_debugfs_percpu(tr, cpu);
> +
> +}
> +
>  static __init int tracer_init_debugfs(void)
>  {
>  	struct dentry *d_tracer;
> -	int cpu;
>  
>  	trace_access_lock_init();
>  
>  	d_tracer = tracing_init_dentry();
>  
> -	trace_create_file("trace_options", 0644, d_tracer,
> -			NULL, &tracing_iter_fops);
> +	init_tracer_debugfs(&global_trace, d_tracer);
>  
>  	trace_create_file("tracing_cpumask", 0644, d_tracer,
> -			NULL, &tracing_cpumask_fops);
> -
> -	trace_create_file("trace", 0644, d_tracer,
> -			(void *) TRACE_PIPE_ALL_CPU, &tracing_fops);
> +			&global_trace, &tracing_cpumask_fops);
>  
>  	trace_create_file("available_tracers", 0444, d_tracer,
>  			&global_trace, &show_traces_fops);
> @@ -5052,44 +5976,17 @@ static __init int tracer_init_debugfs(void)
>  	trace_create_file("README", 0444, d_tracer,
>  			NULL, &tracing_readme_fops);
>  
> -	trace_create_file("trace_pipe", 0444, d_tracer,
> -			(void *) TRACE_PIPE_ALL_CPU, &tracing_pipe_fops);
> -
> -	trace_create_file("buffer_size_kb", 0644, d_tracer,
> -			(void *) RING_BUFFER_ALL_CPUS, &tracing_entries_fops);
> -
> -	trace_create_file("buffer_total_size_kb", 0444, d_tracer,
> -			&global_trace, &tracing_total_entries_fops);
> -
> -	trace_create_file("free_buffer", 0644, d_tracer,
> -			&global_trace, &tracing_free_buffer_fops);
> -
> -	trace_create_file("trace_marker", 0220, d_tracer,
> -			NULL, &tracing_mark_fops);
> -
>  	trace_create_file("saved_cmdlines", 0444, d_tracer,
>  			NULL, &tracing_saved_cmdlines_fops);
>  
> -	trace_create_file("trace_clock", 0644, d_tracer, NULL,
> -			  &trace_clock_fops);
> -
> -	trace_create_file("tracing_on", 0644, d_tracer,
> -			    &global_trace, &rb_simple_fops);
> -
>  #ifdef CONFIG_DYNAMIC_FTRACE
>  	trace_create_file("dyn_ftrace_total_info", 0444, d_tracer,
>  			&ftrace_update_tot_cnt, &tracing_dyn_info_fops);
>  #endif
>  
> -#ifdef CONFIG_TRACER_SNAPSHOT
> -	trace_create_file("snapshot", 0644, d_tracer,
> -			  (void *) TRACE_PIPE_ALL_CPU, &snapshot_fops);
> -#endif
> -
> -	create_trace_options_dir();
> +	create_trace_instances(d_tracer);
>  
> -	for_each_tracing_cpu(cpu)
> -		tracing_init_debugfs_percpu(cpu);
> +	create_trace_options_dir(&global_trace);
>  
>  	return 0;
>  }
> @@ -5145,8 +6042,8 @@ void
>  trace_printk_seq(struct trace_seq *s)
>  {
>  	/* Probably should print a warning here. */
> -	if (s->len >= 1000)
> -		s->len = 1000;
> +	if (s->len >= TRACE_MAX_PRINT)
> +		s->len = TRACE_MAX_PRINT;
>  
>  	/* should be zero ended, but we are paranoid. */
>  	s->buffer[s->len] = 0;
> @@ -5159,46 +6056,43 @@ trace_printk_seq(struct trace_seq *s)
>  void trace_init_global_iter(struct trace_iterator *iter)
>  {
>  	iter->tr = &global_trace;
> -	iter->trace = current_trace;
> -	iter->cpu_file = TRACE_PIPE_ALL_CPU;
> +	iter->trace = iter->tr->current_trace;
> +	iter->cpu_file = RING_BUFFER_ALL_CPUS;
> +	iter->trace_buffer = &global_trace.trace_buffer;
>  }
>  
> -static void
> -__ftrace_dump(bool disable_tracing, enum ftrace_dump_mode oops_dump_mode)
> +void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
>  {
> -	static arch_spinlock_t ftrace_dump_lock =
> -		(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
>  	/* use static because iter can be a bit big for the stack */
>  	static struct trace_iterator iter;
> +	static atomic_t dump_running;
>  	unsigned int old_userobj;
> -	static int dump_ran;
>  	unsigned long flags;
>  	int cnt = 0, cpu;
>  
> -	/* only one dump */
> -	local_irq_save(flags);
> -	arch_spin_lock(&ftrace_dump_lock);
> -	if (dump_ran)
> -		goto out;
> -
> -	dump_ran = 1;
> +	/* Only allow one dump user at a time. */
> +	if (atomic_inc_return(&dump_running) != 1) {
> +		atomic_dec(&dump_running);
> +		return;
> +	}
>  
> +	/*
> +	 * Always turn off tracing when we dump.
> +	 * We don't need to show trace output of what happens
> +	 * between multiple crashes.
> +	 *
> +	 * If the user does a sysrq-z, then they can re-enable
> +	 * tracing with echo 1 > tracing_on.
> +	 */
>  	tracing_off();
>  
> -	/* Did function tracer already get disabled? */
> -	if (ftrace_is_dead()) {
> -		printk("# WARNING: FUNCTION TRACING IS CORRUPTED\n");
> -		printk("#          MAY BE MISSING FUNCTION EVENTS\n");
> -	}
> -
> -	if (disable_tracing)
> -		ftrace_kill();
> +	local_irq_save(flags);
>  
>  	/* Simulate the iterator */
>  	trace_init_global_iter(&iter);
>  
>  	for_each_tracing_cpu(cpu) {
> -		atomic_inc(&iter.tr->data[cpu]->disabled);
> +		atomic_inc(&per_cpu_ptr(iter.tr->trace_buffer.data, cpu)->disabled);
>  	}
>  
>  	old_userobj = trace_flags & TRACE_ITER_SYM_USEROBJ;
> @@ -5208,7 +6102,7 @@ __ftrace_dump(bool disable_tracing, enum ftrace_dump_mode oops_dump_mode)
>  
>  	switch (oops_dump_mode) {
>  	case DUMP_ALL:
> -		iter.cpu_file = TRACE_PIPE_ALL_CPU;
> +		iter.cpu_file = RING_BUFFER_ALL_CPUS;
>  		break;
>  	case DUMP_ORIG:
>  		iter.cpu_file = raw_smp_processor_id();
> @@ -5217,11 +6111,17 @@ __ftrace_dump(bool disable_tracing, enum ftrace_dump_mode oops_dump_mode)
>  		goto out_enable;
>  	default:
>  		printk(KERN_TRACE "Bad dumping mode, switching to all CPUs dump\n");
> -		iter.cpu_file = TRACE_PIPE_ALL_CPU;
> +		iter.cpu_file = RING_BUFFER_ALL_CPUS;
>  	}
>  
>  	printk(KERN_TRACE "Dumping ftrace buffer:\n");
>  
> +	/* Did function tracer already get disabled? */
> +	if (ftrace_is_dead()) {
> +		printk("# WARNING: FUNCTION TRACING IS CORRUPTED\n");
> +		printk("#          MAY BE MISSING FUNCTION EVENTS\n");
> +	}
> +
>  	/*
>  	 * We need to stop all tracing on all CPUs to read
>  	 * the next buffer. This is a bit expensive, but is
> @@ -5261,33 +6161,19 @@ __ftrace_dump(bool disable_tracing, enum ftrace_dump_mode oops_dump_mode)
>  		printk(KERN_TRACE "---------------------------------\n");
>  
>   out_enable:
> -	/* Re-enable tracing if requested */
> -	if (!disable_tracing) {
> -		trace_flags |= old_userobj;
> +	trace_flags |= old_userobj;
>  
> -		for_each_tracing_cpu(cpu) {
> -			atomic_dec(&iter.tr->data[cpu]->disabled);
> -		}
> -		tracing_on();
> +	for_each_tracing_cpu(cpu) {
> +		atomic_dec(&per_cpu_ptr(iter.trace_buffer->data, cpu)->disabled);
>  	}
> -
> - out:
> -	arch_spin_unlock(&ftrace_dump_lock);
> +	atomic_dec(&dump_running);
>  	local_irq_restore(flags);
>  }
> -
> -/* By default: disable tracing after the dump */
> -void ftrace_dump(enum ftrace_dump_mode oops_dump_mode)
> -{
> -	__ftrace_dump(true, oops_dump_mode);
> -}
>  EXPORT_SYMBOL_GPL(ftrace_dump);
>  
>  __init static int tracer_alloc_buffers(void)
>  {
>  	int ring_buf_size;
> -	enum ring_buffer_flags rb_flags;
> -	int i;
>  	int ret = -ENOMEM;
>  
> 
> @@ -5308,49 +6194,27 @@ __init static int tracer_alloc_buffers(void)
>  	else
>  		ring_buf_size = 1;
>  
> -	rb_flags = trace_flags & TRACE_ITER_OVERWRITE ? RB_FL_OVERWRITE : 0;
> -
>  	cpumask_copy(tracing_buffer_mask, cpu_possible_mask);
>  	cpumask_copy(tracing_cpumask, cpu_all_mask);
>  
> +	raw_spin_lock_init(&global_trace.start_lock);
> +
>  	/* TODO: make the number of buffers hot pluggable with CPUS */
> -	global_trace.buffer = ring_buffer_alloc(ring_buf_size, rb_flags);
> -	if (!global_trace.buffer) {
> +	if (allocate_trace_buffers(&global_trace, ring_buf_size) < 0) {
>  		printk(KERN_ERR "tracer: failed to allocate ring buffer!\n");
>  		WARN_ON(1);
>  		goto out_free_cpumask;
>  	}
> +
>  	if (global_trace.buffer_disabled)
>  		tracing_off();
>  
> -
> -#ifdef CONFIG_TRACER_MAX_TRACE
> -	max_tr.buffer = ring_buffer_alloc(1, rb_flags);
> -	if (!max_tr.buffer) {
> -		printk(KERN_ERR "tracer: failed to allocate max ring buffer!\n");
> -		WARN_ON(1);
> -		ring_buffer_free(global_trace.buffer);
> -		goto out_free_cpumask;
> -	}
> -#endif
> -
> -	/* Allocate the first page for all buffers */
> -	for_each_tracing_cpu(i) {
> -		global_trace.data[i] = &per_cpu(global_trace_cpu, i);
> -		max_tr.data[i] = &per_cpu(max_tr_data, i);
> -	}
> -
> -	set_buffer_entries(&global_trace,
> -			   ring_buffer_size(global_trace.buffer, 0));
> -#ifdef CONFIG_TRACER_MAX_TRACE
> -	set_buffer_entries(&max_tr, 1);
> -#endif
> -
>  	trace_init_cmdlines();
> -	init_irq_work(&trace_work_wakeup, trace_wake_up);
>  
>  	register_tracer(&nop_trace);
>  
> +	global_trace.current_trace = &nop_trace;
> +
>  	/* All seems OK, enable tracing */
>  	tracing_disabled = 0;
>  
> @@ -5359,16 +6223,32 @@ __init static int tracer_alloc_buffers(void)
>  
>  	register_die_notifier(&trace_die_notifier);
>  
> +	global_trace.flags = TRACE_ARRAY_FL_GLOBAL;
> +
> +	/* Holder for file callbacks */
> +	global_trace.trace_cpu.cpu = RING_BUFFER_ALL_CPUS;
> +	global_trace.trace_cpu.tr = &global_trace;
> +
> +	INIT_LIST_HEAD(&global_trace.systems);
> +	INIT_LIST_HEAD(&global_trace.events);
> +	list_add(&global_trace.list, &ftrace_trace_arrays);
> +
>  	while (trace_boot_options) {
>  		char *option;
>  
>  		option = strsep(&trace_boot_options, ",");
> -		trace_set_options(option);
> +		trace_set_options(&global_trace, option);
>  	}
>  
> +	register_snapshot_cmd();
> +
>  	return 0;
>  
>  out_free_cpumask:
> +	free_percpu(global_trace.trace_buffer.data);
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	free_percpu(global_trace.max_buffer.data);
> +#endif
>  	free_cpumask_var(tracing_cpumask);
>  out_free_buffer_mask:
>  	free_cpumask_var(tracing_buffer_mask);
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 2081971..9e01458 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -13,6 +13,11 @@
>  #include <linux/trace_seq.h>
>  #include <linux/ftrace_event.h>
>  
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +#include <asm/unistd.h>		/* For NR_SYSCALLS	     */
> +#include <asm/syscall.h>	/* some archs define it here */
> +#endif
> +
>  enum trace_type {
>  	__TRACE_FIRST_TYPE = 0,
>  
> @@ -29,6 +34,7 @@ enum trace_type {
>  	TRACE_GRAPH_ENT,
>  	TRACE_USER_STACK,
>  	TRACE_BLK,
> +	TRACE_BPUTS,
>  
>  	__TRACE_LAST_TYPE,
>  };
> @@ -127,12 +133,21 @@ enum trace_flag_type {
>  
>  #define TRACE_BUF_SIZE		1024
>  
> +struct trace_array;
> +
> +struct trace_cpu {
> +	struct trace_array	*tr;
> +	struct dentry		*dir;
> +	int			cpu;
> +};
> +
>  /*
>   * The CPU trace array - it consists of thousands of trace entries
>   * plus some other descriptor data: (for example which task started
>   * the trace, etc.)
>   */
>  struct trace_array_cpu {
> +	struct trace_cpu	trace_cpu;
>  	atomic_t		disabled;
>  	void			*buffer_page;	/* ring buffer spare */
>  
> @@ -151,20 +166,83 @@ struct trace_array_cpu {
>  	char			comm[TASK_COMM_LEN];
>  };
>  
> +struct tracer;
> +
> +struct trace_buffer {
> +	struct trace_array		*tr;
> +	struct ring_buffer		*buffer;
> +	struct trace_array_cpu __percpu	*data;
> +	cycle_t				time_start;
> +	int				cpu;
> +};
> +
>  /*
>   * The trace array - an array of per-CPU trace arrays. This is the
>   * highest level data structure that individual tracers deal with.
>   * They have on/off state as well:
>   */
>  struct trace_array {
> -	struct ring_buffer	*buffer;
> -	int			cpu;
> +	struct list_head	list;
> +	char			*name;
> +	struct trace_buffer	trace_buffer;
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	/*
> +	 * The max_buffer is used to snapshot the trace when a maximum
> +	 * latency is reached, or when the user initiates a snapshot.
> +	 * Some tracers will use this to store a maximum trace while
> +	 * they continue examining live traces.
> +	 *
> +	 * The buffers for the max_buffer are set up the same as the trace_buffer.
> +	 * When a snapshot is taken, the buffer of the max_buffer is swapped
> +	 * with the buffer of the trace_buffer and the buffers are reset for
> +	 * the trace_buffer so the tracing can continue.
> +	 */
> +	struct trace_buffer	max_buffer;
> +	bool			allocated_snapshot;
> +#endif
>  	int			buffer_disabled;
> -	cycle_t			time_start;
> +	struct trace_cpu	trace_cpu;	/* place holder */
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +	int			sys_refcount_enter;
> +	int			sys_refcount_exit;
> +	DECLARE_BITMAP(enabled_enter_syscalls, NR_syscalls);
> +	DECLARE_BITMAP(enabled_exit_syscalls, NR_syscalls);
> +#endif
> +	int			stop_count;
> +	int			clock_id;
> +	struct tracer		*current_trace;
> +	unsigned int		flags;
> +	raw_spinlock_t		start_lock;
> +	struct dentry		*dir;
> +	struct dentry		*options;
> +	struct dentry		*percpu_dir;
> +	struct dentry		*event_dir;
> +	struct list_head	systems;
> +	struct list_head	events;
>  	struct task_struct	*waiter;
> -	struct trace_array_cpu	*data[NR_CPUS];
> +	int			ref;
>  };
>  
> +enum {
> +	TRACE_ARRAY_FL_GLOBAL	= (1 << 0)
> +};
> +
> +extern struct list_head ftrace_trace_arrays;
> +
> +/*
> + * The global tracer (top) should be the first trace array added,
> + * but we check the flag anyway.
> + */
> +static inline struct trace_array *top_trace_array(void)
> +{
> +	struct trace_array *tr;
> +
> +	tr = list_entry(ftrace_trace_arrays.prev,
> +			typeof(*tr), list);
> +	WARN_ON(!(tr->flags & TRACE_ARRAY_FL_GLOBAL));
> +	return tr;
> +}
> +
>  #define FTRACE_CMP_TYPE(var, type) \
>  	__builtin_types_compatible_p(typeof(var), type *)
>  
> @@ -200,6 +278,7 @@ extern void __ftrace_bad_type(void);
>  		IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
>  		IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT);	\
>  		IF_ASSIGN(var, ent, struct bprint_entry, TRACE_BPRINT);	\
> +		IF_ASSIGN(var, ent, struct bputs_entry, TRACE_BPUTS);	\
>  		IF_ASSIGN(var, ent, struct trace_mmiotrace_rw,		\
>  			  TRACE_MMIO_RW);				\
>  		IF_ASSIGN(var, ent, struct trace_mmiotrace_map,		\
> @@ -289,9 +368,10 @@ struct tracer {
>  	struct tracer		*next;
>  	struct tracer_flags	*flags;
>  	bool			print_max;
> -	bool			use_max_tr;
> -	bool			allocated_snapshot;
>  	bool			enabled;
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +	bool			use_max_tr;
> +#endif
>  };
>  
> 
> @@ -427,8 +507,6 @@ static __always_inline void trace_clear_recursion(int bit)
>  	current->trace_recursion = val;
>  }
>  
> -#define TRACE_PIPE_ALL_CPU	-1
> -
>  static inline struct ring_buffer_iter *
>  trace_buffer_iter(struct trace_iterator *iter, int cpu)
>  {
> @@ -439,10 +517,10 @@ trace_buffer_iter(struct trace_iterator *iter, int cpu)
>  
>  int tracer_init(struct tracer *t, struct trace_array *tr);
>  int tracing_is_enabled(void);
> -void tracing_reset(struct trace_array *tr, int cpu);
> -void tracing_reset_online_cpus(struct trace_array *tr);
> +void tracing_reset(struct trace_buffer *buf, int cpu);
> +void tracing_reset_online_cpus(struct trace_buffer *buf);
>  void tracing_reset_current(int cpu);
> -void tracing_reset_current_online_cpus(void);
> +void tracing_reset_all_online_cpus(void);
>  int tracing_open_generic(struct inode *inode, struct file *filp);
>  struct dentry *trace_create_file(const char *name,
>  				 umode_t mode,
> @@ -450,6 +528,7 @@ struct dentry *trace_create_file(const char *name,
>  				 void *data,
>  				 const struct file_operations *fops);
>  
> +struct dentry *tracing_init_dentry_tr(struct trace_array *tr);
>  struct dentry *tracing_init_dentry(void);
>  
>  struct ring_buffer_event;
> @@ -583,7 +662,7 @@ extern int DYN_FTRACE_TEST_NAME(void);
>  #define DYN_FTRACE_TEST_NAME2 trace_selftest_dynamic_test_func2
>  extern int DYN_FTRACE_TEST_NAME2(void);
>  
> -extern int ring_buffer_expanded;
> +extern bool ring_buffer_expanded;
>  extern bool tracing_selftest_disabled;
>  DECLARE_PER_CPU(int, ftrace_cpu_disabled);
>  
> @@ -619,6 +698,8 @@ trace_array_vprintk(struct trace_array *tr,
>  		    unsigned long ip, const char *fmt, va_list args);
>  int trace_array_printk(struct trace_array *tr,
>  		       unsigned long ip, const char *fmt, ...);
> +int trace_array_printk_buf(struct ring_buffer *buffer,
> +			   unsigned long ip, const char *fmt, ...);
>  void trace_printk_seq(struct trace_seq *s);
>  enum print_line_t print_trace_line(struct trace_iterator *iter);
>  
> @@ -786,6 +867,7 @@ enum trace_iterator_flags {
>  	TRACE_ITER_STOP_ON_FREE		= 0x400000,
>  	TRACE_ITER_IRQ_INFO		= 0x800000,
>  	TRACE_ITER_MARKERS		= 0x1000000,
> +	TRACE_ITER_FUNCTION		= 0x2000000,
>  };
>  
>  /*
> @@ -832,8 +914,8 @@ enum {
>  
>  struct ftrace_event_field {
>  	struct list_head	link;
> -	char			*name;
> -	char			*type;
> +	const char		*name;
> +	const char		*type;
>  	int			filter_type;
>  	int			offset;
>  	int			size;
> @@ -851,12 +933,19 @@ struct event_filter {
>  struct event_subsystem {
>  	struct list_head	list;
>  	const char		*name;
> -	struct dentry		*entry;
>  	struct event_filter	*filter;
> -	int			nr_events;
>  	int			ref_count;
>  };
>  
> +struct ftrace_subsystem_dir {
> +	struct list_head		list;
> +	struct event_subsystem		*subsystem;
> +	struct trace_array		*tr;
> +	struct dentry			*entry;
> +	int				ref_count;
> +	int				nr_events;
> +};
> +
>  #define FILTER_PRED_INVALID	((unsigned short)-1)
>  #define FILTER_PRED_IS_RIGHT	(1 << 15)
>  #define FILTER_PRED_FOLD	(1 << 15)
> @@ -906,22 +995,20 @@ struct filter_pred {
>  	unsigned short		right;
>  };
>  
> -extern struct list_head ftrace_common_fields;
> -
>  extern enum regex_type
>  filter_parse_regex(char *buff, int len, char **search, int *not);
>  extern void print_event_filter(struct ftrace_event_call *call,
>  			       struct trace_seq *s);
>  extern int apply_event_filter(struct ftrace_event_call *call,
>  			      char *filter_string);
> -extern int apply_subsystem_event_filter(struct event_subsystem *system,
> +extern int apply_subsystem_event_filter(struct ftrace_subsystem_dir *dir,
>  					char *filter_string);
>  extern void print_subsystem_event_filter(struct event_subsystem *system,
>  					 struct trace_seq *s);
>  extern int filter_assign_type(const char *type);
>  
> -struct list_head *
> -trace_get_fields(struct ftrace_event_call *event_call);
> +struct ftrace_event_field *
> +trace_find_event_field(struct ftrace_event_call *call, char *name);
>  
>  static inline int
>  filter_check_discard(struct ftrace_event_call *call, void *rec,
> @@ -938,6 +1025,8 @@ filter_check_discard(struct ftrace_event_call *call, void *rec,
>  }
>  
>  extern void trace_event_enable_cmd_record(bool enable);
> +extern int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr);
> +extern int event_trace_del_tracer(struct trace_array *tr);
>  
>  extern struct mutex event_mutex;
>  extern struct list_head ftrace_events;
> @@ -948,7 +1037,18 @@ extern const char *__stop___trace_bprintk_fmt[];
>  void trace_printk_init_buffers(void);
>  void trace_printk_start_comm(void);
>  int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set);
> -int set_tracer_flag(unsigned int mask, int enabled);
> +int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled);
> +
> +/*
> + * Normal trace_printk() and friends allocate special buffers
> + * to do the manipulation, as well as save the print formats
> + * into sections to display. But the trace infrastructure wants
> + * to use these without the added overhead, at the price of being
> + * a bit slower (used mainly for warnings, where we don't care
> + * about performance). internal_trace_puts() is for such
> + * a purpose.
> + */
> +#define internal_trace_puts(str) __trace_puts(_THIS_IP_, str, strlen(str))
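
(For illustration, a hypothetical call site, not in this patch: any
string literal inside the tracing core can be recorded this way, with
the caller's ip and no format handling. "tracing_is_broken" is a
made-up condition.)

  	if (tracing_is_broken)
  		internal_trace_puts("*** tracing state corrupted ***\n");
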
>  
>  #undef FTRACE_ENTRY
>  #define FTRACE_ENTRY(call, struct_name, id, tstruct, print, filter)	\
> diff --git a/kernel/trace/trace_branch.c b/kernel/trace/trace_branch.c
> index 95e9684..d594da0 100644
> --- a/kernel/trace/trace_branch.c
> +++ b/kernel/trace/trace_branch.c
> @@ -32,6 +32,7 @@ probe_likely_condition(struct ftrace_branch_data *f, int val, int expect)
>  {
>  	struct ftrace_event_call *call = &event_branch;
>  	struct trace_array *tr = branch_tracer;
> +	struct trace_array_cpu *data;
>  	struct ring_buffer_event *event;
>  	struct trace_branch *entry;
>  	struct ring_buffer *buffer;
> @@ -51,11 +52,12 @@ probe_likely_condition(struct ftrace_branch_data *f, int val, int expect)
>  
>  	local_irq_save(flags);
>  	cpu = raw_smp_processor_id();
> -	if (atomic_inc_return(&tr->data[cpu]->disabled) != 1)
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
> +	if (atomic_inc_return(&data->disabled) != 1)
>  		goto out;
>  
>  	pc = preempt_count();
> -	buffer = tr->buffer;
> +	buffer = tr->trace_buffer.buffer;
>  	event = trace_buffer_lock_reserve(buffer, TRACE_BRANCH,
>  					  sizeof(*entry), flags, pc);
>  	if (!event)
> @@ -80,7 +82,7 @@ probe_likely_condition(struct ftrace_branch_data *f, int val, int expect)
>  		__buffer_unlock_commit(buffer, event);
>  
>   out:
> -	atomic_dec(&tr->data[cpu]->disabled);
> +	atomic_dec(&data->disabled);
>  	local_irq_restore(flags);
>  }
>  
> diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c
> index aa8f5f4..26dc348 100644
> --- a/kernel/trace/trace_clock.c
> +++ b/kernel/trace/trace_clock.c
> @@ -57,6 +57,16 @@ u64 notrace trace_clock(void)
>  	return local_clock();
>  }
>  
> +/*
> + * trace_clock_jiffies(): Simply use jiffies as a clock counter.
> + */
> +u64 notrace trace_clock_jiffies(void)
> +{
> +	u64 jiffy = jiffies - INITIAL_JIFFIES;
> +
> +	/* Return nsecs */
> +	return (u64)jiffies_to_usecs(jiffy) * 1000ULL;
> +}
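
(Quick sanity check of the conversion, for illustration: with HZ=100, a
delta of 3 jiffies gives jiffies_to_usecs(3) = 30,000 us, so the clock
reads 30,000,000 ns, i.e. three 10 ms ticks.)
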
>  
>  /*
>   * trace_clock_global(): special globally coherent trace clock
> diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
> index 4108e12..e2d027a 100644
> --- a/kernel/trace/trace_entries.h
> +++ b/kernel/trace/trace_entries.h
> @@ -223,8 +223,8 @@ FTRACE_ENTRY(bprint, bprint_entry,
>  		__dynamic_array(	u32,	buf	)
>  	),
>  
> -	F_printk("%08lx fmt:%p",
> -		 __entry->ip, __entry->fmt),
> +	F_printk("%pf: %s",
> +		 (void *)__entry->ip, __entry->fmt),
>  
>  	FILTER_OTHER
>  );
> @@ -238,8 +238,23 @@ FTRACE_ENTRY(print, print_entry,
>  		__dynamic_array(	char,	buf	)
>  	),
>  
> -	F_printk("%08lx %s",
> -		 __entry->ip, __entry->buf),
> +	F_printk("%pf: %s",
> +		 (void *)__entry->ip, __entry->buf),
> +
> +	FILTER_OTHER
> +);
> +
> +FTRACE_ENTRY(bputs, bputs_entry,
> +
> +	TRACE_BPUTS,
> +
> +	F_STRUCT(
> +		__field(	unsigned long,	ip	)
> +		__field(	const char *,	str	)
> +	),
> +
> +	F_printk("%pf: %s",
> +		 (void *)__entry->ip, __entry->str),
>  
>  	FILTER_OTHER
>  );
> diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
> index 57e9b28..53582e9 100644
> --- a/kernel/trace/trace_events.c
> +++ b/kernel/trace/trace_events.c
> @@ -34,9 +34,27 @@ char event_storage[EVENT_STORAGE_SIZE];
>  EXPORT_SYMBOL_GPL(event_storage);
>  
>  LIST_HEAD(ftrace_events);
> -LIST_HEAD(ftrace_common_fields);
> +static LIST_HEAD(ftrace_common_fields);
>  
> -struct list_head *
> +#define GFP_TRACE (GFP_KERNEL | __GFP_ZERO)
> +
> +static struct kmem_cache *field_cachep;
> +static struct kmem_cache *file_cachep;
> +
> +/* Double loops, do not use break, only goto's work */
> +#define do_for_each_event_file(tr, file)			\
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {	\
> +		list_for_each_entry(file, &tr->events, list)
> +
> +#define do_for_each_event_file_safe(tr, file)			\
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {	\
> +		struct ftrace_event_file *___n;				\
> +		list_for_each_entry_safe(file, ___n, &tr->events, list)
> +
> +#define while_for_each_event_file()		\
> +	}
> +
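
(To make the "no break" rule concrete, a hypothetical user of these
macros would look like the sketch below; "wanted" is a made-up
predicate. A break in the body only leaves the inner per-tr loop and
moves on to the next trace_array, so escaping both loops needs a goto,
the same pattern subsystem_open() further down open-codes with its
"goto exit_loop".)

  	struct trace_array *tr;
  	struct ftrace_event_file *file;

  	do_for_each_event_file(tr, file) {
  		if (wanted(file))
  			goto found;	/* leaves both loops */
  		/* a break here would only skip to the next trace_array */
  	} while_for_each_event_file();
  	return;
   found:
  	/* use tr and file here */
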
> +static struct list_head *
>  trace_get_fields(struct ftrace_event_call *event_call)
>  {
>  	if (!event_call->class->get_fields)
> @@ -44,23 +62,45 @@ trace_get_fields(struct ftrace_event_call *event_call)
>  	return event_call->class->get_fields(event_call);
>  }
>  
> +static struct ftrace_event_field *
> +__find_event_field(struct list_head *head, char *name)
> +{
> +	struct ftrace_event_field *field;
> +
> +	list_for_each_entry(field, head, link) {
> +		if (!strcmp(field->name, name))
> +			return field;
> +	}
> +
> +	return NULL;
> +}
> +
> +struct ftrace_event_field *
> +trace_find_event_field(struct ftrace_event_call *call, char *name)
> +{
> +	struct ftrace_event_field *field;
> +	struct list_head *head;
> +
> +	field = __find_event_field(&ftrace_common_fields, name);
> +	if (field)
> +		return field;
> +
> +	head = trace_get_fields(call);
> +	return __find_event_field(head, name);
> +}
> +
>  static int __trace_define_field(struct list_head *head, const char *type,
>  				const char *name, int offset, int size,
>  				int is_signed, int filter_type)
>  {
>  	struct ftrace_event_field *field;
>  
> -	field = kzalloc(sizeof(*field), GFP_KERNEL);
> +	field = kmem_cache_alloc(field_cachep, GFP_TRACE);
>  	if (!field)
>  		goto err;
>  
> -	field->name = kstrdup(name, GFP_KERNEL);
> -	if (!field->name)
> -		goto err;
> -
> -	field->type = kstrdup(type, GFP_KERNEL);
> -	if (!field->type)
> -		goto err;
> +	field->name = name;
> +	field->type = type;
>  
>  	if (filter_type == FILTER_OTHER)
>  		field->filter_type = filter_assign_type(type);
> @@ -76,9 +116,7 @@ static int __trace_define_field(struct list_head *head, const char *type,
>  	return 0;
>  
>  err:
> -	if (field)
> -		kfree(field->name);
> -	kfree(field);
> +	kmem_cache_free(field_cachep, field);
>  
>  	return -ENOMEM;
>  }
> @@ -120,7 +158,7 @@ static int trace_define_common_fields(void)
>  	return ret;
>  }
>  
> -void trace_destroy_fields(struct ftrace_event_call *call)
> +static void trace_destroy_fields(struct ftrace_event_call *call)
>  {
>  	struct ftrace_event_field *field, *next;
>  	struct list_head *head;
> @@ -128,9 +166,7 @@ void trace_destroy_fields(struct ftrace_event_call *call)
>  	head = trace_get_fields(call);
>  	list_for_each_entry_safe(field, next, head, link) {
>  		list_del(&field->link);
> -		kfree(field->type);
> -		kfree(field->name);
> -		kfree(field);
> +		kmem_cache_free(field_cachep, field);
>  	}
>  }
>  
> @@ -149,15 +185,17 @@ EXPORT_SYMBOL_GPL(trace_event_raw_init);
>  int ftrace_event_reg(struct ftrace_event_call *call,
>  		     enum trace_reg type, void *data)
>  {
> +	struct ftrace_event_file *file = data;
> +
>  	switch (type) {
>  	case TRACE_REG_REGISTER:
>  		return tracepoint_probe_register(call->name,
>  						 call->class->probe,
> -						 call);
> +						 file);
>  	case TRACE_REG_UNREGISTER:
>  		tracepoint_probe_unregister(call->name,
>  					    call->class->probe,
> -					    call);
> +					    file);
>  		return 0;
>  
>  #ifdef CONFIG_PERF_EVENTS
> @@ -183,54 +221,100 @@ EXPORT_SYMBOL_GPL(ftrace_event_reg);
>  
>  void trace_event_enable_cmd_record(bool enable)
>  {
> -	struct ftrace_event_call *call;
> +	struct ftrace_event_file *file;
> +	struct trace_array *tr;
>  
>  	mutex_lock(&event_mutex);
> -	list_for_each_entry(call, &ftrace_events, list) {
> -		if (!(call->flags & TRACE_EVENT_FL_ENABLED))
> +	do_for_each_event_file(tr, file) {
> +
> +		if (!(file->flags & FTRACE_EVENT_FL_ENABLED))
>  			continue;
>  
>  		if (enable) {
>  			tracing_start_cmdline_record();
> -			call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
> +			set_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
>  		} else {
>  			tracing_stop_cmdline_record();
> -			call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
> +			clear_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
>  		}
> -	}
> +	} while_for_each_event_file();
>  	mutex_unlock(&event_mutex);
>  }
>  
> -static int ftrace_event_enable_disable(struct ftrace_event_call *call,
> -					int enable)
> +static int __ftrace_event_enable_disable(struct ftrace_event_file *file,
> +					 int enable, int soft_disable)
>  {
> +	struct ftrace_event_call *call = file->event_call;
>  	int ret = 0;
> +	int disable;
>  
>  	switch (enable) {
>  	case 0:
> -		if (call->flags & TRACE_EVENT_FL_ENABLED) {
> -			call->flags &= ~TRACE_EVENT_FL_ENABLED;
> -			if (call->flags & TRACE_EVENT_FL_RECORDED_CMD) {
> +		/*
> +		 * When soft_disable is set and enable is cleared, we want
> +		 * to clear the SOFT_DISABLED flag but leave the event in the
> +		 * state that it was. That is, if the event was enabled and
> +		 * SOFT_DISABLED isn't set, then do nothing. But if SOFT_DISABLED
> +		 * is set we do not want the event to be enabled before we
> +		 * clear the bit.
> +		 *
> +		 * When soft_disable is not set but the SOFT_MODE flag is,
> +		 * we do nothing. Do not disable the tracepoint, otherwise
> +		 * "soft enable"s (clearing the SOFT_DISABLED bit) won't work.
> +		 */
> +		if (soft_disable) {
> +			disable = file->flags & FTRACE_EVENT_FL_SOFT_DISABLED;
> +			clear_bit(FTRACE_EVENT_FL_SOFT_MODE_BIT, &file->flags);
> +		} else
> +			disable = !(file->flags & FTRACE_EVENT_FL_SOFT_MODE);
> +
> +		if (disable && (file->flags & FTRACE_EVENT_FL_ENABLED)) {
> +			clear_bit(FTRACE_EVENT_FL_ENABLED_BIT, &file->flags);
> +			if (file->flags & FTRACE_EVENT_FL_RECORDED_CMD) {
>  				tracing_stop_cmdline_record();
> -				call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
> +				clear_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
>  			}
> -			call->class->reg(call, TRACE_REG_UNREGISTER, NULL);
> +			call->class->reg(call, TRACE_REG_UNREGISTER, file);
>  		}
> +		/* If in SOFT_MODE, just set the SOFT_DISABLED_BIT */
> +		if (file->flags & FTRACE_EVENT_FL_SOFT_MODE)
> +			set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
>  		break;
>  	case 1:
> -		if (!(call->flags & TRACE_EVENT_FL_ENABLED)) {
> +		/*
> +		 * When soft_disable is set and enable is set, we want to
> +		 * register the tracepoint for the event, but leave the event
> +		 * as is. That means, if the event was already enabled, we do
> +		 * nothing (but set SOFT_MODE). If the event is disabled, we
> +		 * set SOFT_DISABLED before enabling the event tracepoint, so
> +		 * it still seems to be disabled.
> +		 */
> +		if (!soft_disable)
> +			clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
> +		else
> +			set_bit(FTRACE_EVENT_FL_SOFT_MODE_BIT, &file->flags);
> +
> +		if (!(file->flags & FTRACE_EVENT_FL_ENABLED)) {
> +
> +			/* Keep the event disabled, when going to SOFT_MODE. */
> +			if (soft_disable)
> +				set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);
> +
>  			if (trace_flags & TRACE_ITER_RECORD_CMD) {
>  				tracing_start_cmdline_record();
> -				call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
> +				set_bit(FTRACE_EVENT_FL_RECORDED_CMD_BIT, &file->flags);
>  			}
> -			ret = call->class->reg(call, TRACE_REG_REGISTER, NULL);
> +			ret = call->class->reg(call, TRACE_REG_REGISTER, file);
>  			if (ret) {
>  				tracing_stop_cmdline_record();
>  				pr_info("event trace: Could not enable event "
>  					"%s\n", call->name);
>  				break;
>  			}
> -			call->flags |= TRACE_EVENT_FL_ENABLED;
> +			set_bit(FTRACE_EVENT_FL_ENABLED_BIT, &file->flags);
> +
> +			/* WAS_ENABLED gets set but never cleared. */
> +			call->flags |= TRACE_EVENT_FL_WAS_ENABLED;
>  		}
>  		break;
>  	}
> @@ -238,13 +322,19 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
>  	return ret;
>  }
>  
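
(A hedged sketch of how soft mode is meant to be used; the actual
callers are not in this hunk, so the sequence below is illustrative
only and skips locking:)

  	/* Arm: register the tracepoint but leave the event soft-disabled */
  	ret = __ftrace_event_enable_disable(file, 1, 1);

  	/* "soft enable": flip only the bit, the tracepoint stays registered */
  	clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);

  	/* "soft disable" again, still without touching the tracepoint */
  	set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);

  	/* Drop soft mode; unregister only if no one hard-enabled the event */
  	__ftrace_event_enable_disable(file, 0, 1);
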
> -static void ftrace_clear_events(void)
> +static int ftrace_event_enable_disable(struct ftrace_event_file *file,
> +				       int enable)
>  {
> -	struct ftrace_event_call *call;
> +	return __ftrace_event_enable_disable(file, enable, 0);
> +}
> +
> +static void ftrace_clear_events(struct trace_array *tr)
> +{
> +	struct ftrace_event_file *file;
>  
>  	mutex_lock(&event_mutex);
> -	list_for_each_entry(call, &ftrace_events, list) {
> -		ftrace_event_enable_disable(call, 0);
> +	list_for_each_entry(file, &tr->events, list) {
> +		ftrace_event_enable_disable(file, 0);
>  	}
>  	mutex_unlock(&event_mutex);
>  }
> @@ -257,11 +347,12 @@ static void __put_system(struct event_subsystem *system)
>  	if (--system->ref_count)
>  		return;
>  
> +	list_del(&system->list);
> +
>  	if (filter) {
>  		kfree(filter->filter_string);
>  		kfree(filter);
>  	}
> -	kfree(system->name);
>  	kfree(system);
>  }
>  
> @@ -271,24 +362,45 @@ static void __get_system(struct event_subsystem *system)
>  	system->ref_count++;
>  }
>  
> -static void put_system(struct event_subsystem *system)
> +static void __get_system_dir(struct ftrace_subsystem_dir *dir)
> +{
> +	WARN_ON_ONCE(dir->ref_count == 0);
> +	dir->ref_count++;
> +	__get_system(dir->subsystem);
> +}
> +
> +static void __put_system_dir(struct ftrace_subsystem_dir *dir)
> +{
> +	WARN_ON_ONCE(dir->ref_count == 0);
> +	/* If the subsystem is about to be freed, the dir must be too */
> +	WARN_ON_ONCE(dir->subsystem->ref_count == 1 && dir->ref_count != 1);
> +
> +	__put_system(dir->subsystem);
> +	if (!--dir->ref_count)
> +		kfree(dir);
> +}
> +
> +static void put_system(struct ftrace_subsystem_dir *dir)
>  {
>  	mutex_lock(&event_mutex);
> -	__put_system(system);
> +	__put_system_dir(dir);
>  	mutex_unlock(&event_mutex);
>  }
>  
>  /*
>   * __ftrace_set_clr_event(NULL, NULL, NULL, set) will set/unset all events.
>   */
> -static int __ftrace_set_clr_event(const char *match, const char *sub,
> -				  const char *event, int set)
> +static int __ftrace_set_clr_event(struct trace_array *tr, const char *match,
> +				  const char *sub, const char *event, int set)
>  {
> +	struct ftrace_event_file *file;
>  	struct ftrace_event_call *call;
>  	int ret = -EINVAL;
>  
>  	mutex_lock(&event_mutex);
> -	list_for_each_entry(call, &ftrace_events, list) {
> +	list_for_each_entry(file, &tr->events, list) {
> +
> +		call = file->event_call;
>  
>  		if (!call->name || !call->class || !call->class->reg)
>  			continue;
> @@ -307,7 +419,7 @@ static int __ftrace_set_clr_event(const char *match, const char *sub,
>  		if (event && strcmp(event, call->name) != 0)
>  			continue;
>  
> -		ftrace_event_enable_disable(call, set);
> +		ftrace_event_enable_disable(file, set);
>  
>  		ret = 0;
>  	}
> @@ -316,7 +428,7 @@ static int __ftrace_set_clr_event(const char *match, const char *sub,
>  	return ret;
>  }
>  
> -static int ftrace_set_clr_event(char *buf, int set)
> +static int ftrace_set_clr_event(struct trace_array *tr, char *buf, int set)
>  {
>  	char *event = NULL, *sub = NULL, *match;
>  
> @@ -344,7 +456,7 @@ static int ftrace_set_clr_event(char *buf, int set)
>  			event = NULL;
>  	}
>  
> -	return __ftrace_set_clr_event(match, sub, event, set);
> +	return __ftrace_set_clr_event(tr, match, sub, event, set);
>  }
>  
>  /**
> @@ -361,7 +473,9 @@ static int ftrace_set_clr_event(char *buf, int set)
>   */
>  int trace_set_clr_event(const char *system, const char *event, int set)
>  {
> -	return __ftrace_set_clr_event(NULL, system, event, set);
> +	struct trace_array *tr = top_trace_array();
> +
> +	return __ftrace_set_clr_event(tr, NULL, system, event, set);
>  }
>  EXPORT_SYMBOL_GPL(trace_set_clr_event);
>  
> @@ -373,6 +487,8 @@ ftrace_event_write(struct file *file, const char __user *ubuf,
>  		   size_t cnt, loff_t *ppos)
>  {
>  	struct trace_parser parser;
> +	struct seq_file *m = file->private_data;
> +	struct trace_array *tr = m->private;
>  	ssize_t read, ret;
>  
>  	if (!cnt)
> @@ -395,7 +511,7 @@ ftrace_event_write(struct file *file, const char __user *ubuf,
>  
>  		parser.buffer[parser.idx] = 0;
>  
> -		ret = ftrace_set_clr_event(parser.buffer + !set, set);
> +		ret = ftrace_set_clr_event(tr, parser.buffer + !set, set);
>  		if (ret)
>  			goto out_put;
>  	}
> @@ -411,17 +527,20 @@ ftrace_event_write(struct file *file, const char __user *ubuf,
>  static void *
>  t_next(struct seq_file *m, void *v, loff_t *pos)
>  {
> -	struct ftrace_event_call *call = v;
> +	struct ftrace_event_file *file = v;
> +	struct ftrace_event_call *call;
> +	struct trace_array *tr = m->private;
>  
>  	(*pos)++;
>  
> -	list_for_each_entry_continue(call, &ftrace_events, list) {
> +	list_for_each_entry_continue(file, &tr->events, list) {
> +		call = file->event_call;
>  		/*
>  		 * The ftrace subsystem is for showing formats only.
>  		 * They can not be enabled or disabled via the event files.
>  		 */
>  		if (call->class && call->class->reg)
> -			return call;
> +			return file;
>  	}
>  
>  	return NULL;
> @@ -429,30 +548,32 @@ t_next(struct seq_file *m, void *v, loff_t *pos)
>  
>  static void *t_start(struct seq_file *m, loff_t *pos)
>  {
> -	struct ftrace_event_call *call;
> +	struct ftrace_event_file *file;
> +	struct trace_array *tr = m->private;
>  	loff_t l;
>  
>  	mutex_lock(&event_mutex);
>  
> -	call = list_entry(&ftrace_events, struct ftrace_event_call, list);
> +	file = list_entry(&tr->events, struct ftrace_event_file, list);
>  	for (l = 0; l <= *pos; ) {
> -		call = t_next(m, call, &l);
> -		if (!call)
> +		file = t_next(m, file, &l);
> +		if (!file)
>  			break;
>  	}
> -	return call;
> +	return file;
>  }
>  
>  static void *
>  s_next(struct seq_file *m, void *v, loff_t *pos)
>  {
> -	struct ftrace_event_call *call = v;
> +	struct ftrace_event_file *file = v;
> +	struct trace_array *tr = m->private;
>  
>  	(*pos)++;
>  
> -	list_for_each_entry_continue(call, &ftrace_events, list) {
> -		if (call->flags & TRACE_EVENT_FL_ENABLED)
> -			return call;
> +	list_for_each_entry_continue(file, &tr->events, list) {
> +		if (file->flags & FTRACE_EVENT_FL_ENABLED)
> +			return file;
>  	}
>  
>  	return NULL;
> @@ -460,23 +581,25 @@ s_next(struct seq_file *m, void *v, loff_t *pos)
>  
>  static void *s_start(struct seq_file *m, loff_t *pos)
>  {
> -	struct ftrace_event_call *call;
> +	struct ftrace_event_file *file;
> +	struct trace_array *tr = m->private;
>  	loff_t l;
>  
>  	mutex_lock(&event_mutex);
>  
> -	call = list_entry(&ftrace_events, struct ftrace_event_call, list);
> +	file = list_entry(&tr->events, struct ftrace_event_file, list);
>  	for (l = 0; l <= *pos; ) {
> -		call = s_next(m, call, &l);
> -		if (!call)
> +		file = s_next(m, file, &l);
> +		if (!file)
>  			break;
>  	}
> -	return call;
> +	return file;
>  }
>  
>  static int t_show(struct seq_file *m, void *v)
>  {
> -	struct ftrace_event_call *call = v;
> +	struct ftrace_event_file *file = v;
> +	struct ftrace_event_call *call = file->event_call;
>  
>  	if (strcmp(call->class->system, TRACE_SYSTEM) != 0)
>  		seq_printf(m, "%s:", call->class->system);
> @@ -494,25 +617,31 @@ static ssize_t
>  event_enable_read(struct file *filp, char __user *ubuf, size_t cnt,
>  		  loff_t *ppos)
>  {
> -	struct ftrace_event_call *call = filp->private_data;
> +	struct ftrace_event_file *file = filp->private_data;
>  	char *buf;
>  
> -	if (call->flags & TRACE_EVENT_FL_ENABLED)
> -		buf = "1\n";
> -	else
> +	if (file->flags & FTRACE_EVENT_FL_ENABLED) {
> +		if (file->flags & FTRACE_EVENT_FL_SOFT_DISABLED)
> +			buf = "0*\n";
> +		else
> +			buf = "1\n";
> +	} else
>  		buf = "0\n";
>  
> -	return simple_read_from_buffer(ubuf, cnt, ppos, buf, 2);
> +	return simple_read_from_buffer(ubuf, cnt, ppos, buf, strlen(buf));
>  }
>  
>  static ssize_t
>  event_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  		   loff_t *ppos)
>  {
> -	struct ftrace_event_call *call = filp->private_data;
> +	struct ftrace_event_file *file = filp->private_data;
>  	unsigned long val;
>  	int ret;
>  
> +	if (!file)
> +		return -EINVAL;
> +
>  	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
>  	if (ret)
>  		return ret;
> @@ -525,7 +654,7 @@ event_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  	case 0:
>  	case 1:
>  		mutex_lock(&event_mutex);
> -		ret = ftrace_event_enable_disable(call, val);
> +		ret = ftrace_event_enable_disable(file, val);
>  		mutex_unlock(&event_mutex);
>  		break;
>  
> @@ -543,14 +672,18 @@ system_enable_read(struct file *filp, char __user *ubuf, size_t cnt,
>  		   loff_t *ppos)
>  {
>  	const char set_to_char[4] = { '?', '0', '1', 'X' };
> -	struct event_subsystem *system = filp->private_data;
> +	struct ftrace_subsystem_dir *dir = filp->private_data;
> +	struct event_subsystem *system = dir->subsystem;
>  	struct ftrace_event_call *call;
> +	struct ftrace_event_file *file;
> +	struct trace_array *tr = dir->tr;
>  	char buf[2];
>  	int set = 0;
>  	int ret;
>  
>  	mutex_lock(&event_mutex);
> -	list_for_each_entry(call, &ftrace_events, list) {
> +	list_for_each_entry(file, &tr->events, list) {
> +		call = file->event_call;
>  		if (!call->name || !call->class || !call->class->reg)
>  			continue;
>  
> @@ -562,7 +695,7 @@ system_enable_read(struct file *filp, char __user *ubuf, size_t cnt,
>  		 * or if all events are cleared, or if we have
>  		 * a mixture.
>  		 */
> -		set |= (1 << !!(call->flags & TRACE_EVENT_FL_ENABLED));
> +		set |= (1 << !!(file->flags & FTRACE_EVENT_FL_ENABLED));
>  
>  		/*
>  		 * If we have a mixture, no need to look further.
> @@ -584,7 +717,8 @@ static ssize_t
>  system_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  		    loff_t *ppos)
>  {
> -	struct event_subsystem *system = filp->private_data;
> +	struct ftrace_subsystem_dir *dir = filp->private_data;
> +	struct event_subsystem *system = dir->subsystem;
>  	const char *name = NULL;
>  	unsigned long val;
>  	ssize_t ret;
> @@ -607,7 +741,7 @@ system_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  	if (system)
>  		name = system->name;
>  
> -	ret = __ftrace_set_clr_event(NULL, name, NULL, val);
> +	ret = __ftrace_set_clr_event(dir->tr, NULL, name, NULL, val);
>  	if (ret)
>  		goto out;
>  
> @@ -845,43 +979,75 @@ static LIST_HEAD(event_subsystems);
>  static int subsystem_open(struct inode *inode, struct file *filp)
>  {
>  	struct event_subsystem *system = NULL;
> +	struct ftrace_subsystem_dir *dir = NULL; /* Initialize for gcc */
> +	struct trace_array *tr;
>  	int ret;
>  
> -	if (!inode->i_private)
> -		goto skip_search;
> -
>  	/* Make sure the system still exists */
>  	mutex_lock(&event_mutex);
> -	list_for_each_entry(system, &event_subsystems, list) {
> -		if (system == inode->i_private) {
> -			/* Don't open systems with no events */
> -			if (!system->nr_events) {
> -				system = NULL;
> -				break;
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
> +		list_for_each_entry(dir, &tr->systems, list) {
> +			if (dir == inode->i_private) {
> +				/* Don't open systems with no events */
> +				if (dir->nr_events) {
> +					__get_system_dir(dir);
> +					system = dir->subsystem;
> +				}
> +				goto exit_loop;
>  			}
> -			__get_system(system);
> -			break;
>  		}
>  	}
> + exit_loop:
>  	mutex_unlock(&event_mutex);
>  
> -	if (system != inode->i_private)
> +	if (!system)
>  		return -ENODEV;
>  
> - skip_search:
> +	/* Some versions of gcc think dir can be uninitialized here */
> +	WARN_ON(!dir);
> +
> +	ret = tracing_open_generic(inode, filp);
> +	if (ret < 0)
> +		put_system(dir);
> +
> +	return ret;
> +}
> +
> +static int system_tr_open(struct inode *inode, struct file *filp)
> +{
> +	struct ftrace_subsystem_dir *dir;
> +	struct trace_array *tr = inode->i_private;
> +	int ret;
> +
> +	/* Make a temporary dir that has no system but points to tr */
> +	dir = kzalloc(sizeof(*dir), GFP_KERNEL);
> +	if (!dir)
> +		return -ENOMEM;
> +
> +	dir->tr = tr;
> +
>  	ret = tracing_open_generic(inode, filp);
> -	if (ret < 0 && system)
> -		put_system(system);
> +	if (ret < 0)
> +		kfree(dir);
> +
> +	filp->private_data = dir;
>  
>  	return ret;
>  }
>  
>  static int subsystem_release(struct inode *inode, struct file *file)
>  {
> -	struct event_subsystem *system = inode->i_private;
> +	struct ftrace_subsystem_dir *dir = file->private_data;
>  
> -	if (system)
> -		put_system(system);
> +	/*
> +	 * If dir->subsystem is NULL, then this is a temporary
> +	 * descriptor that was made for a trace_array to enable
> +	 * all subsystems.
> +	 */
> +	if (dir->subsystem)
> +		put_system(dir);
> +	else
> +		kfree(dir);
>  
>  	return 0;
>  }
> @@ -890,7 +1056,8 @@ static ssize_t
>  subsystem_filter_read(struct file *filp, char __user *ubuf, size_t cnt,
>  		      loff_t *ppos)
>  {
> -	struct event_subsystem *system = filp->private_data;
> +	struct ftrace_subsystem_dir *dir = filp->private_data;
> +	struct event_subsystem *system = dir->subsystem;
>  	struct trace_seq *s;
>  	int r;
>  
> @@ -915,7 +1082,7 @@ static ssize_t
>  subsystem_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  		       loff_t *ppos)
>  {
> -	struct event_subsystem *system = filp->private_data;
> +	struct ftrace_subsystem_dir *dir = filp->private_data;
>  	char *buf;
>  	int err;
>  
> @@ -932,7 +1099,7 @@ subsystem_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,
>  	}
>  	buf[cnt] = '\0';
>  
> -	err = apply_subsystem_event_filter(system, buf);
> +	err = apply_subsystem_event_filter(dir, buf);
>  	free_page((unsigned long) buf);
>  	if (err < 0)
>  		return err;
> @@ -1041,30 +1208,35 @@ static const struct file_operations ftrace_system_enable_fops = {
>  	.release = subsystem_release,
>  };
>  
> +static const struct file_operations ftrace_tr_enable_fops = {
> +	.open = system_tr_open,
> +	.read = system_enable_read,
> +	.write = system_enable_write,
> +	.llseek = default_llseek,
> +	.release = subsystem_release,
> +};
> +
>  static const struct file_operations ftrace_show_header_fops = {
>  	.open = tracing_open_generic,
>  	.read = show_header,
>  	.llseek = default_llseek,
>  };
>  
> -static struct dentry *event_trace_events_dir(void)
> +static int
> +ftrace_event_open(struct inode *inode, struct file *file,
> +		  const struct seq_operations *seq_ops)
>  {
> -	static struct dentry *d_tracer;
> -	static struct dentry *d_events;
> -
> -	if (d_events)
> -		return d_events;
> -
> -	d_tracer = tracing_init_dentry();
> -	if (!d_tracer)
> -		return NULL;
> +	struct seq_file *m;
> +	int ret;
>  
> -	d_events = debugfs_create_dir("events", d_tracer);
> -	if (!d_events)
> -		pr_warning("Could not create debugfs "
> -			   "'events' directory\n");
> +	ret = seq_open(file, seq_ops);
> +	if (ret < 0)
> +		return ret;
> +	m = file->private_data;
> +	/* copy tr over to seq ops */
> +	m->private = inode->i_private;
>  
> -	return d_events;
> +	return ret;
>  }
>  
>  static int
> @@ -1072,117 +1244,165 @@ ftrace_event_avail_open(struct inode *inode, struct file *file)
>  {
>  	const struct seq_operations *seq_ops = &show_event_seq_ops;
>  
> -	return seq_open(file, seq_ops);
> +	return ftrace_event_open(inode, file, seq_ops);
>  }
>  
>  static int
>  ftrace_event_set_open(struct inode *inode, struct file *file)
>  {
>  	const struct seq_operations *seq_ops = &show_set_event_seq_ops;
> +	struct trace_array *tr = inode->i_private;
>  
>  	if ((file->f_mode & FMODE_WRITE) &&
>  	    (file->f_flags & O_TRUNC))
> -		ftrace_clear_events();
> +		ftrace_clear_events(tr);
> +
> +	return ftrace_event_open(inode, file, seq_ops);
> +}
> +
> +static struct event_subsystem *
> +create_new_subsystem(const char *name)
> +{
> +	struct event_subsystem *system;
> +
> +	/* need to create new entry */
> +	system = kmalloc(sizeof(*system), GFP_KERNEL);
> +	if (!system)
> +		return NULL;
> +
> +	system->ref_count = 1;
> +	system->name = name;
> +
> +	system->filter = NULL;
> +
> +	system->filter = kzalloc(sizeof(struct event_filter), GFP_KERNEL);
> +	if (!system->filter)
> +		goto out_free;
> +
> +	list_add(&system->list, &event_subsystems);
> +
> +	return system;
>  
> -	return seq_open(file, seq_ops);
> + out_free:
> +	kfree(system);
> +	return NULL;
>  }
>  
>  static struct dentry *
> -event_subsystem_dir(const char *name, struct dentry *d_events)
> +event_subsystem_dir(struct trace_array *tr, const char *name,
> +		    struct ftrace_event_file *file, struct dentry *parent)
>  {
> +	struct ftrace_subsystem_dir *dir;
>  	struct event_subsystem *system;
>  	struct dentry *entry;
>  
>  	/* First see if we did not already create this dir */
> -	list_for_each_entry(system, &event_subsystems, list) {
> +	list_for_each_entry(dir, &tr->systems, list) {
> +		system = dir->subsystem;
>  		if (strcmp(system->name, name) == 0) {
> -			system->nr_events++;
> -			return system->entry;
> +			dir->nr_events++;
> +			file->system = dir;
> +			return dir->entry;
>  		}
>  	}
>  
> -	/* need to create new entry */
> -	system = kmalloc(sizeof(*system), GFP_KERNEL);
> -	if (!system) {
> -		pr_warning("No memory to create event subsystem %s\n",
> -			   name);
> -		return d_events;
> +	/* Now see if the system itself exists. */
> +	list_for_each_entry(system, &event_subsystems, list) {
> +		if (strcmp(system->name, name) == 0)
> +			break;
>  	}
> +	/* Reset system variable when not found */
> +	if (&system->list == &event_subsystems)
> +		system = NULL;
>  
> -	system->entry = debugfs_create_dir(name, d_events);
> -	if (!system->entry) {
> -		pr_warning("Could not create event subsystem %s\n",
> -			   name);
> -		kfree(system);
> -		return d_events;
> -	}
> +	dir = kmalloc(sizeof(*dir), GFP_KERNEL);
> +	if (!dir)
> +		goto out_fail;
>  
> -	system->nr_events = 1;
> -	system->ref_count = 1;
> -	system->name = kstrdup(name, GFP_KERNEL);
> -	if (!system->name) {
> -		debugfs_remove(system->entry);
> -		kfree(system);
> -		return d_events;
> +	if (!system) {
> +		system = create_new_subsystem(name);
> +		if (!system)
> +			goto out_free;
> +	} else
> +		__get_system(system);
> +
> +	dir->entry = debugfs_create_dir(name, parent);
> +	if (!dir->entry) {
> +		pr_warning("Failed to create system directory %s\n", name);
> +		__put_system(system);
> +		goto out_free;
>  	}
>  
> -	list_add(&system->list, &event_subsystems);
> -
> -	system->filter = NULL;
> -
> -	system->filter = kzalloc(sizeof(struct event_filter), GFP_KERNEL);
> -	if (!system->filter) {
> -		pr_warning("Could not allocate filter for subsystem "
> -			   "'%s'\n", name);
> -		return system->entry;
> -	}
> +	dir->tr = tr;
> +	dir->ref_count = 1;
> +	dir->nr_events = 1;
> +	dir->subsystem = system;
> +	file->system = dir;
>  
> -	entry = debugfs_create_file("filter", 0644, system->entry, system,
> +	entry = debugfs_create_file("filter", 0644, dir->entry, dir,
>  				    &ftrace_subsystem_filter_fops);
>  	if (!entry) {
>  		kfree(system->filter);
>  		system->filter = NULL;
> -		pr_warning("Could not create debugfs "
> -			   "'%s/filter' entry\n", name);
> +		pr_warning("Could not create debugfs '%s/filter' entry\n", name);
>  	}
>  
> -	trace_create_file("enable", 0644, system->entry, system,
> +	trace_create_file("enable", 0644, dir->entry, dir,
>  			  &ftrace_system_enable_fops);
>  
> -	return system->entry;
> +	list_add(&dir->list, &tr->systems);
> +
> +	return dir->entry;
> +
> + out_free:
> +	kfree(dir);
> + out_fail:
> +	/* Only print this message if the failure was a memory allocation */
> +	if (!dir || !system)
> +		pr_warning("No memory to create event subsystem %s\n",
> +			   name);
> +	return NULL;
>  }
>  
>  static int
> -event_create_dir(struct ftrace_event_call *call, struct dentry *d_events,
> +event_create_dir(struct dentry *parent,
> +		 struct ftrace_event_file *file,
>  		 const struct file_operations *id,
>  		 const struct file_operations *enable,
>  		 const struct file_operations *filter,
>  		 const struct file_operations *format)
>  {
> +	struct ftrace_event_call *call = file->event_call;
> +	struct trace_array *tr = file->tr;
>  	struct list_head *head;
> +	struct dentry *d_events;
>  	int ret;
>  
>  	/*
>  	 * If the trace point header did not define TRACE_SYSTEM
>  	 * then the system would be called "TRACE_SYSTEM".
>  	 */
> -	if (strcmp(call->class->system, TRACE_SYSTEM) != 0)
> -		d_events = event_subsystem_dir(call->class->system, d_events);
> -
> -	call->dir = debugfs_create_dir(call->name, d_events);
> -	if (!call->dir) {
> -		pr_warning("Could not create debugfs "
> -			   "'%s' directory\n", call->name);
> +	if (strcmp(call->class->system, TRACE_SYSTEM) != 0) {
> +		d_events = event_subsystem_dir(tr, call->class->system, file, parent);
> +		if (!d_events)
> +			return -ENOMEM;
> +	} else
> +		d_events = parent;
> +
> +	file->dir = debugfs_create_dir(call->name, d_events);
> +	if (!file->dir) {
> +		pr_warning("Could not create debugfs '%s' directory\n",
> +			   call->name);
>  		return -1;
>  	}
>  
>  	if (call->class->reg && !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE))
> -		trace_create_file("enable", 0644, call->dir, call,
> +		trace_create_file("enable", 0644, file->dir, file,
>  				  enable);
>  
>  #ifdef CONFIG_PERF_EVENTS
>  	if (call->event.type && call->class->reg)
> -		trace_create_file("id", 0444, call->dir, call,
> +		trace_create_file("id", 0444, file->dir, call,
>  		 		  id);
>  #endif
>  
> @@ -1196,23 +1416,76 @@ event_create_dir(struct ftrace_event_call *call, struct dentry *d_events,
>  		if (ret < 0) {
>  			pr_warning("Could not initialize trace point"
>  				   " events/%s\n", call->name);
> -			return ret;
> +			return -1;
>  		}
>  	}
> -	trace_create_file("filter", 0644, call->dir, call,
> +	trace_create_file("filter", 0644, file->dir, call,
>  			  filter);
>  
> -	trace_create_file("format", 0444, call->dir, call,
> +	trace_create_file("format", 0444, file->dir, call,
>  			  format);
>  
>  	return 0;
>  }
>  
> +static void remove_subsystem(struct ftrace_subsystem_dir *dir)
> +{
> +	if (!dir)
> +		return;
> +
> +	if (!--dir->nr_events) {
> +		debugfs_remove_recursive(dir->entry);
> +		list_del(&dir->list);
> +		__put_system_dir(dir);
> +	}
> +}
> +
> +static void remove_event_from_tracers(struct ftrace_event_call *call)
> +{
> +	struct ftrace_event_file *file;
> +	struct trace_array *tr;
> +
> +	do_for_each_event_file_safe(tr, file) {
> +
> +		if (file->event_call != call)
> +			continue;
> +
> +		list_del(&file->list);
> +		debugfs_remove_recursive(file->dir);
> +		remove_subsystem(file->system);
> +		kmem_cache_free(file_cachep, file);
> +
> +		/*
> +		 * The do_for_each_event_file_safe() is
> +		 * a double loop. After finding the call for this
> +		 * trace_array, we use break to jump to the next
> +		 * trace_array.
> +		 */
> +		break;
> +	} while_for_each_event_file();
> +}
> +
>  static void event_remove(struct ftrace_event_call *call)
>  {
> -	ftrace_event_enable_disable(call, 0);
> +	struct trace_array *tr;
> +	struct ftrace_event_file *file;
> +
> +	do_for_each_event_file(tr, file) {
> +		if (file->event_call != call)
> +			continue;
> +		ftrace_event_enable_disable(file, 0);
> +		/*
> +		 * The do_for_each_event_file() is
> +		 * a double loop. After finding the call for this
> +		 * trace_array, we use break to jump to the next
> +		 * trace_array.
> +		 */
> +		break;
> +	} while_for_each_event_file();
> +
>  	if (call->event.funcs)
>  		__unregister_ftrace_event(&call->event);
> +	remove_event_from_tracers(call);
>  	list_del(&call->list);
>  }
>  
> @@ -1234,82 +1507,99 @@ static int event_init(struct ftrace_event_call *call)
>  }
>  
>  static int
> -__trace_add_event_call(struct ftrace_event_call *call, struct module *mod,
> -		       const struct file_operations *id,
> -		       const struct file_operations *enable,
> -		       const struct file_operations *filter,
> -		       const struct file_operations *format)
> +__register_event(struct ftrace_event_call *call, struct module *mod)
>  {
> -	struct dentry *d_events;
>  	int ret;
>  
>  	ret = event_init(call);
>  	if (ret < 0)
>  		return ret;
>  
> -	d_events = event_trace_events_dir();
> -	if (!d_events)
> -		return -ENOENT;
> -
> -	ret = event_create_dir(call, d_events, id, enable, filter, format);
> -	if (!ret)
> -		list_add(&call->list, &ftrace_events);
> +	list_add(&call->list, &ftrace_events);
>  	call->mod = mod;
>  
> -	return ret;
> +	return 0;
> +}
> +
> +/* Add an event to a trace directory */
> +static int
> +__trace_add_new_event(struct ftrace_event_call *call,
> +		      struct trace_array *tr,
> +		      const struct file_operations *id,
> +		      const struct file_operations *enable,
> +		      const struct file_operations *filter,
> +		      const struct file_operations *format)
> +{
> +	struct ftrace_event_file *file;
> +
> +	file = kmem_cache_alloc(file_cachep, GFP_TRACE);
> +	if (!file)
> +		return -ENOMEM;
> +
> +	file->event_call = call;
> +	file->tr = tr;
> +	list_add(&file->list, &tr->events);
> +
> +	return event_create_dir(tr->event_dir, file, id, enable, filter, format);
> +}
> +
> +/*
> + * Just create a descriptor for early init. A descriptor is required
> + * for enabling events at boot. We want to enable events before
> + * the filesystem is initialized.
> + */
> +static __init int
> +__trace_early_add_new_event(struct ftrace_event_call *call,
> +			    struct trace_array *tr)
> +{
> +	struct ftrace_event_file *file;
> +
> +	file = kmem_cache_alloc(file_cachep, GFP_TRACE);
> +	if (!file)
> +		return -ENOMEM;
> +
> +	file->event_call = call;
> +	file->tr = tr;
> +	list_add(&file->list, &tr->events);
> +
> +	return 0;
>  }
>  
> +struct ftrace_module_file_ops;
> +static void __add_event_to_tracers(struct ftrace_event_call *call,
> +				   struct ftrace_module_file_ops *file_ops);
> +
>  /* Add an additional event_call dynamically */
>  int trace_add_event_call(struct ftrace_event_call *call)
>  {
>  	int ret;
>  	mutex_lock(&event_mutex);
> -	ret = __trace_add_event_call(call, NULL, &ftrace_event_id_fops,
> -				     &ftrace_enable_fops,
> -				     &ftrace_event_filter_fops,
> -				     &ftrace_event_format_fops);
> -	mutex_unlock(&event_mutex);
> -	return ret;
> -}
>  
> -static void remove_subsystem_dir(const char *name)
> -{
> -	struct event_subsystem *system;
> -
> -	if (strcmp(name, TRACE_SYSTEM) == 0)
> -		return;
> +	ret = __register_event(call, NULL);
> +	if (ret >= 0)
> +		__add_event_to_tracers(call, NULL);
>  
> -	list_for_each_entry(system, &event_subsystems, list) {
> -		if (strcmp(system->name, name) == 0) {
> -			if (!--system->nr_events) {
> -				debugfs_remove_recursive(system->entry);
> -				list_del(&system->list);
> -				__put_system(system);
> -			}
> -			break;
> -		}
> -	}
> +	mutex_unlock(&event_mutex);
> +	return ret;
>  }
>  
>  /*
> - * Must be called under locking both of event_mutex and trace_event_mutex.
> + * Must be called under locking both of event_mutex and trace_event_sem.
>   */
>  static void __trace_remove_event_call(struct ftrace_event_call *call)
>  {
>  	event_remove(call);
>  	trace_destroy_fields(call);
>  	destroy_preds(call);
> -	debugfs_remove_recursive(call->dir);
> -	remove_subsystem_dir(call->class->system);
>  }
>  
>  /* Remove an event_call */
>  void trace_remove_event_call(struct ftrace_event_call *call)
>  {
>  	mutex_lock(&event_mutex);
> -	down_write(&trace_event_mutex);
> +	down_write(&trace_event_sem);
>  	__trace_remove_event_call(call);
> -	up_write(&trace_event_mutex);
> +	up_write(&trace_event_sem);
>  	mutex_unlock(&event_mutex);
>  }
>  
> @@ -1336,6 +1626,26 @@ struct ftrace_module_file_ops {
>  };
>  
>  static struct ftrace_module_file_ops *
> +find_ftrace_file_ops(struct ftrace_module_file_ops *file_ops, struct module *mod)
> +{
> +	/*
> +	 * As event_calls are added in groups by module,
> +	 * when we find one file_ops, we don't need to search for
> +	 * each call in that module, as the rest should be the
> +	 * same. Only search for a new one if the last one did
> +	 * not match.
> +	 */
> +	if (file_ops && mod == file_ops->mod)
> +		return file_ops;
> +
> +	list_for_each_entry(file_ops, &ftrace_module_file_list, list) {
> +		if (file_ops->mod == mod)
> +			return file_ops;
> +	}
> +	return NULL;
> +}
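
find_ftrace_file_ops() leans on the fact that a module's events are registered
back to back, so the previously matched file_ops is usually the right one and
the list walk can often be skipped. A minimal userspace model of that
check-the-last-hit-first lookup; mod_ops and find_ops are invented stand-ins
for the kernel structures, not the actual API:

#include <stdio.h>
#include <string.h>

struct mod_ops {
    const char *mod;            /* stand-in for struct module * */
    struct mod_ops *next;
};

/* Return the cached entry when the module matches, otherwise scan the list. */
static struct mod_ops *find_ops(struct mod_ops *cached, struct mod_ops *head,
                                const char *mod)
{
    struct mod_ops *ops;

    if (cached && strcmp(cached->mod, mod) == 0)
        return cached;

    for (ops = head; ops; ops = ops->next)
        if (strcmp(ops->mod, mod) == 0)
            return ops;
    return NULL;
}

int main(void)
{
    struct mod_ops b = { "mod_b", NULL };
    struct mod_ops a = { "mod_a", &b };
    struct mod_ops *cached = NULL;

    /* The second lookup for the same module hits the cache, no list walk. */
    cached = find_ops(cached, &a, "mod_b");
    cached = find_ops(cached, &a, "mod_b");
    printf("%s\n", cached ? cached->mod : "none");
    return 0;
}
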
> +
> +static struct ftrace_module_file_ops *
>  trace_create_file_ops(struct module *mod)
>  {
>  	struct ftrace_module_file_ops *file_ops;
> @@ -1386,9 +1696,8 @@ static void trace_module_add_events(struct module *mod)
>  		return;
>  
>  	for_each_event(call, start, end) {
> -		__trace_add_event_call(*call, mod,
> -				       &file_ops->id, &file_ops->enable,
> -				       &file_ops->filter, &file_ops->format);
> +		__register_event(*call, mod);
> +		__add_event_to_tracers(*call, file_ops);
>  	}
>  }
>  
> @@ -1396,12 +1705,13 @@ static void trace_module_remove_events(struct module *mod)
>  {
>  	struct ftrace_module_file_ops *file_ops;
>  	struct ftrace_event_call *call, *p;
> -	bool found = false;
> +	bool clear_trace = false;
>  
> -	down_write(&trace_event_mutex);
> +	down_write(&trace_event_sem);
>  	list_for_each_entry_safe(call, p, &ftrace_events, list) {
>  		if (call->mod == mod) {
> -			found = true;
> +			if (call->flags & TRACE_EVENT_FL_WAS_ENABLED)
> +				clear_trace = true;
>  			__trace_remove_event_call(call);
>  		}
>  	}
> @@ -1415,14 +1725,18 @@ static void trace_module_remove_events(struct module *mod)
>  		list_del(&file_ops->list);
>  		kfree(file_ops);
>  	}
> +	up_write(&trace_event_sem);
>  
>  	/*
>  	 * It is safest to reset the ring buffer if the module being unloaded
> -	 * registered any events.
> +	 * registered any events that were used. The only worry is if
> +	 * a new module gets loaded, and takes on the same id as the events
> +	 * of this module. When printing out the buffer, traced events left
> +	 * over from this module may be passed to the new module events and
> +	 * unexpected results may occur.
>  	 */
> -	if (found)
> -		tracing_reset_current_online_cpus();
> -	up_write(&trace_event_mutex);
> +	if (clear_trace)
> +		tracing_reset_all_online_cpus();
>  }
>  
>  static int trace_module_notify(struct notifier_block *self,
> @@ -1443,36 +1757,575 @@ static int trace_module_notify(struct notifier_block *self,
>  
>  	return 0;
>  }
> +
> +static int
> +__trace_add_new_mod_event(struct ftrace_event_call *call,
> +			  struct trace_array *tr,
> +			  struct ftrace_module_file_ops *file_ops)
> +{
> +	return __trace_add_new_event(call, tr,
> +				     &file_ops->id, &file_ops->enable,
> +				     &file_ops->filter, &file_ops->format);
> +}
> +
>  #else
> -static int trace_module_notify(struct notifier_block *self,
> -			       unsigned long val, void *data)
> +static inline struct ftrace_module_file_ops *
> +find_ftrace_file_ops(struct ftrace_module_file_ops *file_ops, struct module *mod)
> +{
> +	return NULL;
> +}
> +static inline int trace_module_notify(struct notifier_block *self,
> +				      unsigned long val, void *data)
>  {
>  	return 0;
>  }
> +static inline int
> +__trace_add_new_mod_event(struct ftrace_event_call *call,
> +			  struct trace_array *tr,
> +			  struct ftrace_module_file_ops *file_ops)
> +{
> +	return -ENODEV;
> +}
>  #endif /* CONFIG_MODULES */
>  
> -static struct notifier_block trace_module_nb = {
> -	.notifier_call = trace_module_notify,
> -	.priority = 0,
> -};
> -
> -extern struct ftrace_event_call *__start_ftrace_events[];
> -extern struct ftrace_event_call *__stop_ftrace_events[];
> -
> -static char bootup_event_buf[COMMAND_LINE_SIZE] __initdata;
> -
> -static __init int setup_trace_event(char *str)
> +/* Create a new event directory structure for a trace directory. */
> +static void
> +__trace_add_event_dirs(struct trace_array *tr)
>  {
> -	strlcpy(bootup_event_buf, str, COMMAND_LINE_SIZE);
> -	ring_buffer_expanded = 1;
> -	tracing_selftest_disabled = 1;
> +	struct ftrace_module_file_ops *file_ops = NULL;
> +	struct ftrace_event_call *call;
> +	int ret;
> +
> +	list_for_each_entry(call, &ftrace_events, list) {
> +		if (call->mod) {
> +			/*
> +			 * Directories for events by modules need to
> +			 * keep module ref counts when opened (as we don't
> +			 * want the module to disappear when reading one
> +			 * of these files). The file_ops keep account of
> +			 * the module ref count.
> +			 */
> +			file_ops = find_ftrace_file_ops(file_ops, call->mod);
> +			if (!file_ops)
> +				continue; /* Warn? */
> +			ret = __trace_add_new_mod_event(call, tr, file_ops);
> +			if (ret < 0)
> +				pr_warning("Could not create directory for event %s\n",
> +					   call->name);
> +			continue;
> +		}
> +		ret = __trace_add_new_event(call, tr,
> +					    &ftrace_event_id_fops,
> +					    &ftrace_enable_fops,
> +					    &ftrace_event_filter_fops,
> +					    &ftrace_event_format_fops);
> +		if (ret < 0)
> +			pr_warning("Could not create directory for event %s\n",
> +				   call->name);
> +	}
> +}
> +
> +#ifdef CONFIG_DYNAMIC_FTRACE
> +
> +/* Avoid typos */
> +#define ENABLE_EVENT_STR	"enable_event"
> +#define DISABLE_EVENT_STR	"disable_event"
> +
> +struct event_probe_data {
> +	struct ftrace_event_file	*file;
> +	unsigned long			count;
> +	int				ref;
> +	bool				enable;
> +};
> +
> +static struct ftrace_event_file *
> +find_event_file(struct trace_array *tr, const char *system,  const char *event)
> +{
> +	struct ftrace_event_file *file;
> +	struct ftrace_event_call *call;
> +
> +	list_for_each_entry(file, &tr->events, list) {
> +
> +		call = file->event_call;
> +
> +		if (!call->name || !call->class || !call->class->reg)
> +			continue;
> +
> +		if (call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)
> +			continue;
> +
> +		if (strcmp(event, call->name) == 0 &&
> +		    strcmp(system, call->class->system) == 0)
> +			return file;
> +	}
> +	return NULL;
> +}
> +
> +static void
> +event_enable_probe(unsigned long ip, unsigned long parent_ip, void **_data)
> +{
> +	struct event_probe_data **pdata = (struct event_probe_data **)_data;
> +	struct event_probe_data *data = *pdata;
> +
> +	if (!data)
> +		return;
> +
> +	if (data->enable)
> +		clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &data->file->flags);
> +	else
> +		set_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &data->file->flags);
> +}
> +
> +static void
> +event_enable_count_probe(unsigned long ip, unsigned long parent_ip, void **_data)
> +{
> +	struct event_probe_data **pdata = (struct event_probe_data **)_data;
> +	struct event_probe_data *data = *pdata;
> +
> +	if (!data)
> +		return;
> +
> +	if (!data->count)
> +		return;
> +
> +	/* Skip if the event is in a state we want to switch to */
> +	if (data->enable == !(data->file->flags & FTRACE_EVENT_FL_SOFT_DISABLED))
> +		return;
> +
> +	if (data->count != -1)
> +		(data->count)--;
> +
> +	event_enable_probe(ip, parent_ip, _data);
> +}
> +
> +static int
> +event_enable_print(struct seq_file *m, unsigned long ip,
> +		      struct ftrace_probe_ops *ops, void *_data)
> +{
> +	struct event_probe_data *data = _data;
> +
> +	seq_printf(m, "%ps:", (void *)ip);
> +
> +	seq_printf(m, "%s:%s:%s",
> +		   data->enable ? ENABLE_EVENT_STR : DISABLE_EVENT_STR,
> +		   data->file->event_call->class->system,
> +		   data->file->event_call->name);
> +
> +	if (data->count == -1)
> +		seq_printf(m, ":unlimited\n");
> +	else
> +		seq_printf(m, ":count=%ld\n", data->count);
> +
> +	return 0;
> +}
> +
> +static int
> +event_enable_init(struct ftrace_probe_ops *ops, unsigned long ip,
> +		  void **_data)
> +{
> +	struct event_probe_data **pdata = (struct event_probe_data **)_data;
> +	struct event_probe_data *data = *pdata;
> +
> +	data->ref++;
> +	return 0;
> +}
> +
> +static void
> +event_enable_free(struct ftrace_probe_ops *ops, unsigned long ip,
> +		  void **_data)
> +{
> +	struct event_probe_data **pdata = (struct event_probe_data **)_data;
> +	struct event_probe_data *data = *pdata;
> +
> +	if (WARN_ON_ONCE(data->ref <= 0))
> +		return;
> +
> +	data->ref--;
> +	if (!data->ref) {
> +		/* Remove the SOFT_MODE flag */
> +		__ftrace_event_enable_disable(data->file, 0, 1);
> +		module_put(data->file->event_call->mod);
> +		kfree(data);
> +	}
> +	*pdata = NULL;
> +}
> +
> +static struct ftrace_probe_ops event_enable_probe_ops = {
> +	.func			= event_enable_probe,
> +	.print			= event_enable_print,
> +	.init			= event_enable_init,
> +	.free			= event_enable_free,
> +};
> +
> +static struct ftrace_probe_ops event_enable_count_probe_ops = {
> +	.func			= event_enable_count_probe,
> +	.print			= event_enable_print,
> +	.init			= event_enable_init,
> +	.free			= event_enable_free,
> +};
> +
> +static struct ftrace_probe_ops event_disable_probe_ops = {
> +	.func			= event_enable_probe,
> +	.print			= event_enable_print,
> +	.init			= event_enable_init,
> +	.free			= event_enable_free,
> +};
> +
> +static struct ftrace_probe_ops event_disable_count_probe_ops = {
> +	.func			= event_enable_count_probe,
> +	.print			= event_enable_print,
> +	.init			= event_enable_init,
> +	.free			= event_enable_free,
> +};
> +
> +static int
> +event_enable_func(struct ftrace_hash *hash,
> +		  char *glob, char *cmd, char *param, int enabled)
> +{
> +	struct trace_array *tr = top_trace_array();
> +	struct ftrace_event_file *file;
> +	struct ftrace_probe_ops *ops;
> +	struct event_probe_data *data;
> +	const char *system;
> +	const char *event;
> +	char *number;
> +	bool enable;
> +	int ret;
> +
> +	/* hash funcs only work with set_ftrace_filter */
> +	if (!enabled)
> +		return -EINVAL;
> +
> +	if (!param)
> +		return -EINVAL;
> +
> +	system = strsep(&param, ":");
> +	if (!param)
> +		return -EINVAL;
> +
> +	event = strsep(&param, ":");
> +
> +	mutex_lock(&event_mutex);
> +
> +	ret = -EINVAL;
> +	file = find_event_file(tr, system, event);
> +	if (!file)
> +		goto out;
> +
> +	enable = strcmp(cmd, ENABLE_EVENT_STR) == 0;
> +
> +	if (enable)
> +		ops = param ? &event_enable_count_probe_ops : &event_enable_probe_ops;
> +	else
> +		ops = param ? &event_disable_count_probe_ops : &event_disable_probe_ops;
> +
> +	if (glob[0] == '!') {
> +		unregister_ftrace_function_probe_func(glob+1, ops);
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	ret = -ENOMEM;
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		goto out;
> +
> +	data->enable = enable;
> +	data->count = -1;
> +	data->file = file;
> +
> +	if (!param)
> +		goto out_reg;
> +
> +	number = strsep(&param, ":");
> +
> +	ret = -EINVAL;
> +	if (!strlen(number))
> +		goto out_free;
> +
> +	/*
> +	 * We use the callback data field (which is a pointer)
> +	 * as our counter.
> +	 */
> +	ret = kstrtoul(number, 0, &data->count);
> +	if (ret)
> +		goto out_free;
> +
> + out_reg:
> +	/* Don't let event modules unload while probe registered */
> +	ret = try_module_get(file->event_call->mod);
> +	if (!ret)
> +		goto out_free;
> +
> +	ret = __ftrace_event_enable_disable(file, 1, 1);
> +	if (ret < 0)
> +		goto out_put;
> +	ret = register_ftrace_function_probe(glob, ops, data);
> +	if (!ret)
> +		goto out_disable;
> + out:
> +	mutex_unlock(&event_mutex);
> +	return ret;
> +
> + out_disable:
> +	__ftrace_event_enable_disable(file, 0, 1);
> + out_put:
> +	module_put(file->event_call->mod);
> + out_free:
> +	kfree(data);
> +	goto out;
> +}
> +
> +static struct ftrace_func_command event_enable_cmd = {
> +	.name			= ENABLE_EVENT_STR,
> +	.func			= event_enable_func,
> +};
> +
> +static struct ftrace_func_command event_disable_cmd = {
> +	.name			= DISABLE_EVENT_STR,
> +	.func			= event_enable_func,
> +};
> +
> +static __init int register_event_cmds(void)
> +{
> +	int ret;
> +
> +	ret = register_ftrace_command(&event_enable_cmd);
> +	if (WARN_ON(ret < 0))
> +		return ret;
> +	ret = register_ftrace_command(&event_disable_cmd);
> +	if (WARN_ON(ret < 0))
> +		unregister_ftrace_command(&event_enable_cmd);
> +	return ret;
> +}
> +#else
> +static inline int register_event_cmds(void) { return 0; }
> +#endif /* CONFIG_DYNAMIC_FTRACE */
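
Taken together, the pieces above let a function probe toggle an event's
SOFT_DISABLED bit, optionally limited by a count, with -1 meaning unlimited.
Judging from the system:event:count parsing in event_enable_func(), the
user-side form should be along the lines of
echo 'schedule:enable_event:sched:sched_switch:1' > set_ftrace_filter.
The sketch below models only the probe side in plain userspace C; event_file,
probe_data and fire_probe are invented names, and the real code additionally
handles module refcounting and the SOFT_MODE bookkeeping:

#include <stdbool.h>
#include <stdio.h>

struct event_file {
    const char *name;
    bool soft_disabled;
};

struct probe_data {
    struct event_file *file;
    long count;                 /* -1 == unlimited */
    bool enable;
};

/* Called each time the traced function is hit. */
static void fire_probe(struct probe_data *data)
{
    if (data->count == 0)
        return;

    /* Skip if the event is already in the requested state. */
    if (data->enable == !data->file->soft_disabled)
        return;

    if (data->count != -1)
        data->count--;

    data->file->soft_disabled = !data->enable;
}

int main(void)
{
    struct event_file ev = { "sched:sched_switch", true };
    struct probe_data p = { &ev, 1, true };     /* enable once */

    fire_probe(&p);     /* enables the event, budget drops to 0 */
    fire_probe(&p);     /* budget spent, nothing changes */
    printf("%s soft_disabled=%d count=%ld\n", ev.name, ev.soft_disabled, p.count);
    return 0;
}

As in event_enable_count_probe(), hits that would not change the event's state
do not consume the budget.
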
> +
> +/*
> + * The top level array has already had its ftrace_event_file
> + * descriptors created in order to allow for early events to
> + * be recorded. This function is called after the debugfs has been
> + * initialized, and we now have to create the files associated
> + * with the events.
> + */
> +static __init void
> +__trace_early_add_event_dirs(struct trace_array *tr)
> +{
> +	struct ftrace_event_file *file;
> +	int ret;
> +
> +
> +	list_for_each_entry(file, &tr->events, list) {
> +		ret = event_create_dir(tr->event_dir, file,
> +				       &ftrace_event_id_fops,
> +				       &ftrace_enable_fops,
> +				       &ftrace_event_filter_fops,
> +				       &ftrace_event_format_fops);
> +		if (ret < 0)
> +			pr_warning("Could not create directory for event %s\n",
> +				   file->event_call->name);
> +	}
> +}
> +
> +/*
> + * For early boot up, the top trace array needs to have
> + * a list of events that can be enabled. This must be done before
> + * the filesystem is set up in order to allow events to be traced
> + * early.
> + */
> +static __init void
> +__trace_early_add_events(struct trace_array *tr)
> +{
> +	struct ftrace_event_call *call;
> +	int ret;
> +
> +	list_for_each_entry(call, &ftrace_events, list) {
> +		/* Early boot up should not have any modules loaded */
> +		if (WARN_ON_ONCE(call->mod))
> +			continue;
> +
> +		ret = __trace_early_add_new_event(call, tr);
> +		if (ret < 0)
> +			pr_warning("Could not create early event %s\n",
> +				   call->name);
> +	}
> +}
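
__trace_early_add_events() and __trace_early_add_event_dirs() form a two-phase
scheme: descriptors are allocated while debugfs is still unavailable, so events
can already be enabled, and the directories are attached once the filesystem
exists. A simplified userspace sketch of that ordering, using invented names
(event_desc, add_event_early, add_event_dirs) rather than the kernel types:

#include <stdio.h>
#include <stdlib.h>

struct event_desc {
    const char *name;
    int has_dir;                /* set once the "filesystem" is up */
    struct event_desc *next;
};

static struct event_desc *events;

/* Phase 1 (early boot): record the event so it can be enabled, no dir yet. */
static void add_event_early(const char *name)
{
    struct event_desc *d = calloc(1, sizeof(*d));

    if (!d)
        return;
    d->name = name;
    d->next = events;
    events = d;
}

/* Phase 2 (filesystem ready): create the directory for each descriptor. */
static void add_event_dirs(void)
{
    struct event_desc *d;

    for (d = events; d; d = d->next) {
        d->has_dir = 1;
        printf("created events/%s/\n", d->name);
    }
}

int main(void)
{
    add_event_early("sched_switch");
    add_event_early("sched_wakeup");
    /* ... events could already be enabled here ... */
    add_event_dirs();

    while (events) {
        struct event_desc *d = events;
        events = d->next;
        free(d);
    }
    return 0;
}
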
> +
> +/* Remove the event directory structure for a trace directory. */
> +static void
> +__trace_remove_event_dirs(struct trace_array *tr)
> +{
> +	struct ftrace_event_file *file, *next;
> +
> +	list_for_each_entry_safe(file, next, &tr->events, list) {
> +		list_del(&file->list);
> +		debugfs_remove_recursive(file->dir);
> +		remove_subsystem(file->system);
> +		kmem_cache_free(file_cachep, file);
> +	}
> +}
> +
> +static void
> +__add_event_to_tracers(struct ftrace_event_call *call,
> +		       struct ftrace_module_file_ops *file_ops)
> +{
> +	struct trace_array *tr;
> +
> +	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
> +		if (file_ops)
> +			__trace_add_new_mod_event(call, tr, file_ops);
> +		else
> +			__trace_add_new_event(call, tr,
> +					      &ftrace_event_id_fops,
> +					      &ftrace_enable_fops,
> +					      &ftrace_event_filter_fops,
> +					      &ftrace_event_format_fops);
> +	}
> +}
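
__add_event_to_tracers() is what makes a dynamically registered event show up
in every trace directory that already exists, not just the top level one. A
rough model of that broadcast, with tracer_dir and add_event_to_tracers as
hypothetical stand-ins for the trace_array list:

#include <stdio.h>

#define MAX_EVENTS 8

struct tracer_dir {
    const char *name;
    const char *events[MAX_EVENTS];
    int n_events;
};

/* Give every existing trace directory its own file for the new event. */
static void add_event_to_tracers(struct tracer_dir *trs, int n_tr,
                                 const char *event)
{
    for (int i = 0; i < n_tr; i++)
        trs[i].events[trs[i].n_events++] = event;
}

int main(void)
{
    struct tracer_dir trs[] = {
        { "top level", { 0 }, 0 },
        { "instances/foo", { 0 }, 0 },
    };

    add_event_to_tracers(trs, 2, "sched:sched_switch");
    for (int i = 0; i < 2; i++)
        printf("%s has %d event file(s)\n", trs[i].name, trs[i].n_events);
    return 0;
}
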
> +
> +static struct notifier_block trace_module_nb = {
> +	.notifier_call = trace_module_notify,
> +	.priority = 0,
> +};
> +
> +extern struct ftrace_event_call *__start_ftrace_events[];
> +extern struct ftrace_event_call *__stop_ftrace_events[];
> +
> +static char bootup_event_buf[COMMAND_LINE_SIZE] __initdata;
> +
> +static __init int setup_trace_event(char *str)
> +{
> +	strlcpy(bootup_event_buf, str, COMMAND_LINE_SIZE);
> +	ring_buffer_expanded = true;
> +	tracing_selftest_disabled = true;
>  
>  	return 1;
>  }
>  __setup("trace_event=", setup_trace_event);
>  
> +/* Expects to have event_mutex held when called */
> +static int
> +create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
> +{
> +	struct dentry *d_events;
> +	struct dentry *entry;
> +
> +	entry = debugfs_create_file("set_event", 0644, parent,
> +				    tr, &ftrace_set_event_fops);
> +	if (!entry) {
> +		pr_warning("Could not create debugfs 'set_event' entry\n");
> +		return -ENOMEM;
> +	}
> +
> +	d_events = debugfs_create_dir("events", parent);
> +	if (!d_events) {
> +		pr_warning("Could not create debugfs 'events' directory\n");
> +		return -ENOMEM;
> +	}
> +
> +	/* ring buffer internal formats */
> +	trace_create_file("header_page", 0444, d_events,
> +			  ring_buffer_print_page_header,
> +			  &ftrace_show_header_fops);
> +
> +	trace_create_file("header_event", 0444, d_events,
> +			  ring_buffer_print_entry_header,
> +			  &ftrace_show_header_fops);
> +
> +	trace_create_file("enable", 0644, d_events,
> +			  tr, &ftrace_tr_enable_fops);
> +
> +	tr->event_dir = d_events;
> +
> +	return 0;
> +}
> +
> +/**
> + * event_trace_add_tracer - add an instance of a trace_array to events
> + * @parent: The parent dentry to place the files/directories for events in
> + * @tr: The trace array associated with these events
> + *
> + * When a new instance is created, it needs to set up its events
> + * directory, as well as other files associated with events. It also
> + * creates the event hierarchy in the @parent/events directory.
> + *
> + * Returns 0 on success.
> + */
> +int event_trace_add_tracer(struct dentry *parent, struct trace_array *tr)
> +{
> +	int ret;
> +
> +	mutex_lock(&event_mutex);
> +
> +	ret = create_event_toplevel_files(parent, tr);
> +	if (ret)
> +		goto out_unlock;
> +
> +	down_write(&trace_event_sem);
> +	__trace_add_event_dirs(tr);
> +	up_write(&trace_event_sem);
> +
> + out_unlock:
> +	mutex_unlock(&event_mutex);
> +
> +	return ret;
> +}
> +
> +/*
> + * The top trace array already had its file descriptors created.
> + * Now the files themselves need to be created.
> + */
> +static __init int
> +early_event_add_tracer(struct dentry *parent, struct trace_array *tr)
> +{
> +	int ret;
> +
> +	mutex_lock(&event_mutex);
> +
> +	ret = create_event_toplevel_files(parent, tr);
> +	if (ret)
> +		goto out_unlock;
> +
> +	down_write(&trace_event_sem);
> +	__trace_early_add_event_dirs(tr);
> +	up_write(&trace_event_sem);
> +
> + out_unlock:
> +	mutex_unlock(&event_mutex);
> +
> +	return ret;
> +}
> +
> +int event_trace_del_tracer(struct trace_array *tr)
> +{
> +	/* Disable any running events */
> +	__ftrace_set_clr_event(tr, NULL, NULL, NULL, 0);
> +
> +	mutex_lock(&event_mutex);
> +
> +	down_write(&trace_event_sem);
> +	__trace_remove_event_dirs(tr);
> +	debugfs_remove_recursive(tr->event_dir);
> +	up_write(&trace_event_sem);
> +
> +	tr->event_dir = NULL;
> +
> +	mutex_unlock(&event_mutex);
> +
> +	return 0;
> +}
> +
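
event_trace_add_tracer() and event_trace_del_tracer() bracket the life of an
instance's events directory, and both take event_mutex before the
trace_event_sem write lock. A rough pthreads model of that pairing and lock
order; instance, add_instance_events and del_instance_events are invented
names, and the bodies only pretend to create and remove directories:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t event_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t event_sem = PTHREAD_RWLOCK_INITIALIZER;

struct instance {
    int n_dirs;
};

static int add_instance_events(struct instance *tr)
{
    pthread_mutex_lock(&event_mutex);       /* outer lock, taken first */

    /* toplevel files (set_event, enable, ...) would be created here */

    pthread_rwlock_wrlock(&event_sem);      /* write lock while adding dirs */
    tr->n_dirs = 3;                         /* pretend per-event dirs exist now */
    pthread_rwlock_unlock(&event_sem);

    pthread_mutex_unlock(&event_mutex);
    return 0;
}

static int del_instance_events(struct instance *tr)
{
    /* running events would be disabled first, then torn down in the same order */
    pthread_mutex_lock(&event_mutex);
    pthread_rwlock_wrlock(&event_sem);
    tr->n_dirs = 0;
    pthread_rwlock_unlock(&event_sem);
    pthread_mutex_unlock(&event_mutex);
    return 0;
}

int main(void)
{
    struct instance tr = { 0 };

    add_instance_events(&tr);
    printf("dirs after add: %d\n", tr.n_dirs);
    del_instance_events(&tr);
    printf("dirs after del: %d\n", tr.n_dirs);
    return 0;
}
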
> +static __init int event_trace_memsetup(void)
> +{
> +	field_cachep = KMEM_CACHE(ftrace_event_field, SLAB_PANIC);
> +	file_cachep = KMEM_CACHE(ftrace_event_file, SLAB_PANIC);
> +	return 0;
> +}
> +
>  static __init int event_trace_enable(void)
>  {
> +	struct trace_array *tr = top_trace_array();
>  	struct ftrace_event_call **iter, *call;
>  	char *buf = bootup_event_buf;
>  	char *token;
> @@ -1486,6 +2339,14 @@ static __init int event_trace_enable(void)
>  			list_add(&call->list, &ftrace_events);
>  	}
>  
> +	/*
> +	 * We need the top trace array to have a working set of trace
> +	 * points at early init, before the debug files and directories
> +	 * are created. Create the file entries now, and attach them
> +	 * to the actual file dentries later.
> +	 */
> +	__trace_early_add_events(tr);
> +
>  	while (true) {
>  		token = strsep(&buf, ",");
>  
> @@ -1494,73 +2355,43 @@ static __init int event_trace_enable(void)
>  		if (!*token)
>  			continue;
>  
> -		ret = ftrace_set_clr_event(token, 1);
> +		ret = ftrace_set_clr_event(tr, token, 1);
>  		if (ret)
>  			pr_warn("Failed to enable trace event: %s\n", token);
>  	}
>  
>  	trace_printk_start_comm();
>  
> +	register_event_cmds();
> +
>  	return 0;
>  }
>  
>  static __init int event_trace_init(void)
>  {
> -	struct ftrace_event_call *call;
> +	struct trace_array *tr;
>  	struct dentry *d_tracer;
>  	struct dentry *entry;
> -	struct dentry *d_events;
>  	int ret;
>  
> +	tr = top_trace_array();
> +
>  	d_tracer = tracing_init_dentry();
>  	if (!d_tracer)
>  		return 0;
>  
>  	entry = debugfs_create_file("available_events", 0444, d_tracer,
> -				    NULL, &ftrace_avail_fops);
> +				    tr, &ftrace_avail_fops);
>  	if (!entry)
>  		pr_warning("Could not create debugfs "
>  			   "'available_events' entry\n");
>  
> -	entry = debugfs_create_file("set_event", 0644, d_tracer,
> -				    NULL, &ftrace_set_event_fops);
> -	if (!entry)
> -		pr_warning("Could not create debugfs "
> -			   "'set_event' entry\n");
> -
> -	d_events = event_trace_events_dir();
> -	if (!d_events)
> -		return 0;
> -
> -	/* ring buffer internal formats */
> -	trace_create_file("header_page", 0444, d_events,
> -			  ring_buffer_print_page_header,
> -			  &ftrace_show_header_fops);
> -
> -	trace_create_file("header_event", 0444, d_events,
> -			  ring_buffer_print_entry_header,
> -			  &ftrace_show_header_fops);
> -
> -	trace_create_file("enable", 0644, d_events,
> -			  NULL, &ftrace_system_enable_fops);
> -
>  	if (trace_define_common_fields())
>  		pr_warning("tracing: Failed to allocate common fields");
>  
> -	/*
> -	 * Early initialization already enabled ftrace event.
> -	 * Now it's only necessary to create the event directory.
> -	 */
> -	list_for_each_entry(call, &ftrace_events, list) {
> -
> -		ret = event_create_dir(call, d_events,
> -				       &ftrace_event_id_fops,
> -				       &ftrace_enable_fops,
> -				       &ftrace_event_filter_fops,
> -				       &ftrace_event_format_fops);
> -		if (ret < 0)
> -			event_remove(call);
> -	}
> +	ret = early_event_add_tracer(d_tracer, tr);
> +	if (ret)
> +		return ret;
>  
>  	ret = register_module_notifier(&trace_module_nb);
>  	if (ret)
> @@ -1568,6 +2399,7 @@ static __init int event_trace_init(void)
>  
>  	return 0;
>  }
> +early_initcall(event_trace_memsetup);
>  core_initcall(event_trace_enable);
>  fs_initcall(event_trace_init);
>  
> @@ -1627,13 +2459,20 @@ static __init void event_test_stuff(void)
>   */
>  static __init void event_trace_self_tests(void)
>  {
> +	struct ftrace_subsystem_dir *dir;
> +	struct ftrace_event_file *file;
>  	struct ftrace_event_call *call;
>  	struct event_subsystem *system;
> +	struct trace_array *tr;
>  	int ret;
>  
> +	tr = top_trace_array();
> +
>  	pr_info("Running tests on trace events:\n");
>  
> -	list_for_each_entry(call, &ftrace_events, list) {
> +	list_for_each_entry(file, &tr->events, list) {
> +
> +		call = file->event_call;
>  
>  		/* Only test those that have a probe */
>  		if (!call->class || !call->class->probe)
> @@ -1657,15 +2496,15 @@ static __init void event_trace_self_tests(void)
>  		 * If an event is already enabled, someone is using
>  		 * it and the self test should not be on.
>  		 */
> -		if (call->flags & TRACE_EVENT_FL_ENABLED) {
> +		if (file->flags & FTRACE_EVENT_FL_ENABLED) {
>  			pr_warning("Enabled event during self test!\n");
>  			WARN_ON_ONCE(1);
>  			continue;
>  		}
>  
> -		ftrace_event_enable_disable(call, 1);
> +		ftrace_event_enable_disable(file, 1);
>  		event_test_stuff();
> -		ftrace_event_enable_disable(call, 0);
> +		ftrace_event_enable_disable(file, 0);
>  
>  		pr_cont("OK\n");
>  	}
> @@ -1674,7 +2513,9 @@ static __init void event_trace_self_tests(void)
>  
>  	pr_info("Running tests on trace event systems:\n");
>  
> -	list_for_each_entry(system, &event_subsystems, list) {
> +	list_for_each_entry(dir, &tr->systems, list) {
> +
> +		system = dir->subsystem;
>  
>  		/* the ftrace system is special, skip it */
>  		if (strcmp(system->name, "ftrace") == 0)
> @@ -1682,7 +2523,7 @@ static __init void event_trace_self_tests(void)
>  
>  		pr_info("Testing event system %s: ", system->name);
>  
> -		ret = __ftrace_set_clr_event(NULL, system->name, NULL, 1);
> +		ret = __ftrace_set_clr_event(tr, NULL, system->name, NULL, 1);
>  		if (WARN_ON_ONCE(ret)) {
>  			pr_warning("error enabling system %s\n",
>  				   system->name);
> @@ -1691,7 +2532,7 @@ static __init void event_trace_self_tests(void)
>  
>  		event_test_stuff();
>  
> -		ret = __ftrace_set_clr_event(NULL, system->name, NULL, 0);
> +		ret = __ftrace_set_clr_event(tr, NULL, system->name, NULL, 0);
>  		if (WARN_ON_ONCE(ret)) {
>  			pr_warning("error disabling system %s\n",
>  				   system->name);
> @@ -1706,7 +2547,7 @@ static __init void event_trace_self_tests(void)
>  	pr_info("Running tests on all trace events:\n");
>  	pr_info("Testing all events: ");
>  
> -	ret = __ftrace_set_clr_event(NULL, NULL, NULL, 1);
> +	ret = __ftrace_set_clr_event(tr, NULL, NULL, NULL, 1);
>  	if (WARN_ON_ONCE(ret)) {
>  		pr_warning("error enabling all events\n");
>  		return;
> @@ -1715,7 +2556,7 @@ static __init void event_trace_self_tests(void)
>  	event_test_stuff();
>  
>  	/* reset sysname */
> -	ret = __ftrace_set_clr_event(NULL, NULL, NULL, 0);
> +	ret = __ftrace_set_clr_event(tr, NULL, NULL, NULL, 0);
>  	if (WARN_ON_ONCE(ret)) {
>  		pr_warning("error disabling all events\n");
>  		return;
> diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
> index e5b0ca8..a636117 100644
> --- a/kernel/trace/trace_events_filter.c
> +++ b/kernel/trace/trace_events_filter.c
> @@ -658,33 +658,6 @@ void print_subsystem_event_filter(struct event_subsystem *system,
>  	mutex_unlock(&event_mutex);
>  }
>  
> -static struct ftrace_event_field *
> -__find_event_field(struct list_head *head, char *name)
> -{
> -	struct ftrace_event_field *field;
> -
> -	list_for_each_entry(field, head, link) {
> -		if (!strcmp(field->name, name))
> -			return field;
> -	}
> -
> -	return NULL;
> -}
> -
> -static struct ftrace_event_field *
> -find_event_field(struct ftrace_event_call *call, char *name)
> -{
> -	struct ftrace_event_field *field;
> -	struct list_head *head;
> -
> -	field = __find_event_field(&ftrace_common_fields, name);
> -	if (field)
> -		return field;
> -
> -	head = trace_get_fields(call);
> -	return __find_event_field(head, name);
> -}
> -
>  static int __alloc_pred_stack(struct pred_stack *stack, int n_preds)
>  {
>  	stack->preds = kcalloc(n_preds + 1, sizeof(*stack->preds), GFP_KERNEL);
> @@ -1337,7 +1310,7 @@ static struct filter_pred *create_pred(struct filter_parse_state *ps,
>  		return NULL;
>  	}
>  
> -	field = find_event_field(call, operand1);
> +	field = trace_find_event_field(call, operand1);
>  	if (!field) {
>  		parse_error(ps, FILT_ERR_FIELD_NOT_FOUND, 0);
>  		return NULL;
> @@ -1907,16 +1880,17 @@ out_unlock:
>  	return err;
>  }
>  
> -int apply_subsystem_event_filter(struct event_subsystem *system,
> +int apply_subsystem_event_filter(struct ftrace_subsystem_dir *dir,
>  				 char *filter_string)
>  {
> +	struct event_subsystem *system = dir->subsystem;
>  	struct event_filter *filter;
>  	int err = 0;
>  
>  	mutex_lock(&event_mutex);
>  
>  	/* Make sure the system still has events */
> -	if (!system->nr_events) {
> +	if (!dir->nr_events) {
>  		err = -ENODEV;
>  		goto out_unlock;
>  	}
> diff --git a/kernel/trace/trace_export.c b/kernel/trace/trace_export.c
> index e039906..d21a746 100644
> --- a/kernel/trace/trace_export.c
> +++ b/kernel/trace/trace_export.c
> @@ -129,7 +129,7 @@ static void __always_unused ____ftrace_check_##name(void)		\
>  
>  #undef FTRACE_ENTRY
>  #define FTRACE_ENTRY(name, struct_name, id, tstruct, print, filter)	\
> -int									\
> +static int __init							\
>  ftrace_define_fields_##name(struct ftrace_event_call *event_call)	\
>  {									\
>  	struct struct_name field;					\
> @@ -168,7 +168,7 @@ ftrace_define_fields_##name(struct ftrace_event_call *event_call)	\
>  #define FTRACE_ENTRY_REG(call, struct_name, etype, tstruct, print, filter,\
>  			 regfn)						\
>  									\
> -struct ftrace_event_class event_class_ftrace_##call = {			\
> +struct ftrace_event_class __refdata event_class_ftrace_##call = {	\
>  	.system			= __stringify(TRACE_SYSTEM),		\
>  	.define_fields		= ftrace_define_fields_##call,		\
>  	.fields			= LIST_HEAD_INIT(event_class_ftrace_##call.fields),\
> diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
> index 6011525..c4d6d71 100644
> --- a/kernel/trace/trace_functions.c
> +++ b/kernel/trace/trace_functions.c
> @@ -28,7 +28,7 @@ static void tracing_stop_function_trace(void);
>  static int function_trace_init(struct trace_array *tr)
>  {
>  	func_trace = tr;
> -	tr->cpu = get_cpu();
> +	tr->trace_buffer.cpu = get_cpu();
>  	put_cpu();
>  
>  	tracing_start_cmdline_record();
> @@ -44,7 +44,7 @@ static void function_trace_reset(struct trace_array *tr)
>  
>  static void function_trace_start(struct trace_array *tr)
>  {
> -	tracing_reset_online_cpus(tr);
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  }
>  
>  /* Our option */
> @@ -76,7 +76,7 @@ function_trace_call(unsigned long ip, unsigned long parent_ip,
>  		goto out;
>  
>  	cpu = smp_processor_id();
> -	data = tr->data[cpu];
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  	if (!atomic_read(&data->disabled)) {
>  		local_save_flags(flags);
>  		trace_function(tr, ip, parent_ip, flags, pc);
> @@ -107,7 +107,7 @@ function_stack_trace_call(unsigned long ip, unsigned long parent_ip,
>  	 */
>  	local_irq_save(flags);
>  	cpu = raw_smp_processor_id();
> -	data = tr->data[cpu];
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  	disabled = atomic_inc_return(&data->disabled);
>  
>  	if (likely(disabled == 1)) {
> @@ -214,66 +214,89 @@ static struct tracer function_trace __read_mostly =
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE
> -static void
> -ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
> +static int update_count(void **data)
>  {
> -	long *count = (long *)data;
> -
> -	if (tracing_is_on())
> -		return;
> +	unsigned long *count = (long *)data;
>  
>  	if (!*count)
> -		return;
> +		return 0;
>  
>  	if (*count != -1)
>  		(*count)--;
>  
> -	tracing_on();
> +	return 1;
>  }
>  
>  static void
> -ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
> +ftrace_traceon_count(unsigned long ip, unsigned long parent_ip, void **data)
>  {
> -	long *count = (long *)data;
> +	if (tracing_is_on())
> +		return;
> +
> +	if (update_count(data))
> +		tracing_on();
> +}
>  
> +static void
> +ftrace_traceoff_count(unsigned long ip, unsigned long parent_ip, void **data)
> +{
>  	if (!tracing_is_on())
>  		return;
>  
> -	if (!*count)
> +	if (update_count(data))
> +		tracing_off();
> +}
> +
> +static void
> +ftrace_traceon(unsigned long ip, unsigned long parent_ip, void **data)
> +{
> +	if (tracing_is_on())
>  		return;
>  
> -	if (*count != -1)
> -		(*count)--;
> +	tracing_on();
> +}
> +
> +static void
> +ftrace_traceoff(unsigned long ip, unsigned long parent_ip, void **data)
> +{
> +	if (!tracing_is_on())
> +		return;
>  
>  	tracing_off();
>  }
>  
> -static int
> -ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
> -			 struct ftrace_probe_ops *ops, void *data);
> +/*
> + * Skip 4:
> + *   ftrace_stacktrace()
> + *   function_trace_probe_call()
> + *   ftrace_ops_list_func()
> + *   ftrace_call()
> + */
> +#define STACK_SKIP 4
>  
> -static struct ftrace_probe_ops traceon_probe_ops = {
> -	.func			= ftrace_traceon,
> -	.print			= ftrace_trace_onoff_print,
> -};
> +static void
> +ftrace_stacktrace(unsigned long ip, unsigned long parent_ip, void **data)
> +{
> +	trace_dump_stack(STACK_SKIP);
> +}
>  
> -static struct ftrace_probe_ops traceoff_probe_ops = {
> -	.func			= ftrace_traceoff,
> -	.print			= ftrace_trace_onoff_print,
> -};
> +static void
> +ftrace_stacktrace_count(unsigned long ip, unsigned long parent_ip, void **data)
> +{
> +	if (!tracing_is_on())
> +		return;
> +
> +	if (update_count(data))
> +		trace_dump_stack(STACK_SKIP);
> +}
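
update_count() is the one place the optional count of these triggers is
consumed: 0 means the budget is spent and -1 means unlimited, so traceon,
traceoff and the new stacktrace trigger can all share it. Going by the command
registration further down, the user-side form should be something like
echo 'some_function:stacktrace:5' > set_ftrace_filter. A small standalone
model of the helper and two counted triggers built on it (tracing_on here is
just a flag, not the kernel's tracing_on()):

#include <stdbool.h>
#include <stdio.h>

static bool tracing_on;

/* Shared budget check: -1 means unlimited, 0 means spent. */
static int update_count(long *count)
{
    if (!*count)
        return 0;
    if (*count != -1)
        (*count)--;
    return 1;
}

static void traceon_count(long *count)
{
    if (tracing_on)
        return;
    if (update_count(count))
        tracing_on = true;
}

static void traceoff_count(long *count)
{
    if (!tracing_on)
        return;
    if (update_count(count))
        tracing_on = false;
}

int main(void)
{
    long on_budget = 1, off_budget = -1;

    traceon_count(&on_budget);      /* fires, on_budget drops to 0 */
    traceoff_count(&off_budget);    /* unlimited, fires whenever tracing is on */
    traceon_count(&on_budget);      /* budget spent, tracing stays off */
    printf("tracing_on=%d on_budget=%ld\n", tracing_on, on_budget);
    return 0;
}
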
>  
>  static int
> -ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
> -			 struct ftrace_probe_ops *ops, void *data)
> +ftrace_probe_print(const char *name, struct seq_file *m,
> +		   unsigned long ip, void *data)
>  {
>  	long count = (long)data;
>  
> -	seq_printf(m, "%ps:", (void *)ip);
> -
> -	if (ops == &traceon_probe_ops)
> -		seq_printf(m, "traceon");
> -	else
> -		seq_printf(m, "traceoff");
> +	seq_printf(m, "%ps:%s", (void *)ip, name);
>  
>  	if (count == -1)
>  		seq_printf(m, ":unlimited\n");
> @@ -284,26 +307,61 @@ ftrace_trace_onoff_print(struct seq_file *m, unsigned long ip,
>  }
>  
>  static int
> -ftrace_trace_onoff_unreg(char *glob, char *cmd, char *param)
> +ftrace_traceon_print(struct seq_file *m, unsigned long ip,
> +			 struct ftrace_probe_ops *ops, void *data)
>  {
> -	struct ftrace_probe_ops *ops;
> -
> -	/* we register both traceon and traceoff to this callback */
> -	if (strcmp(cmd, "traceon") == 0)
> -		ops = &traceon_probe_ops;
> -	else
> -		ops = &traceoff_probe_ops;
> +	return ftrace_probe_print("traceon", m, ip, data);
> +}
>  
> -	unregister_ftrace_function_probe_func(glob, ops);
> +static int
> +ftrace_traceoff_print(struct seq_file *m, unsigned long ip,
> +			 struct ftrace_probe_ops *ops, void *data)
> +{
> +	return ftrace_probe_print("traceoff", m, ip, data);
> +}
>  
> -	return 0;
> +static int
> +ftrace_stacktrace_print(struct seq_file *m, unsigned long ip,
> +			struct ftrace_probe_ops *ops, void *data)
> +{
> +	return ftrace_probe_print("stacktrace", m, ip, data);
>  }
>  
> +static struct ftrace_probe_ops traceon_count_probe_ops = {
> +	.func			= ftrace_traceon_count,
> +	.print			= ftrace_traceon_print,
> +};
> +
> +static struct ftrace_probe_ops traceoff_count_probe_ops = {
> +	.func			= ftrace_traceoff_count,
> +	.print			= ftrace_traceoff_print,
> +};
> +
> +static struct ftrace_probe_ops stacktrace_count_probe_ops = {
> +	.func			= ftrace_stacktrace_count,
> +	.print			= ftrace_stacktrace_print,
> +};
> +
> +static struct ftrace_probe_ops traceon_probe_ops = {
> +	.func			= ftrace_traceon,
> +	.print			= ftrace_traceon_print,
> +};
> +
> +static struct ftrace_probe_ops traceoff_probe_ops = {
> +	.func			= ftrace_traceoff,
> +	.print			= ftrace_traceoff_print,
> +};
> +
> +static struct ftrace_probe_ops stacktrace_probe_ops = {
> +	.func			= ftrace_stacktrace,
> +	.print			= ftrace_stacktrace_print,
> +};
> +
>  static int
> -ftrace_trace_onoff_callback(struct ftrace_hash *hash,
> -			    char *glob, char *cmd, char *param, int enable)
> +ftrace_trace_probe_callback(struct ftrace_probe_ops *ops,
> +			    struct ftrace_hash *hash, char *glob,
> +			    char *cmd, char *param, int enable)
>  {
> -	struct ftrace_probe_ops *ops;
>  	void *count = (void *)-1;
>  	char *number;
>  	int ret;
> @@ -312,14 +370,10 @@ ftrace_trace_onoff_callback(struct ftrace_hash *hash,
>  	if (!enable)
>  		return -EINVAL;
>  
> -	if (glob[0] == '!')
> -		return ftrace_trace_onoff_unreg(glob+1, cmd, param);
> -
> -	/* we register both traceon and traceoff to this callback */
> -	if (strcmp(cmd, "traceon") == 0)
> -		ops = &traceon_probe_ops;
> -	else
> -		ops = &traceoff_probe_ops;
> +	if (glob[0] == '!') {
> +		unregister_ftrace_function_probe_func(glob+1, ops);
> +		return 0;
> +	}
>  
>  	if (!param)
>  		goto out_reg;
> @@ -343,6 +397,34 @@ ftrace_trace_onoff_callback(struct ftrace_hash *hash,
>  	return ret < 0 ? ret : 0;
>  }
>  
> +static int
> +ftrace_trace_onoff_callback(struct ftrace_hash *hash,
> +			    char *glob, char *cmd, char *param, int enable)
> +{
> +	struct ftrace_probe_ops *ops;
> +
> +	/* we register both traceon and traceoff to this callback */
> +	if (strcmp(cmd, "traceon") == 0)
> +		ops = param ? &traceon_count_probe_ops : &traceon_probe_ops;
> +	else
> +		ops = param ? &traceoff_count_probe_ops : &traceoff_probe_ops;
> +
> +	return ftrace_trace_probe_callback(ops, hash, glob, cmd,
> +					   param, enable);
> +}
> +
> +static int
> +ftrace_stacktrace_callback(struct ftrace_hash *hash,
> +			   char *glob, char *cmd, char *param, int enable)
> +{
> +	struct ftrace_probe_ops *ops;
> +
> +	ops = param ? &stacktrace_count_probe_ops : &stacktrace_probe_ops;
> +
> +	return ftrace_trace_probe_callback(ops, hash, glob, cmd,
> +					   param, enable);
> +}
> +
>  static struct ftrace_func_command ftrace_traceon_cmd = {
>  	.name			= "traceon",
>  	.func			= ftrace_trace_onoff_callback,
> @@ -353,6 +435,11 @@ static struct ftrace_func_command ftrace_traceoff_cmd = {
>  	.func			= ftrace_trace_onoff_callback,
>  };
>  
> +static struct ftrace_func_command ftrace_stacktrace_cmd = {
> +	.name			= "stacktrace",
> +	.func			= ftrace_stacktrace_callback,
> +};
> +
>  static int __init init_func_cmd_traceon(void)
>  {
>  	int ret;
> @@ -364,6 +451,12 @@ static int __init init_func_cmd_traceon(void)
>  	ret = register_ftrace_command(&ftrace_traceon_cmd);
>  	if (ret)
>  		unregister_ftrace_command(&ftrace_traceoff_cmd);
> +
> +	ret = register_ftrace_command(&ftrace_stacktrace_cmd);
> +	if (ret) {
> +		unregister_ftrace_command(&ftrace_traceoff_cmd);
> +		unregister_ftrace_command(&ftrace_traceon_cmd);
> +	}
>  	return ret;
>  }
>  #else
> diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
> index 39ada66..8388bc9 100644
> --- a/kernel/trace/trace_functions_graph.c
> +++ b/kernel/trace/trace_functions_graph.c
> @@ -218,7 +218,7 @@ int __trace_graph_entry(struct trace_array *tr,
>  {
>  	struct ftrace_event_call *call = &event_funcgraph_entry;
>  	struct ring_buffer_event *event;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	struct ftrace_graph_ent_entry *entry;
>  
>  	if (unlikely(__this_cpu_read(ftrace_cpu_disabled)))
> @@ -265,7 +265,7 @@ int trace_graph_entry(struct ftrace_graph_ent *trace)
>  
>  	local_irq_save(flags);
>  	cpu = raw_smp_processor_id();
> -	data = tr->data[cpu];
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  	disabled = atomic_inc_return(&data->disabled);
>  	if (likely(disabled == 1)) {
>  		pc = preempt_count();
> @@ -323,7 +323,7 @@ void __trace_graph_return(struct trace_array *tr,
>  {
>  	struct ftrace_event_call *call = &event_funcgraph_exit;
>  	struct ring_buffer_event *event;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	struct ftrace_graph_ret_entry *entry;
>  
>  	if (unlikely(__this_cpu_read(ftrace_cpu_disabled)))
> @@ -350,7 +350,7 @@ void trace_graph_return(struct ftrace_graph_ret *trace)
>  
>  	local_irq_save(flags);
>  	cpu = raw_smp_processor_id();
> -	data = tr->data[cpu];
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  	disabled = atomic_inc_return(&data->disabled);
>  	if (likely(disabled == 1)) {
>  		pc = preempt_count();
> @@ -560,9 +560,9 @@ get_return_for_leaf(struct trace_iterator *iter,
>  			 * We need to consume the current entry to see
>  			 * the next one.
>  			 */
> -			ring_buffer_consume(iter->tr->buffer, iter->cpu,
> +			ring_buffer_consume(iter->trace_buffer->buffer, iter->cpu,
>  					    NULL, NULL);
> -			event = ring_buffer_peek(iter->tr->buffer, iter->cpu,
> +			event = ring_buffer_peek(iter->trace_buffer->buffer, iter->cpu,
>  						 NULL, NULL);
>  		}
>  
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index 443b25b..b19d065 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -33,6 +33,7 @@ enum {
>  static int trace_type __read_mostly;
>  
>  static int save_flags;
> +static bool function_enabled;
>  
>  static void stop_irqsoff_tracer(struct trace_array *tr, int graph);
>  static int start_irqsoff_tracer(struct trace_array *tr, int graph);
> @@ -121,7 +122,7 @@ static int func_prolog_dec(struct trace_array *tr,
>  	if (!irqs_disabled_flags(*flags))
>  		return 0;
>  
> -	*data = tr->data[cpu];
> +	*data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  	disabled = atomic_inc_return(&(*data)->disabled);
>  
>  	if (likely(disabled == 1))
> @@ -175,7 +176,7 @@ static int irqsoff_set_flag(u32 old_flags, u32 bit, int set)
>  		per_cpu(tracing_cpu, cpu) = 0;
>  
>  	tracing_max_latency = 0;
> -	tracing_reset_online_cpus(irqsoff_trace);
> +	tracing_reset_online_cpus(&irqsoff_trace->trace_buffer);
>  
>  	return start_irqsoff_tracer(irqsoff_trace, set);
>  }
> @@ -380,7 +381,7 @@ start_critical_timing(unsigned long ip, unsigned long parent_ip)
>  	if (per_cpu(tracing_cpu, cpu))
>  		return;
>  
> -	data = tr->data[cpu];
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  
>  	if (unlikely(!data) || atomic_read(&data->disabled))
>  		return;
> @@ -418,7 +419,7 @@ stop_critical_timing(unsigned long ip, unsigned long parent_ip)
>  	if (!tracer_enabled)
>  		return;
>  
> -	data = tr->data[cpu];
> +	data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  
>  	if (unlikely(!data) ||
>  	    !data->critical_start || atomic_read(&data->disabled))
> @@ -528,15 +529,60 @@ void trace_preempt_off(unsigned long a0, unsigned long a1)
>  }
>  #endif /* CONFIG_PREEMPT_TRACER */
>  
> -static int start_irqsoff_tracer(struct trace_array *tr, int graph)
> +static int register_irqsoff_function(int graph, int set)
>  {
> -	int ret = 0;
> +	int ret;
>  
> -	if (!graph)
> -		ret = register_ftrace_function(&trace_ops);
> -	else
> +	/* 'set' is set if TRACE_ITER_FUNCTION is about to be set */
> +	if (function_enabled || (!set && !(trace_flags & TRACE_ITER_FUNCTION)))
> +		return 0;
> +
> +	if (graph)
>  		ret = register_ftrace_graph(&irqsoff_graph_return,
>  					    &irqsoff_graph_entry);
> +	else
> +		ret = register_ftrace_function(&trace_ops);
> +
> +	if (!ret)
> +		function_enabled = true;
> +
> +	return ret;
> +}
> +
> +static void unregister_irqsoff_function(int graph)
> +{
> +	if (!function_enabled)
> +		return;
> +
> +	if (graph)
> +		unregister_ftrace_graph();
> +	else
> +		unregister_ftrace_function(&trace_ops);
> +
> +	function_enabled = false;
> +}
> +
> +static void irqsoff_function_set(int set)
> +{
> +	if (set)
> +		register_irqsoff_function(is_graph(), 1);
> +	else
> +		unregister_irqsoff_function(is_graph());
> +}
> +
> +static int irqsoff_flag_changed(struct tracer *tracer, u32 mask, int set)
> +{
> +	if (mask & TRACE_ITER_FUNCTION)
> +		irqsoff_function_set(set);
> +
> +	return trace_keep_overwrite(tracer, mask, set);
> +}
> +
> +static int start_irqsoff_tracer(struct trace_array *tr, int graph)
> +{
> +	int ret;
> +
> +	ret = register_irqsoff_function(graph, 0);
>  
>  	if (!ret && tracing_is_enabled())
>  		tracer_enabled = 1;
> @@ -550,10 +596,7 @@ static void stop_irqsoff_tracer(struct trace_array *tr, int graph)
>  {
>  	tracer_enabled = 0;
>  
> -	if (!graph)
> -		unregister_ftrace_function(&trace_ops);
> -	else
> -		unregister_ftrace_graph();
> +	unregister_irqsoff_function(graph);
>  }
>  
>  static void __irqsoff_tracer_init(struct trace_array *tr)
> @@ -561,14 +604,14 @@ static void __irqsoff_tracer_init(struct trace_array *tr)
>  	save_flags = trace_flags;
>  
>  	/* non overwrite screws up the latency tracers */
> -	set_tracer_flag(TRACE_ITER_OVERWRITE, 1);
> -	set_tracer_flag(TRACE_ITER_LATENCY_FMT, 1);
> +	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, 1);
> +	set_tracer_flag(tr, TRACE_ITER_LATENCY_FMT, 1);
>  
>  	tracing_max_latency = 0;
>  	irqsoff_trace = tr;
>  	/* make sure that the tracer is visible */
>  	smp_wmb();
> -	tracing_reset_online_cpus(tr);
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  
>  	if (start_irqsoff_tracer(tr, is_graph()))
>  		printk(KERN_ERR "failed to start irqsoff tracer\n");
> @@ -581,8 +624,8 @@ static void irqsoff_tracer_reset(struct trace_array *tr)
>  
>  	stop_irqsoff_tracer(tr, is_graph());
>  
> -	set_tracer_flag(TRACE_ITER_LATENCY_FMT, lat_flag);
> -	set_tracer_flag(TRACE_ITER_OVERWRITE, overwrite_flag);
> +	set_tracer_flag(tr, TRACE_ITER_LATENCY_FMT, lat_flag);
> +	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, overwrite_flag);
>  }
>  
>  static void irqsoff_tracer_start(struct trace_array *tr)
> @@ -615,7 +658,7 @@ static struct tracer irqsoff_tracer __read_mostly =
>  	.print_line     = irqsoff_print_line,
>  	.flags		= &tracer_flags,
>  	.set_flag	= irqsoff_set_flag,
> -	.flag_changed	= trace_keep_overwrite,
> +	.flag_changed	= irqsoff_flag_changed,
>  #ifdef CONFIG_FTRACE_SELFTEST
>  	.selftest    = trace_selftest_startup_irqsoff,
>  #endif
> @@ -649,7 +692,7 @@ static struct tracer preemptoff_tracer __read_mostly =
>  	.print_line     = irqsoff_print_line,
>  	.flags		= &tracer_flags,
>  	.set_flag	= irqsoff_set_flag,
> -	.flag_changed	= trace_keep_overwrite,
> +	.flag_changed	= irqsoff_flag_changed,
>  #ifdef CONFIG_FTRACE_SELFTEST
>  	.selftest    = trace_selftest_startup_preemptoff,
>  #endif
> @@ -685,7 +728,7 @@ static struct tracer preemptirqsoff_tracer __read_mostly =
>  	.print_line     = irqsoff_print_line,
>  	.flags		= &tracer_flags,
>  	.set_flag	= irqsoff_set_flag,
> -	.flag_changed	= trace_keep_overwrite,
> +	.flag_changed	= irqsoff_flag_changed,
>  #ifdef CONFIG_FTRACE_SELFTEST
>  	.selftest    = trace_selftest_startup_preemptirqsoff,
>  #endif
> diff --git a/kernel/trace/trace_kdb.c b/kernel/trace/trace_kdb.c
> index 3c5c5df..bd90e1b 100644
> --- a/kernel/trace/trace_kdb.c
> +++ b/kernel/trace/trace_kdb.c
> @@ -26,7 +26,7 @@ static void ftrace_dump_buf(int skip_lines, long cpu_file)
>  	trace_init_global_iter(&iter);
>  
>  	for_each_tracing_cpu(cpu) {
> -		atomic_inc(&iter.tr->data[cpu]->disabled);
> +		atomic_inc(&per_cpu_ptr(iter.trace_buffer->data, cpu)->disabled);
>  	}
>  
>  	old_userobj = trace_flags;
> @@ -43,17 +43,17 @@ static void ftrace_dump_buf(int skip_lines, long cpu_file)
>  	iter.iter_flags |= TRACE_FILE_LAT_FMT;
>  	iter.pos = -1;
>  
> -	if (cpu_file == TRACE_PIPE_ALL_CPU) {
> +	if (cpu_file == RING_BUFFER_ALL_CPUS) {
>  		for_each_tracing_cpu(cpu) {
>  			iter.buffer_iter[cpu] =
> -			ring_buffer_read_prepare(iter.tr->buffer, cpu);
> +			ring_buffer_read_prepare(iter.trace_buffer->buffer, cpu);
>  			ring_buffer_read_start(iter.buffer_iter[cpu]);
>  			tracing_iter_reset(&iter, cpu);
>  		}
>  	} else {
>  		iter.cpu_file = cpu_file;
>  		iter.buffer_iter[cpu_file] =
> -			ring_buffer_read_prepare(iter.tr->buffer, cpu_file);
> +			ring_buffer_read_prepare(iter.trace_buffer->buffer, cpu_file);
>  		ring_buffer_read_start(iter.buffer_iter[cpu_file]);
>  		tracing_iter_reset(&iter, cpu_file);
>  	}
> @@ -83,7 +83,7 @@ out:
>  	trace_flags = old_userobj;
>  
>  	for_each_tracing_cpu(cpu) {
> -		atomic_dec(&iter.tr->data[cpu]->disabled);
> +		atomic_dec(&per_cpu_ptr(iter.trace_buffer->data, cpu)->disabled);
>  	}
>  
>  	for_each_tracing_cpu(cpu)
> @@ -115,7 +115,7 @@ static int kdb_ftdump(int argc, const char **argv)
>  		    !cpu_online(cpu_file))
>  			return KDB_BADINT;
>  	} else {
> -		cpu_file = TRACE_PIPE_ALL_CPU;
> +		cpu_file = RING_BUFFER_ALL_CPUS;
>  	}
>  
>  	kdb_trap_printk++;
> diff --git a/kernel/trace/trace_mmiotrace.c b/kernel/trace/trace_mmiotrace.c
> index fd3c8aa..a5e8f48 100644
> --- a/kernel/trace/trace_mmiotrace.c
> +++ b/kernel/trace/trace_mmiotrace.c
> @@ -31,7 +31,7 @@ static void mmio_reset_data(struct trace_array *tr)
>  	overrun_detected = false;
>  	prev_overruns = 0;
>  
> -	tracing_reset_online_cpus(tr);
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  }
>  
>  static int mmio_trace_init(struct trace_array *tr)
> @@ -128,7 +128,7 @@ static void mmio_close(struct trace_iterator *iter)
>  static unsigned long count_overruns(struct trace_iterator *iter)
>  {
>  	unsigned long cnt = atomic_xchg(&dropped_count, 0);
> -	unsigned long over = ring_buffer_overruns(iter->tr->buffer);
> +	unsigned long over = ring_buffer_overruns(iter->trace_buffer->buffer);
>  
>  	if (over > prev_overruns)
>  		cnt += over - prev_overruns;
> @@ -309,7 +309,7 @@ static void __trace_mmiotrace_rw(struct trace_array *tr,
>  				struct mmiotrace_rw *rw)
>  {
>  	struct ftrace_event_call *call = &event_mmiotrace_rw;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	struct ring_buffer_event *event;
>  	struct trace_mmiotrace_rw *entry;
>  	int pc = preempt_count();
> @@ -330,7 +330,7 @@ static void __trace_mmiotrace_rw(struct trace_array *tr,
>  void mmio_trace_rw(struct mmiotrace_rw *rw)
>  {
>  	struct trace_array *tr = mmio_trace_array;
> -	struct trace_array_cpu *data = tr->data[smp_processor_id()];
> +	struct trace_array_cpu *data = per_cpu_ptr(tr->trace_buffer.data, smp_processor_id());
>  	__trace_mmiotrace_rw(tr, data, rw);
>  }
>  
> @@ -339,7 +339,7 @@ static void __trace_mmiotrace_map(struct trace_array *tr,
>  				struct mmiotrace_map *map)
>  {
>  	struct ftrace_event_call *call = &event_mmiotrace_map;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	struct ring_buffer_event *event;
>  	struct trace_mmiotrace_map *entry;
>  	int pc = preempt_count();
> @@ -363,7 +363,7 @@ void mmio_trace_mapping(struct mmiotrace_map *map)
>  	struct trace_array_cpu *data;
>  
>  	preempt_disable();
> -	data = tr->data[smp_processor_id()];
> +	data = per_cpu_ptr(tr->trace_buffer.data, smp_processor_id());
>  	__trace_mmiotrace_map(tr, data, map);
>  	preempt_enable();
>  }
> diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
> index 194d796..f475b2a 100644
> --- a/kernel/trace/trace_output.c
> +++ b/kernel/trace/trace_output.c
> @@ -14,7 +14,7 @@
>  /* must be a power of 2 */
>  #define EVENT_HASHSIZE	128
>  
> -DECLARE_RWSEM(trace_event_mutex);
> +DECLARE_RWSEM(trace_event_sem);
>  
>  static struct hlist_head event_hash[EVENT_HASHSIZE] __read_mostly;
>  
> @@ -37,6 +37,22 @@ int trace_print_seq(struct seq_file *m, struct trace_seq *s)
>  	return ret;
>  }
>  
> +enum print_line_t trace_print_bputs_msg_only(struct trace_iterator *iter)
> +{
> +	struct trace_seq *s = &iter->seq;
> +	struct trace_entry *entry = iter->ent;
> +	struct bputs_entry *field;
> +	int ret;
> +
> +	trace_assign_type(field, entry);
> +
> +	ret = trace_seq_puts(s, field->str);
> +	if (!ret)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	return TRACE_TYPE_HANDLED;
> +}
> +
>  enum print_line_t trace_print_bprintk_msg_only(struct trace_iterator *iter)
>  {
>  	struct trace_seq *s = &iter->seq;
> @@ -397,6 +413,32 @@ ftrace_print_hex_seq(struct trace_seq *p, const unsigned char *buf, int buf_len)
>  }
>  EXPORT_SYMBOL(ftrace_print_hex_seq);
>  
> +int ftrace_raw_output_prep(struct trace_iterator *iter,
> +			   struct trace_event *trace_event)
> +{
> +	struct ftrace_event_call *event;
> +	struct trace_seq *s = &iter->seq;
> +	struct trace_seq *p = &iter->tmp_seq;
> +	struct trace_entry *entry;
> +	int ret;
> +
> +	event = container_of(trace_event, struct ftrace_event_call, event);
> +	entry = iter->ent;
> +
> +	if (entry->type != event->event.type) {
> +		WARN_ON_ONCE(1);
> +		return TRACE_TYPE_UNHANDLED;
> +	}
> +
> +	trace_seq_init(p);
> +	ret = trace_seq_printf(s, "%s: ", event->name);
> +	if (!ret)
> +		return TRACE_TYPE_PARTIAL_LINE;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ftrace_raw_output_prep);
> +
>  #ifdef CONFIG_KRETPROBES
>  static inline const char *kretprobed(const char *name)
>  {
> @@ -617,7 +659,7 @@ lat_print_timestamp(struct trace_iterator *iter, u64 next_ts)
>  {
>  	unsigned long verbose = trace_flags & TRACE_ITER_VERBOSE;
>  	unsigned long in_ns = iter->iter_flags & TRACE_FILE_TIME_IN_NS;
> -	unsigned long long abs_ts = iter->ts - iter->tr->time_start;
> +	unsigned long long abs_ts = iter->ts - iter->trace_buffer->time_start;
>  	unsigned long long rel_ts = next_ts - iter->ts;
>  	struct trace_seq *s = &iter->seq;
>  
> @@ -784,12 +826,12 @@ static int trace_search_list(struct list_head **list)
>  
>  void trace_event_read_lock(void)
>  {
> -	down_read(&trace_event_mutex);
> +	down_read(&trace_event_sem);
>  }
>  
>  void trace_event_read_unlock(void)
>  {
> -	up_read(&trace_event_mutex);
> +	up_read(&trace_event_sem);
>  }
>  
>  /**
> @@ -812,7 +854,7 @@ int register_ftrace_event(struct trace_event *event)
>  	unsigned key;
>  	int ret = 0;
>  
> -	down_write(&trace_event_mutex);
> +	down_write(&trace_event_sem);
>  
>  	if (WARN_ON(!event))
>  		goto out;
> @@ -867,14 +909,14 @@ int register_ftrace_event(struct trace_event *event)
>  
>  	ret = event->type;
>   out:
> -	up_write(&trace_event_mutex);
> +	up_write(&trace_event_sem);
>  
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(register_ftrace_event);
>  
>  /*
> - * Used by module code with the trace_event_mutex held for write.
> + * Used by module code with the trace_event_sem held for write.
>   */
>  int __unregister_ftrace_event(struct trace_event *event)
>  {
> @@ -889,9 +931,9 @@ int __unregister_ftrace_event(struct trace_event *event)
>   */
>  int unregister_ftrace_event(struct trace_event *event)
>  {
> -	down_write(&trace_event_mutex);
> +	down_write(&trace_event_sem);
>  	__unregister_ftrace_event(event);
> -	up_write(&trace_event_mutex);
> +	up_write(&trace_event_sem);
>  
>  	return 0;
>  }
> @@ -1218,6 +1260,64 @@ static struct trace_event trace_user_stack_event = {
>  	.funcs		= &trace_user_stack_funcs,
>  };
>  
> +/* TRACE_BPUTS */
> +static enum print_line_t
> +trace_bputs_print(struct trace_iterator *iter, int flags,
> +		   struct trace_event *event)
> +{
> +	struct trace_entry *entry = iter->ent;
> +	struct trace_seq *s = &iter->seq;
> +	struct bputs_entry *field;
> +
> +	trace_assign_type(field, entry);
> +
> +	if (!seq_print_ip_sym(s, field->ip, flags))
> +		goto partial;
> +
> +	if (!trace_seq_puts(s, ": "))
> +		goto partial;
> +
> +	if (!trace_seq_puts(s, field->str))
> +		goto partial;
> +
> +	return TRACE_TYPE_HANDLED;
> +
> + partial:
> +	return TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +
> +static enum print_line_t
> +trace_bputs_raw(struct trace_iterator *iter, int flags,
> +		struct trace_event *event)
> +{
> +	struct bputs_entry *field;
> +	struct trace_seq *s = &iter->seq;
> +
> +	trace_assign_type(field, iter->ent);
> +
> +	if (!trace_seq_printf(s, ": %lx : ", field->ip))
> +		goto partial;
> +
> +	if (!trace_seq_puts(s, field->str))
> +		goto partial;
> +
> +	return TRACE_TYPE_HANDLED;
> +
> + partial:
> +	return TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +static struct trace_event_functions trace_bputs_funcs = {
> +	.trace		= trace_bputs_print,
> +	.raw		= trace_bputs_raw,
> +};
> +
> +static struct trace_event trace_bputs_event = {
> +	.type		= TRACE_BPUTS,
> +	.funcs		= &trace_bputs_funcs,
> +};
> +
>  /* TRACE_BPRINT */
>  static enum print_line_t
>  trace_bprint_print(struct trace_iterator *iter, int flags,
> @@ -1330,6 +1430,7 @@ static struct trace_event *events[] __initdata = {
>  	&trace_wake_event,
>  	&trace_stack_event,
>  	&trace_user_stack_event,
> +	&trace_bputs_event,
>  	&trace_bprint_event,
>  	&trace_print_event,
>  	NULL
> diff --git a/kernel/trace/trace_output.h b/kernel/trace/trace_output.h
> index c038eba..127a9d8 100644
> --- a/kernel/trace/trace_output.h
> +++ b/kernel/trace/trace_output.h
> @@ -5,6 +5,8 @@
>  #include "trace.h"
>  
>  extern enum print_line_t
> +trace_print_bputs_msg_only(struct trace_iterator *iter);
> +extern enum print_line_t
>  trace_print_bprintk_msg_only(struct trace_iterator *iter);
>  extern enum print_line_t
>  trace_print_printk_msg_only(struct trace_iterator *iter);
> @@ -31,7 +33,7 @@ trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry);
>  
>  /* used by module unregistering */
>  extern int __unregister_ftrace_event(struct trace_event *event);
> -extern struct rw_semaphore trace_event_mutex;
> +extern struct rw_semaphore trace_event_sem;
>  
>  #define MAX_MEMHEX_BYTES	8
>  #define HEX_CHARS		(MAX_MEMHEX_BYTES*2 + 1)
> diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> index 3374c79..4e98e3b 100644
> --- a/kernel/trace/trace_sched_switch.c
> +++ b/kernel/trace/trace_sched_switch.c
> @@ -28,7 +28,7 @@ tracing_sched_switch_trace(struct trace_array *tr,
>  			   unsigned long flags, int pc)
>  {
>  	struct ftrace_event_call *call = &event_context_switch;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  	struct ring_buffer_event *event;
>  	struct ctx_switch_entry *entry;
>  
> @@ -69,7 +69,7 @@ probe_sched_switch(void *ignore, struct task_struct *prev, struct task_struct *n
>  	pc = preempt_count();
>  	local_irq_save(flags);
>  	cpu = raw_smp_processor_id();
> -	data = ctx_trace->data[cpu];
> +	data = per_cpu_ptr(ctx_trace->trace_buffer.data, cpu);
>  
>  	if (likely(!atomic_read(&data->disabled)))
>  		tracing_sched_switch_trace(ctx_trace, prev, next, flags, pc);
> @@ -86,7 +86,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
>  	struct ftrace_event_call *call = &event_wakeup;
>  	struct ring_buffer_event *event;
>  	struct ctx_switch_entry *entry;
> -	struct ring_buffer *buffer = tr->buffer;
> +	struct ring_buffer *buffer = tr->trace_buffer.buffer;
>  
>  	event = trace_buffer_lock_reserve(buffer, TRACE_WAKE,
>  					  sizeof(*entry), flags, pc);
> @@ -123,7 +123,7 @@ probe_sched_wakeup(void *ignore, struct task_struct *wakee, int success)
>  	pc = preempt_count();
>  	local_irq_save(flags);
>  	cpu = raw_smp_processor_id();
> -	data = ctx_trace->data[cpu];
> +	data = per_cpu_ptr(ctx_trace->trace_buffer.data, cpu);
>  
>  	if (likely(!atomic_read(&data->disabled)))
>  		tracing_sched_wakeup_trace(ctx_trace, wakee, current,
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index fde652c..fee77e1 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -37,6 +37,7 @@ static int wakeup_graph_entry(struct ftrace_graph_ent *trace);
>  static void wakeup_graph_return(struct ftrace_graph_ret *trace);
>  
>  static int save_flags;
> +static bool function_enabled;
>  
>  #define TRACE_DISPLAY_GRAPH     1
>  
> @@ -89,7 +90,7 @@ func_prolog_preempt_disable(struct trace_array *tr,
>  	if (cpu != wakeup_current_cpu)
>  		goto out_enable;
>  
> -	*data = tr->data[cpu];
> +	*data = per_cpu_ptr(tr->trace_buffer.data, cpu);
>  	disabled = atomic_inc_return(&(*data)->disabled);
>  	if (unlikely(disabled != 1))
>  		goto out;
> @@ -134,15 +135,60 @@ static struct ftrace_ops trace_ops __read_mostly =
>  };
>  #endif /* CONFIG_FUNCTION_TRACER */
>  
> -static int start_func_tracer(int graph)
> +static int register_wakeup_function(int graph, int set)
>  {
>  	int ret;
>  
> -	if (!graph)
> -		ret = register_ftrace_function(&trace_ops);
> -	else
> +	/* 'set' is set if TRACE_ITER_FUNCTION is about to be set */
> +	if (function_enabled || (!set && !(trace_flags & TRACE_ITER_FUNCTION)))
> +		return 0;
> +
> +	if (graph)
>  		ret = register_ftrace_graph(&wakeup_graph_return,
>  					    &wakeup_graph_entry);
> +	else
> +		ret = register_ftrace_function(&trace_ops);
> +
> +	if (!ret)
> +		function_enabled = true;
> +
> +	return ret;
> +}
> +
> +static void unregister_wakeup_function(int graph)
> +{
> +	if (!function_enabled)
> +		return;
> +
> +	if (graph)
> +		unregister_ftrace_graph();
> +	else
> +		unregister_ftrace_function(&trace_ops);
> +
> +	function_enabled = false;
> +}
> +
> +static void wakeup_function_set(int set)
> +{
> +	if (set)
> +		register_wakeup_function(is_graph(), 1);
> +	else
> +		unregister_wakeup_function(is_graph());
> +}
> +
> +static int wakeup_flag_changed(struct tracer *tracer, u32 mask, int set)
> +{
> +	if (mask & TRACE_ITER_FUNCTION)
> +		wakeup_function_set(set);
> +
> +	return trace_keep_overwrite(tracer, mask, set);
> +}
> +
> +static int start_func_tracer(int graph)
> +{
> +	int ret;
> +
> +	ret = register_wakeup_function(graph, 0);
>  
>  	if (!ret && tracing_is_enabled())
>  		tracer_enabled = 1;
> @@ -156,10 +202,7 @@ static void stop_func_tracer(int graph)
>  {
>  	tracer_enabled = 0;
>  
> -	if (!graph)
> -		unregister_ftrace_function(&trace_ops);
> -	else
> -		unregister_ftrace_graph();
> +	unregister_wakeup_function(graph);
>  }
>  
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> @@ -353,7 +396,7 @@ probe_wakeup_sched_switch(void *ignore,
>  
>  	/* disable local data, not wakeup_cpu data */
>  	cpu = raw_smp_processor_id();
> -	disabled = atomic_inc_return(&wakeup_trace->data[cpu]->disabled);
> +	disabled = atomic_inc_return(&per_cpu_ptr(wakeup_trace->trace_buffer.data, cpu)->disabled);
>  	if (likely(disabled != 1))
>  		goto out;
>  
> @@ -365,7 +408,7 @@ probe_wakeup_sched_switch(void *ignore,
>  		goto out_unlock;
>  
>  	/* The task we are waiting for is waking up */
> -	data = wakeup_trace->data[wakeup_cpu];
> +	data = per_cpu_ptr(wakeup_trace->trace_buffer.data, wakeup_cpu);
>  
>  	__trace_function(wakeup_trace, CALLER_ADDR0, CALLER_ADDR1, flags, pc);
>  	tracing_sched_switch_trace(wakeup_trace, prev, next, flags, pc);
> @@ -387,7 +430,7 @@ out_unlock:
>  	arch_spin_unlock(&wakeup_lock);
>  	local_irq_restore(flags);
>  out:
> -	atomic_dec(&wakeup_trace->data[cpu]->disabled);
> +	atomic_dec(&per_cpu_ptr(wakeup_trace->trace_buffer.data, cpu)->disabled);
>  }
>  
>  static void __wakeup_reset(struct trace_array *tr)
> @@ -405,7 +448,7 @@ static void wakeup_reset(struct trace_array *tr)
>  {
>  	unsigned long flags;
>  
> -	tracing_reset_online_cpus(tr);
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  
>  	local_irq_save(flags);
>  	arch_spin_lock(&wakeup_lock);
> @@ -435,7 +478,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  		return;
>  
>  	pc = preempt_count();
> -	disabled = atomic_inc_return(&wakeup_trace->data[cpu]->disabled);
> +	disabled = atomic_inc_return(&per_cpu_ptr(wakeup_trace->trace_buffer.data, cpu)->disabled);
>  	if (unlikely(disabled != 1))
>  		goto out;
>  
> @@ -458,7 +501,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  
>  	local_save_flags(flags);
>  
> -	data = wakeup_trace->data[wakeup_cpu];
> +	data = per_cpu_ptr(wakeup_trace->trace_buffer.data, wakeup_cpu);
>  	data->preempt_timestamp = ftrace_now(cpu);
>  	tracing_sched_wakeup_trace(wakeup_trace, p, current, flags, pc);
>  
> @@ -472,7 +515,7 @@ probe_wakeup(void *ignore, struct task_struct *p, int success)
>  out_locked:
>  	arch_spin_unlock(&wakeup_lock);
>  out:
> -	atomic_dec(&wakeup_trace->data[cpu]->disabled);
> +	atomic_dec(&per_cpu_ptr(wakeup_trace->trace_buffer.data, cpu)->disabled);
>  }
>  
>  static void start_wakeup_tracer(struct trace_array *tr)
> @@ -543,8 +586,8 @@ static int __wakeup_tracer_init(struct trace_array *tr)
>  	save_flags = trace_flags;
>  
>  	/* non overwrite screws up the latency tracers */
> -	set_tracer_flag(TRACE_ITER_OVERWRITE, 1);
> -	set_tracer_flag(TRACE_ITER_LATENCY_FMT, 1);
> +	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, 1);
> +	set_tracer_flag(tr, TRACE_ITER_LATENCY_FMT, 1);
>  
>  	tracing_max_latency = 0;
>  	wakeup_trace = tr;
> @@ -573,8 +616,8 @@ static void wakeup_tracer_reset(struct trace_array *tr)
>  	/* make sure we put back any tasks we are tracing */
>  	wakeup_reset(tr);
>  
> -	set_tracer_flag(TRACE_ITER_LATENCY_FMT, lat_flag);
> -	set_tracer_flag(TRACE_ITER_OVERWRITE, overwrite_flag);
> +	set_tracer_flag(tr, TRACE_ITER_LATENCY_FMT, lat_flag);
> +	set_tracer_flag(tr, TRACE_ITER_OVERWRITE, overwrite_flag);
>  }
>  
>  static void wakeup_tracer_start(struct trace_array *tr)
> @@ -600,7 +643,7 @@ static struct tracer wakeup_tracer __read_mostly =
>  	.print_line	= wakeup_print_line,
>  	.flags		= &tracer_flags,
>  	.set_flag	= wakeup_set_flag,
> -	.flag_changed	= trace_keep_overwrite,
> +	.flag_changed	= wakeup_flag_changed,
>  #ifdef CONFIG_FTRACE_SELFTEST
>  	.selftest    = trace_selftest_startup_wakeup,
>  #endif
> @@ -622,7 +665,7 @@ static struct tracer wakeup_rt_tracer __read_mostly =
>  	.print_line	= wakeup_print_line,
>  	.flags		= &tracer_flags,
>  	.set_flag	= wakeup_set_flag,
> -	.flag_changed	= trace_keep_overwrite,
> +	.flag_changed	= wakeup_flag_changed,
>  #ifdef CONFIG_FTRACE_SELFTEST
>  	.selftest    = trace_selftest_startup_wakeup,
>  #endif
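
The function_enabled bool above keeps the wakeup tracer from double-registering
(or double-unregistering) the function tracer now that both start_func_tracer()
and the new TRACE_ITER_FUNCTION flag handler can toggle it. Here is a small
userspace model of that guard; the fake register/unregister hooks and names are
illustrative only, standing in for the ftrace calls.

#include <stdbool.h>
#include <stdio.h>

#define ITER_FUNCTION	(1u << 0)

static unsigned int trace_flags = ITER_FUNCTION;
static bool function_enabled;

static int fake_register(void)    { puts("registered");   return 0; }
static void fake_unregister(void) { puts("unregistered"); }

/* 'set' is nonzero when ITER_FUNCTION is about to be turned on. */
static int register_wakeup_function(int set)
{
	/* Already registered, or the function flag is off and not being set. */
	if (function_enabled || (!set && !(trace_flags & ITER_FUNCTION)))
		return 0;

	if (fake_register() == 0)
		function_enabled = true;
	return 0;
}

static void unregister_wakeup_function(void)
{
	if (!function_enabled)
		return;
	fake_unregister();
	function_enabled = false;
}

int main(void)
{
	register_wakeup_function(0);	/* registers: flag already set  */
	register_wakeup_function(1);	/* no-op: already enabled       */
	unregister_wakeup_function();	/* unregisters                  */
	unregister_wakeup_function();	/* no-op: already disabled      */
	return 0;
}
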
> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> index 51c819c..55e2cf6 100644
> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -21,13 +21,13 @@ static inline int trace_valid_entry(struct trace_entry *entry)
>  	return 0;
>  }
>  
> -static int trace_test_buffer_cpu(struct trace_array *tr, int cpu)
> +static int trace_test_buffer_cpu(struct trace_buffer *buf, int cpu)
>  {
>  	struct ring_buffer_event *event;
>  	struct trace_entry *entry;
>  	unsigned int loops = 0;
>  
> -	while ((event = ring_buffer_consume(tr->buffer, cpu, NULL, NULL))) {
> +	while ((event = ring_buffer_consume(buf->buffer, cpu, NULL, NULL))) {
>  		entry = ring_buffer_event_data(event);
>  
>  		/*
> @@ -58,7 +58,7 @@ static int trace_test_buffer_cpu(struct trace_array *tr, int cpu)
>   * Test the trace buffer to see if all the elements
>   * are still sane.
>   */
> -static int trace_test_buffer(struct trace_array *tr, unsigned long *count)
> +static int trace_test_buffer(struct trace_buffer *buf, unsigned long *count)
>  {
>  	unsigned long flags, cnt = 0;
>  	int cpu, ret = 0;
> @@ -67,7 +67,7 @@ static int trace_test_buffer(struct trace_array *tr, unsigned long *count)
>  	local_irq_save(flags);
>  	arch_spin_lock(&ftrace_max_lock);
>  
> -	cnt = ring_buffer_entries(tr->buffer);
> +	cnt = ring_buffer_entries(buf->buffer);
>  
>  	/*
>  	 * The trace_test_buffer_cpu runs a while loop to consume all data.
> @@ -78,7 +78,7 @@ static int trace_test_buffer(struct trace_array *tr, unsigned long *count)
>  	 */
>  	tracing_off();
>  	for_each_possible_cpu(cpu) {
> -		ret = trace_test_buffer_cpu(tr, cpu);
> +		ret = trace_test_buffer_cpu(buf, cpu);
>  		if (ret)
>  			break;
>  	}
> @@ -355,7 +355,7 @@ int trace_selftest_startup_dynamic_tracing(struct tracer *trace,
>  	msleep(100);
>  
>  	/* we should have nothing in the buffer */
> -	ret = trace_test_buffer(tr, &count);
> +	ret = trace_test_buffer(&tr->trace_buffer, &count);
>  	if (ret)
>  		goto out;
>  
> @@ -376,7 +376,7 @@ int trace_selftest_startup_dynamic_tracing(struct tracer *trace,
>  	ftrace_enabled = 0;
>  
>  	/* check the trace buffer */
> -	ret = trace_test_buffer(tr, &count);
> +	ret = trace_test_buffer(&tr->trace_buffer, &count);
>  	tracing_start();
>  
>  	/* we should only have one item */
> @@ -666,7 +666,7 @@ trace_selftest_startup_function(struct tracer *trace, struct trace_array *tr)
>  	ftrace_enabled = 0;
>  
>  	/* check the trace buffer */
> -	ret = trace_test_buffer(tr, &count);
> +	ret = trace_test_buffer(&tr->trace_buffer, &count);
>  	trace->reset(tr);
>  	tracing_start();
>  
> @@ -703,8 +703,6 @@ trace_selftest_startup_function(struct tracer *trace, struct trace_array *tr)
>  /* Maximum number of functions to trace before diagnosing a hang */
>  #define GRAPH_MAX_FUNC_TEST	100000000
>  
> -static void
> -__ftrace_dump(bool disable_tracing, enum ftrace_dump_mode oops_dump_mode);
>  static unsigned int graph_hang_thresh;
>  
>  /* Wrap the real function entry probe to avoid possible hanging */
> @@ -714,8 +712,11 @@ static int trace_graph_entry_watchdog(struct ftrace_graph_ent *trace)
>  	if (unlikely(++graph_hang_thresh > GRAPH_MAX_FUNC_TEST)) {
>  		ftrace_graph_stop();
>  		printk(KERN_WARNING "BUG: Function graph tracer hang!\n");
> -		if (ftrace_dump_on_oops)
> -			__ftrace_dump(false, DUMP_ALL);
> +		if (ftrace_dump_on_oops) {
> +			ftrace_dump(DUMP_ALL);
> +			/* ftrace_dump() disables tracing */
> +			tracing_on();
> +		}
>  		return 0;
>  	}
>  
> @@ -737,7 +738,7 @@ trace_selftest_startup_function_graph(struct tracer *trace,
>  	 * Simulate the init() callback but we attach a watchdog callback
>  	 * to detect and recover from possible hangs
>  	 */
> -	tracing_reset_online_cpus(tr);
> +	tracing_reset_online_cpus(&tr->trace_buffer);
>  	set_graph_array(tr);
>  	ret = register_ftrace_graph(&trace_graph_return,
>  				    &trace_graph_entry_watchdog);
> @@ -760,7 +761,7 @@ trace_selftest_startup_function_graph(struct tracer *trace,
>  	tracing_stop();
>  
>  	/* check the trace buffer */
> -	ret = trace_test_buffer(tr, &count);
> +	ret = trace_test_buffer(&tr->trace_buffer, &count);
>  
>  	trace->reset(tr);
>  	tracing_start();
> @@ -815,9 +816,9 @@ trace_selftest_startup_irqsoff(struct tracer *trace, struct trace_array *tr)
>  	/* stop the tracing. */
>  	tracing_stop();
>  	/* check both trace buffers */
> -	ret = trace_test_buffer(tr, NULL);
> +	ret = trace_test_buffer(&tr->trace_buffer, NULL);
>  	if (!ret)
> -		ret = trace_test_buffer(&max_tr, &count);
> +		ret = trace_test_buffer(&tr->max_buffer, &count);
>  	trace->reset(tr);
>  	tracing_start();
>  
> @@ -877,9 +878,9 @@ trace_selftest_startup_preemptoff(struct tracer *trace, struct trace_array *tr)
>  	/* stop the tracing. */
>  	tracing_stop();
>  	/* check both trace buffers */
> -	ret = trace_test_buffer(tr, NULL);
> +	ret = trace_test_buffer(&tr->trace_buffer, NULL);
>  	if (!ret)
> -		ret = trace_test_buffer(&max_tr, &count);
> +		ret = trace_test_buffer(&tr->max_buffer, &count);
>  	trace->reset(tr);
>  	tracing_start();
>  
> @@ -943,11 +944,11 @@ trace_selftest_startup_preemptirqsoff(struct tracer *trace, struct trace_array *
>  	/* stop the tracing. */
>  	tracing_stop();
>  	/* check both trace buffers */
> -	ret = trace_test_buffer(tr, NULL);
> +	ret = trace_test_buffer(&tr->trace_buffer, NULL);
>  	if (ret)
>  		goto out;
>  
> -	ret = trace_test_buffer(&max_tr, &count);
> +	ret = trace_test_buffer(&tr->max_buffer, &count);
>  	if (ret)
>  		goto out;
>  
> @@ -973,11 +974,11 @@ trace_selftest_startup_preemptirqsoff(struct tracer *trace, struct trace_array *
>  	/* stop the tracing. */
>  	tracing_stop();
>  	/* check both trace buffers */
> -	ret = trace_test_buffer(tr, NULL);
> +	ret = trace_test_buffer(&tr->trace_buffer, NULL);
>  	if (ret)
>  		goto out;
>  
> -	ret = trace_test_buffer(&max_tr, &count);
> +	ret = trace_test_buffer(&tr->max_buffer, &count);
>  
>  	if (!ret && !count) {
>  		printk(KERN_CONT ".. no entries found ..");
> @@ -1084,10 +1085,10 @@ trace_selftest_startup_wakeup(struct tracer *trace, struct trace_array *tr)
>  	/* stop the tracing. */
>  	tracing_stop();
>  	/* check both trace buffers */
> -	ret = trace_test_buffer(tr, NULL);
> +	ret = trace_test_buffer(&tr->trace_buffer, NULL);
>  	printk("ret = %d\n", ret);
>  	if (!ret)
> -		ret = trace_test_buffer(&max_tr, &count);
> +		ret = trace_test_buffer(&tr->max_buffer, &count);
>  
> 
>  	trace->reset(tr);
> @@ -1126,7 +1127,7 @@ trace_selftest_startup_sched_switch(struct tracer *trace, struct trace_array *tr
>  	/* stop the tracing. */
>  	tracing_stop();
>  	/* check the trace buffer */
> -	ret = trace_test_buffer(tr, &count);
> +	ret = trace_test_buffer(&tr->trace_buffer, &count);
>  	trace->reset(tr);
>  	tracing_start();
>  
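
The selftest changes above are mostly plumbing: trace_test_buffer() now takes
the buffer to examine instead of the whole trace_array, so the same helper can
check both tr->trace_buffer and tr->max_buffer (which replaces the old global
max_tr for the latency tracers). A toy sketch of that shape, using simplified
stand-in types rather than the kernel structures:

#include <stdio.h>

struct buffer_model { unsigned long entries; };

struct trace_array_model {
	struct buffer_model trace_buffer;	/* live buffer              */
	struct buffer_model max_buffer;		/* snapshot of max latency  */
};

/* Check one buffer; report how many entries it held. */
static int test_buffer(struct buffer_model *buf, unsigned long *count)
{
	*count = buf->entries;
	return buf->entries ? 0 : -1;	/* ".. no entries found .." */
}

int main(void)
{
	struct trace_array_model tr = {
		.trace_buffer = { .entries = 12 },
		.max_buffer   = { .entries = 3  },
	};
	unsigned long count;

	if (!test_buffer(&tr.trace_buffer, &count))
		printf("live buffer ok, %lu entries\n", count);
	if (!test_buffer(&tr.max_buffer, &count))
		printf("max buffer ok, %lu entries\n", count);
	return 0;
}
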
> diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> index 42ca822..aab277b 100644
> --- a/kernel/trace/trace_stack.c
> +++ b/kernel/trace/trace_stack.c
> @@ -20,13 +20,24 @@
>  
>  #define STACK_TRACE_ENTRIES 500
>  
> +#ifdef CC_USING_FENTRY
> +# define fentry		1
> +#else
> +# define fentry		0
> +#endif
> +
>  static unsigned long stack_dump_trace[STACK_TRACE_ENTRIES+1] =
>  	 { [0 ... (STACK_TRACE_ENTRIES)] = ULONG_MAX };
>  static unsigned stack_dump_index[STACK_TRACE_ENTRIES];
>  
> +/*
> + * Reserve one entry for the passed in ip. This will allow
> + * us to remove most or all of the stack size overhead
> + * added by the stack tracer itself.
> + */
>  static struct stack_trace max_stack_trace = {
> -	.max_entries		= STACK_TRACE_ENTRIES,
> -	.entries		= stack_dump_trace,
> +	.max_entries		= STACK_TRACE_ENTRIES - 1,
> +	.entries		= &stack_dump_trace[1],
>  };
>  
>  static unsigned long max_stack_size;
> @@ -39,25 +50,34 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
>  int stack_tracer_enabled;
>  static int last_stack_tracer_enabled;
>  
> -static inline void check_stack(void)
> +static inline void
> +check_stack(unsigned long ip, unsigned long *stack)
>  {
>  	unsigned long this_size, flags;
>  	unsigned long *p, *top, *start;
> +	static int tracer_frame;
> +	int frame_size = ACCESS_ONCE(tracer_frame);
>  	int i;
>  
> -	this_size = ((unsigned long)&this_size) & (THREAD_SIZE-1);
> +	this_size = ((unsigned long)stack) & (THREAD_SIZE-1);
>  	this_size = THREAD_SIZE - this_size;
> +	/* Remove the frame of the tracer */
> +	this_size -= frame_size;
>  
>  	if (this_size <= max_stack_size)
>  		return;
>  
>  	/* we do not handle interrupt stacks yet */
> -	if (!object_is_on_stack(&this_size))
> +	if (!object_is_on_stack(stack))
>  		return;
>  
>  	local_irq_save(flags);
>  	arch_spin_lock(&max_stack_lock);
>  
> +	/* In case another CPU set the tracer_frame on us */
> +	if (unlikely(!frame_size))
> +		this_size -= tracer_frame;
> +
>  	/* a race could have already updated it */
>  	if (this_size <= max_stack_size)
>  		goto out;
> @@ -70,10 +90,18 @@ static inline void check_stack(void)
>  	save_stack_trace(&max_stack_trace);
>  
>  	/*
> +	 * Add the passed in ip from the function tracer.
> +	 * Searching for this on the stack will skip over
> +	 * most of the overhead from the stack tracer itself.
> +	 */
> +	stack_dump_trace[0] = ip;
> +	max_stack_trace.nr_entries++;
> +
> +	/*
>  	 * Now find where in the stack these are.
>  	 */
>  	i = 0;
> -	start = &this_size;
> +	start = stack;
>  	top = (unsigned long *)
>  		(((unsigned long)start & ~(THREAD_SIZE-1)) + THREAD_SIZE);
>  
> @@ -97,6 +125,18 @@ static inline void check_stack(void)
>  				found = 1;
>  				/* Start the search from here */
>  				start = p + 1;
> +				/*
> +				 * We do not want to show the overhead
> +				 * of the stack tracer stack in the
> +				 * max stack. If we haven't figured
> +				 * out what that is, then figure it out
> +				 * now.
> +				 */
> +				if (unlikely(!tracer_frame) && i == 1) {
> +					tracer_frame = (p - stack) *
> +						sizeof(unsigned long);
> +					max_stack_size -= tracer_frame;
> +				}
>  			}
>  		}
>  
> @@ -113,6 +153,7 @@ static void
>  stack_trace_call(unsigned long ip, unsigned long parent_ip,
>  		 struct ftrace_ops *op, struct pt_regs *pt_regs)
>  {
> +	unsigned long stack;
>  	int cpu;
>  
>  	preempt_disable_notrace();
> @@ -122,7 +163,26 @@ stack_trace_call(unsigned long ip, unsigned long parent_ip,
>  	if (per_cpu(trace_active, cpu)++ != 0)
>  		goto out;
>  
> -	check_stack();
> +	/*
> +	 * When fentry is used, the traced function does not get
> +	 * its stack frame set up, and we lose the parent.
> +	 * The ip is pretty useless because the function tracer
> +	 * was called before that function set up its stack frame.
> +	 * In this case, we use the parent ip.
> +	 *
> +	 * By adding the return address of either the parent ip
> +	 * or the current ip we can disregard most of the stack usage
> +	 * caused by the stack tracer itself.
> +	 *
> +	 * The function tracer always reports the address of where the
> +	 * mcount call was, but the stack will hold the return address.
> +	 */
> +	if (fentry)
> +		ip = parent_ip;
> +	else
> +		ip += MCOUNT_INSN_SIZE;
> +
> +	check_stack(ip, &stack);
>  
>   out:
>  	per_cpu(trace_active, cpu)--;
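
The depth calculation in check_stack() boils down to a little pointer
arithmetic: the offset of the passed-in stack address within the
THREAD_SIZE-aligned stack gives the unused portion, THREAD_SIZE minus that is
the usage, and the measured tracer_frame is subtracted so the stack tracer's
own overhead is not reported. A standalone arithmetic sketch, where THREAD_SIZE
and the sample values are made up for illustration:

#include <stdio.h>

#define THREAD_SIZE	(8UL * 1024)	/* assumed 8K kernel stacks */

static unsigned long stack_depth(unsigned long sp, unsigned long tracer_frame)
{
	unsigned long this_size;

	this_size = sp & (THREAD_SIZE - 1);	/* offset within the stack area */
	this_size = THREAD_SIZE - this_size;	/* bytes used below the top     */
	return this_size - tracer_frame;	/* drop the tracer's own frame  */
}

int main(void)
{
	/* A stack pointer 0x1a40 bytes below the top of an 8K stack,
	 * with a 96-byte tracer frame already measured. */
	unsigned long sp = 0xffff880012340000UL + (THREAD_SIZE - 0x1a40);

	printf("reported stack usage: %lu bytes\n", stack_depth(sp, 96));
	return 0;
}

With that frame measured once (the tracer_frame static above), the reported
max_stack_size reflects the traced functions rather than the tracer itself.
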
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 7a809e3..8f2ac73 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -12,10 +12,6 @@
>  #include "trace.h"
>  
>  static DEFINE_MUTEX(syscall_trace_lock);
> -static int sys_refcount_enter;
> -static int sys_refcount_exit;
> -static DECLARE_BITMAP(enabled_enter_syscalls, NR_syscalls);
> -static DECLARE_BITMAP(enabled_exit_syscalls, NR_syscalls);
>  
>  static int syscall_enter_register(struct ftrace_event_call *event,
>  				 enum trace_reg type, void *data);
> @@ -41,7 +37,7 @@ static inline bool arch_syscall_match_sym_name(const char *sym, const char *name
>  	/*
>  	 * Only compare after the "sys" prefix. Archs that use
>  	 * syscall wrappers may have syscalls symbols aliases prefixed
> -	 * with "SyS" instead of "sys", leading to an unwanted
> +	 * with ".SyS" or ".sys" instead of "sys", leading to an unwanted
>  	 * mismatch.
>  	 */
>  	return !strcmp(sym + 3, name + 3);
> @@ -265,7 +261,7 @@ static void free_syscall_print_fmt(struct ftrace_event_call *call)
>  		kfree(call->print_fmt);
>  }
>  
> -static int syscall_enter_define_fields(struct ftrace_event_call *call)
> +static int __init syscall_enter_define_fields(struct ftrace_event_call *call)
>  {
>  	struct syscall_trace_enter trace;
>  	struct syscall_metadata *meta = call->data;
> @@ -288,7 +284,7 @@ static int syscall_enter_define_fields(struct ftrace_event_call *call)
>  	return ret;
>  }
>  
> -static int syscall_exit_define_fields(struct ftrace_event_call *call)
> +static int __init syscall_exit_define_fields(struct ftrace_event_call *call)
>  {
>  	struct syscall_trace_exit trace;
>  	int ret;
> @@ -303,8 +299,9 @@ static int syscall_exit_define_fields(struct ftrace_event_call *call)
>  	return ret;
>  }
>  
> -static void ftrace_syscall_enter(void *ignore, struct pt_regs *regs, long id)
> +static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
>  {
> +	struct trace_array *tr = data;
>  	struct syscall_trace_enter *entry;
>  	struct syscall_metadata *sys_data;
>  	struct ring_buffer_event *event;
> @@ -315,7 +312,7 @@ static void ftrace_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>  	syscall_nr = trace_get_syscall_nr(current, regs);
>  	if (syscall_nr < 0)
>  		return;
> -	if (!test_bit(syscall_nr, enabled_enter_syscalls))
> +	if (!test_bit(syscall_nr, tr->enabled_enter_syscalls))
>  		return;
>  
>  	sys_data = syscall_nr_to_meta(syscall_nr);
> @@ -324,7 +321,8 @@ static void ftrace_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>  
>  	size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
>  
> -	event = trace_current_buffer_lock_reserve(&buffer,
> +	buffer = tr->trace_buffer.buffer;
> +	event = trace_buffer_lock_reserve(buffer,
>  			sys_data->enter_event->event.type, size, 0, 0);
>  	if (!event)
>  		return;
> @@ -338,8 +336,9 @@ static void ftrace_syscall_enter(void *ignore, struct pt_regs *regs, long id)
>  		trace_current_buffer_unlock_commit(buffer, event, 0, 0);
>  }
>  
> -static void ftrace_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
> +static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
>  {
> +	struct trace_array *tr = data;
>  	struct syscall_trace_exit *entry;
>  	struct syscall_metadata *sys_data;
>  	struct ring_buffer_event *event;
> @@ -349,14 +348,15 @@ static void ftrace_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
>  	syscall_nr = trace_get_syscall_nr(current, regs);
>  	if (syscall_nr < 0)
>  		return;
> -	if (!test_bit(syscall_nr, enabled_exit_syscalls))
> +	if (!test_bit(syscall_nr, tr->enabled_exit_syscalls))
>  		return;
>  
>  	sys_data = syscall_nr_to_meta(syscall_nr);
>  	if (!sys_data)
>  		return;
>  
> -	event = trace_current_buffer_lock_reserve(&buffer,
> +	buffer = tr->trace_buffer.buffer;
> +	event = trace_buffer_lock_reserve(buffer,
>  			sys_data->exit_event->event.type, sizeof(*entry), 0, 0);
>  	if (!event)
>  		return;
> @@ -370,8 +370,10 @@ static void ftrace_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
>  		trace_current_buffer_unlock_commit(buffer, event, 0, 0);
>  }
>  
> -static int reg_event_syscall_enter(struct ftrace_event_call *call)
> +static int reg_event_syscall_enter(struct ftrace_event_file *file,
> +				   struct ftrace_event_call *call)
>  {
> +	struct trace_array *tr = file->tr;
>  	int ret = 0;
>  	int num;
>  
> @@ -379,33 +381,37 @@ static int reg_event_syscall_enter(struct ftrace_event_call *call)
>  	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
>  		return -ENOSYS;
>  	mutex_lock(&syscall_trace_lock);
> -	if (!sys_refcount_enter)
> -		ret = register_trace_sys_enter(ftrace_syscall_enter, NULL);
> +	if (!tr->sys_refcount_enter)
> +		ret = register_trace_sys_enter(ftrace_syscall_enter, tr);
>  	if (!ret) {
> -		set_bit(num, enabled_enter_syscalls);
> -		sys_refcount_enter++;
> +		set_bit(num, tr->enabled_enter_syscalls);
> +		tr->sys_refcount_enter++;
>  	}
>  	mutex_unlock(&syscall_trace_lock);
>  	return ret;
>  }
>  
> -static void unreg_event_syscall_enter(struct ftrace_event_call *call)
> +static void unreg_event_syscall_enter(struct ftrace_event_file *file,
> +				      struct ftrace_event_call *call)
>  {
> +	struct trace_array *tr = file->tr;
>  	int num;
>  
>  	num = ((struct syscall_metadata *)call->data)->syscall_nr;
>  	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
>  		return;
>  	mutex_lock(&syscall_trace_lock);
> -	sys_refcount_enter--;
> -	clear_bit(num, enabled_enter_syscalls);
> -	if (!sys_refcount_enter)
> -		unregister_trace_sys_enter(ftrace_syscall_enter, NULL);
> +	tr->sys_refcount_enter--;
> +	clear_bit(num, tr->enabled_enter_syscalls);
> +	if (!tr->sys_refcount_enter)
> +		unregister_trace_sys_enter(ftrace_syscall_enter, tr);
>  	mutex_unlock(&syscall_trace_lock);
>  }
>  
> -static int reg_event_syscall_exit(struct ftrace_event_call *call)
> +static int reg_event_syscall_exit(struct ftrace_event_file *file,
> +				  struct ftrace_event_call *call)
>  {
> +	struct trace_array *tr = file->tr;
>  	int ret = 0;
>  	int num;
>  
> @@ -413,28 +419,30 @@ static int reg_event_syscall_exit(struct ftrace_event_call *call)
>  	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
>  		return -ENOSYS;
>  	mutex_lock(&syscall_trace_lock);
> -	if (!sys_refcount_exit)
> -		ret = register_trace_sys_exit(ftrace_syscall_exit, NULL);
> +	if (!tr->sys_refcount_exit)
> +		ret = register_trace_sys_exit(ftrace_syscall_exit, tr);
>  	if (!ret) {
> -		set_bit(num, enabled_exit_syscalls);
> -		sys_refcount_exit++;
> +		set_bit(num, tr->enabled_exit_syscalls);
> +		tr->sys_refcount_exit++;
>  	}
>  	mutex_unlock(&syscall_trace_lock);
>  	return ret;
>  }
>  
> -static void unreg_event_syscall_exit(struct ftrace_event_call *call)
> +static void unreg_event_syscall_exit(struct ftrace_event_file *file,
> +				     struct ftrace_event_call *call)
>  {
> +	struct trace_array *tr = file->tr;
>  	int num;
>  
>  	num = ((struct syscall_metadata *)call->data)->syscall_nr;
>  	if (WARN_ON_ONCE(num < 0 || num >= NR_syscalls))
>  		return;
>  	mutex_lock(&syscall_trace_lock);
> -	sys_refcount_exit--;
> -	clear_bit(num, enabled_exit_syscalls);
> -	if (!sys_refcount_exit)
> -		unregister_trace_sys_exit(ftrace_syscall_exit, NULL);
> +	tr->sys_refcount_exit--;
> +	clear_bit(num, tr->enabled_exit_syscalls);
> +	if (!tr->sys_refcount_exit)
> +		unregister_trace_sys_exit(ftrace_syscall_exit, tr);
>  	mutex_unlock(&syscall_trace_lock);
>  }
>  
> @@ -471,7 +479,7 @@ struct trace_event_functions exit_syscall_print_funcs = {
>  	.trace		= print_syscall_exit,
>  };
>  
> -struct ftrace_event_class event_class_syscall_enter = {
> +struct ftrace_event_class __refdata event_class_syscall_enter = {
>  	.system		= "syscalls",
>  	.reg		= syscall_enter_register,
>  	.define_fields	= syscall_enter_define_fields,
> @@ -479,7 +487,7 @@ struct ftrace_event_class event_class_syscall_enter = {
>  	.raw_init	= init_syscall_trace,
>  };
>  
> -struct ftrace_event_class event_class_syscall_exit = {
> +struct ftrace_event_class __refdata event_class_syscall_exit = {
>  	.system		= "syscalls",
>  	.reg		= syscall_exit_register,
>  	.define_fields	= syscall_exit_define_fields,
> @@ -685,11 +693,13 @@ static void perf_sysexit_disable(struct ftrace_event_call *call)
>  static int syscall_enter_register(struct ftrace_event_call *event,
>  				 enum trace_reg type, void *data)
>  {
> +	struct ftrace_event_file *file = data;
> +
>  	switch (type) {
>  	case TRACE_REG_REGISTER:
> -		return reg_event_syscall_enter(event);
> +		return reg_event_syscall_enter(file, event);
>  	case TRACE_REG_UNREGISTER:
> -		unreg_event_syscall_enter(event);
> +		unreg_event_syscall_enter(file, event);
>  		return 0;
>  
>  #ifdef CONFIG_PERF_EVENTS
> @@ -711,11 +721,13 @@ static int syscall_enter_register(struct ftrace_event_call *event,
>  static int syscall_exit_register(struct ftrace_event_call *event,
>  				 enum trace_reg type, void *data)
>  {
> +	struct ftrace_event_file *file = data;
> +
>  	switch (type) {
>  	case TRACE_REG_REGISTER:
> -		return reg_event_syscall_exit(event);
> +		return reg_event_syscall_exit(file, event);
>  	case TRACE_REG_UNREGISTER:
> -		unreg_event_syscall_exit(event);
> +		unreg_event_syscall_exit(file, event);
>  		return 0;
>  
>  #ifdef CONFIG_PERF_EVENTS
> 
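
The syscall tracing changes move the enable bitmaps and refcounts from
file-scope globals into struct trace_array, so each tracing instance registers
the enter/exit tracepoint probes for itself and only while at least one of its
syscall events is enabled. Below is a userspace model of that bookkeeping, with
the error handling dropped and fake probe registration standing in for the
tracepoint calls; the types and names are illustrative, not the kernel's.

#include <stdio.h>
#include <string.h>

#define NR_SYSCALLS	64
#define BITS_PER_WORD	(8 * sizeof(unsigned long))

struct trace_array_model {
	unsigned long	enabled_enter[NR_SYSCALLS / BITS_PER_WORD];
	int		sys_refcount_enter;
};

static void set_bit_m(int nr, unsigned long *map)
{
	map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

static void clear_bit_m(int nr, unsigned long *map)
{
	map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}

static void register_probe(struct trace_array_model *tr)
{
	printf("probe registered for tr %p\n", (void *)tr);
}

static void unregister_probe(struct trace_array_model *tr)
{
	printf("probe unregistered for tr %p\n", (void *)tr);
}

static void reg_enter(struct trace_array_model *tr, int nr)
{
	if (!tr->sys_refcount_enter)
		register_probe(tr);	/* first enabled event only */
	set_bit_m(nr, tr->enabled_enter);
	tr->sys_refcount_enter++;
}

static void unreg_enter(struct trace_array_model *tr, int nr)
{
	tr->sys_refcount_enter--;
	clear_bit_m(nr, tr->enabled_enter);
	if (!tr->sys_refcount_enter)
		unregister_probe(tr);	/* last enabled event only */
}

int main(void)
{
	struct trace_array_model tr;

	memset(&tr, 0, sizeof(tr));
	reg_enter(&tr, 3);	/* first event: probe registered    */
	reg_enter(&tr, 7);	/* second event: refcount only      */
	unreg_enter(&tr, 3);
	unreg_enter(&tr, 7);	/* last event: probe unregistered   */
	return 0;
}

Passing tr as the probe's data pointer (instead of NULL) is what lets the same
ftrace_syscall_enter/exit callbacks serve every instance.
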


