linux-kernel - Re: [RFC PATCH -tip ] tracing: make a snapshot feature available from userspace.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1342049178.14828.67.camel@gandalf.stny.rr.com>
Date:	Wed, 11 Jul 2012 19:26:18 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Hiraku Toyooka <hiraku.toyooka.gu@...achi.com>
Cc:	yrl.pp-manager.tt@...achi.com,
	Frederic Weisbecker <fweisbec@...il.com>,
	Ingo Molnar <mingo@...hat.com>, Rob Landley <rob@...dley.net>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH -tip ] tracing: make a snapshot feature available
 from userspace.

Sorry for taking so long, I've been hammering away at other things. :-/


On Tue, 2012-06-05 at 21:06 +0900, Hiraku Toyooka wrote:
> Hello,
> 
> This patch provides new debugfs interfaces for taking a snapshot
> in ftrace.
> 
> Ftrace is available as a flight recorder. When a kernel panic
> occurs, we can get trace data from the dumped kernel image.
> 
> But in case of some systems, non-critical errors (such as
> application's errors) are difficult to handle because of following
> reasons.
> 
>  - When the system has to continue to run, it can't stop tracing.
>  - Even then, failure analysis is necessary to prevent the same
>    errors.
> 
> So, I propose making the ftrace's snapshot feature available from
> userspace to solve above dilemma.
> (See also
> https://events.linuxfoundation.org/images/stories/pdf/lcjp2012_toyooka.pdf)
> 
> This enables to preserve trace buffer at a particular point in time
> without stopping tracing.  When a snapshot is taken, ftrace swaps the
> current buffer with a spare buffer which is prepared in advance. This
> means that the tracing itself continues on the spare buffer.
> 
> Currently, ftrace has a snapshot feature available from kernel space
> and some tracers (e.g. irqsoff) are using it. This patch allows users
> to use the same snapshot feature via debugfs.
> 
> I added two debugfs files in "tracing" directory.
> 
>   snapshot_enabled:
> 
>         This is used to set or display whether the snapshot is
>         enabled. Echo 1 into this file to prepare a spare buffer
>         or 0 to shrink it. So, the memory for the spare buffer
>         will be consumed only when this knob is set.
> 
>   snapshot_pipe:
> 
>         This is used to take a snapshot and to read the output
>         of the snapshot (max_tr). Echo 1 into this file to take a
>         snapshot. Reads from this file is the same as the
>         "trace_pipe" file.
> 
> Here is an example of using the snapshot feature.
> 
>  # echo nop > current_tracer
>  # echo 1 > snapshot_enabled
>  # echo 1 > events/sched/enable
>  [...]
>  # echo 1 > snapshot_pipe
>  # cat snapshot_pipe
>             bash-3352  [001] dN.. 18440.883932: sched_wakeup: comm=migration/6 p
> id=28 prio=0 success=1 target_cpu=006
>             bash-3352  [001] dN.. 18440.883933: sched_wakeup: comm=migration/7 p
> id=32 prio=0 success=1 target_cpu=007
>             bash-3352  [001] d... 18440.883935: sched_switch: prev_comm=bash pre
> v_pid=3352 prev_prio=120 prev_state=R ==> next_comm=migration/1 next_pid=8 next_
> prio=0
> [...]
> 
> 
> Signed-off-by: Hiraku Toyooka <hiraku.toyooka.gu@...achi.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Frederic Weisbecker <fweisbec@...il.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Rob Landley <rob@...dley.net>
> Cc: Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>
> Cc: linux-doc@...r.kernel.org
> Cc: linux-kernel@...r.kernel.org
> ---
> 
>  Documentation/trace/ftrace.txt    |   47 +++++++++++
>  kernel/trace/trace.c              |  152 +++++++++++++++++++++++++++++++++++++
>  kernel/trace/trace.h              |    8 ++
>  kernel/trace/trace_irqsoff.c      |    3 +
>  kernel/trace/trace_sched_wakeup.c |    2 
>  5 files changed, 210 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
> index 6f51fed..df9fa13 100644
> --- a/Documentation/trace/ftrace.txt
> +++ b/Documentation/trace/ftrace.txt
> @@ -1842,6 +1842,53 @@ an error.
>   # cat buffer_size_kb
>  85
>  
> +
> +Snapshot
> +--------
> +If CONFIG_TRACER_MAX_TRACE is set, the (generic) snapshot
> +feature is available in all tracers except for the special
> +tracers which use a snapshot inside themselves(such as "irqsoff"
> +or "wakeup").

I find this kind of ironic, that it is only defined when one of the
tracers that can't use it defines it.

Maybe we should make it a prompt config for this feature.

> +
> +This enables to preserve trace buffer at a particular point in
> +time without stopping tracing.  When a snapshot is taken, ftrace
> +swaps the current buffer with a spare buffer which is prepared
> +in advance. This means that the tracing itself continues on the
> +spare buffer.
> +
> +Following debugfs files in "tracing" directory are related with
> +this feature.
> +
> +  snapshot_enabled:
> +
> +	This is used to set or display whether the snapshot is
> +	enabled. Echo 1 into this file to prepare a spare buffer
> +	or 0 to shrink it. So, the memory for the spare buffer
> +	will be consumed only when this knob is set.
> +
> +  snapshot_pipe:
> +
> +	This is used to take a snapshot and to read the output
> +	of the snapshot. Echo 1 into this file to take a
> +	snapshot. Reads from this file is the same as the
> +	"trace_pipe" file (described above "The File System"
> +	section), so that both reads from the snapshot and
> +	tracing are executable in parallel.

I don't really like the name snapshot_pipe. What about just calling it
snapshot, and just document that it works like trace_pipe?

Also, rename snapshot_enabled, to snapshot_allocate. If someone echos 1
into snapshot, it would automatically allocate the buffer (and set
snapshot_allocate to 1). If you don't want the delay (for allocation),
then you can do the echo 1 into snapshot_allocate first, and it would
behave as it does here.


> +
> +Here is an example of using the snapshot feature.
> +
> + # echo nop > current_tracer
> + # echo 1 > snapshot_enabled
> + # echo 1 > events/sched/enable
> + [...]
> + # echo 1 > snapshot_pipe
> + # cat snapshot_pipe
> +            bash-3352  [001] dN.. 18440.883932: sched_wakeup: comm=migration/6 pid=28 prio=0 success=1 target_cpu=006
> +            bash-3352  [001] dN.. 18440.883933: sched_wakeup: comm=migration/7 pid=32 prio=0 success=1 target_cpu=007
> +            bash-3352  [001] d... 18440.883935: sched_switch: prev_comm=bash prev_pid=3352 prev_prio=120 prev_state=R ==> next_comm=migration/1 next_pid=8 next_prio=0
> +[...]

BTW, why make it a pipe action anyway? As a snapshot doesn't have a
writer to it, doing just an iterate over the snapshot would make sense,
wouldn't it?

If you reply with a good rational for keeping the snapshot_pipe, then we
should have both snapshot and snapshot_pipe, where snapshot works like
trace and snapshot_pipe works like trace_pipe.

> +
> +
>  -----------
>  
>  More details can be found in the source code, in the
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 68032c6..8e722c3 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3266,7 +3266,8 @@ tracing_max_lat_write(struct file *filp, const char __user *ubuf,
>  	return cnt;
>  }
>  
> -static int tracing_open_pipe(struct inode *inode, struct file *filp)
> +static int __tracing_open_pipe(struct inode *inode, struct file *filp,
> +			       struct trace_array *tr)
>  {
>  	long cpu_file = (long) inode->i_private;
>  	struct trace_iterator *iter;
> @@ -3308,7 +3309,7 @@ static int tracing_open_pipe(struct inode *inode, struct file *filp)
>  		iter->iter_flags |= TRACE_FILE_LAT_FMT;
>  
>  	iter->cpu_file = cpu_file;
> -	iter->tr = &global_trace;
> +	iter->tr = tr;
>  	mutex_init(&iter->mutex);
>  	filp->private_data = iter;
>  
> @@ -3327,6 +3328,11 @@ fail:
>  	return ret;
>  }
>  
> +static int tracing_open_pipe(struct inode *inode, struct file *filp)
> +{
> +	return __tracing_open_pipe(inode, filp, &global_trace);
> +}
> +
>  static int tracing_release_pipe(struct inode *inode, struct file *file)
>  {
>  	struct trace_iterator *iter = file->private_data;
> @@ -4008,6 +4014,122 @@ static int tracing_clock_open(struct inode *inode, struct file *file)
>  	return single_open(file, tracing_clock_show, NULL);
>  }
>  
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +static int tracing_open_snapshot_pipe(struct inode *inode, struct file *filp)
> +{
> +	return __tracing_open_pipe(inode, filp, &max_tr);
> +}
> +
> +static ssize_t
> +tracing_write_snapshot_pipe(struct file *filp, const char __user *ubuf,
> +		       size_t cnt, loff_t *ppos)
> +{
> +	unsigned long val = 0;
> +	int ret;
> +
> +	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&trace_types_lock);
> +
> +	/* If current tracer's use_max_tr == 0, we prevent taking a snapshot */

Here we should just allocate it first.

> +	if (!current_trace->use_max_tr) {

I also have issues with the use of 'use_max_tr' here, but I'll explain
that below.

> +		ret = -EBUSY;
> +		goto out;
> +	}
> +	if (val) {
> +		unsigned long flags;
> +		local_irq_save(flags);

Interrupts will never be disabled here. Just use
'local_irq_disable/enable()', and remove flags.

> +		update_max_tr(&global_trace, current, raw_smp_processor_id());

Also, get rid of the 'raw_' that's for critical paths that can be broken
by the debug version of the normal user (like in function tracing and
callbacks from disabling interrupts).


> +		local_irq_restore(flags);
> +	}
> +out:
> +	mutex_unlock(&trace_types_lock);
> +
> +	if (!ret) {
> +		*ppos += cnt;
> +		ret = cnt;
> +	}
> +
> +	return ret;
> +}
> +
> +static ssize_t
> +tracing_snapshot_ctrl_read(struct file *filp, char __user *ubuf,
> +			   size_t cnt, loff_t *ppos)
> +{
> +	char buf[64];
> +	int r;
> +
> +	r = sprintf(buf, "%d\n", current_trace->use_max_tr);
> +	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
> +}
> +
> +static ssize_t
> +tracing_snapshot_ctrl_write(struct file *filp, const char __user *ubuf,
> +			    size_t cnt, loff_t *ppos)
> +{
> +	unsigned long val;
> +	int ret;
> +
> +	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
> +	if (ret)
> +		return ret;
> +
> +	val = !!val;
> +
> +	mutex_lock(&trace_types_lock);
> +	tracing_stop();
> +	arch_spin_lock(&ftrace_max_lock);
> +
> +	/* When always_use_max_tr == 1, we can't toggle use_max_tr. */
> +	if (current_trace->always_use_max_tr) {

I'll state my issue here. Don't rename use_max_tr to always_use_max_tr,
keep it as is and its use as is. Your other value should be
"allocated_snapshot", which can be set even for the use_max_tr user.


> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	if (!(current_trace->use_max_tr ^ val))
> +		goto out;
> +
> +	if (val) {
> +		int cpu;
> +		for_each_tracing_cpu(cpu) {
> +			ret = ring_buffer_resize(max_tr.buffer,
> +						global_trace.data[cpu]->entries,
> +						cpu);
> +			if (ret < 0)
> +				break;
> +			max_tr.data[cpu]->entries =
> +				global_trace.data[cpu]->entries;
> +		}
> +		if (ret < 0) {
> +			ring_buffer_resize(max_tr.buffer, 1,
> +					   RING_BUFFER_ALL_CPUS);
> +			set_buffer_entries(&max_tr, 1);
> +			ret = -ENOMEM;
> +			goto out;
> +		}

The above code is basically duplicated by the
__tracing_resize_ring_buffer(). As this code is not that trivial, lets
make use of a helper function and keep the bugs in one location. Have
both this function and the resize function use the same code.

In fact, the __tracing_resize_ring_buffer() could be modified to do all
the work. It will either shrink or expand as necessary. This isn't a
critical section so calling ring_buffer_resize() even when there's
nothing to do should not be an issue.

In fact, I think there's a small bug in the code that you just
duplicated. Not your bug, but you copied it.


> +		current_trace->use_max_tr = 1;
> +	} else {
> +		/* shrink buffer for saving memory */
> +		ring_buffer_resize(max_tr.buffer, 1, RING_BUFFER_ALL_CPUS);
> +		set_buffer_entries(&max_tr, 1);
> +		current_trace->use_max_tr = 0;
> +	}
> +out:
> +	arch_spin_unlock(&ftrace_max_lock);
> +	tracing_start();
> +	mutex_unlock(&trace_types_lock);
> +
> +	if (!ret) {
> +		*ppos += cnt;
> +		ret = cnt;
> +	}
> +	return ret;
> +}
> +#endif /* CONFIG_TRACER_MAX_TRACE */
> +
>  static const struct file_operations tracing_max_lat_fops = {
>  	.open		= tracing_open_generic,
>  	.read		= tracing_max_lat_read,
> @@ -4071,6 +4193,25 @@ static const struct file_operations trace_clock_fops = {
>  	.write		= tracing_clock_write,
>  };
>  
> +#ifdef CONFIG_TRACER_MAX_TRACE
> +static const struct file_operations tracing_snapshot_pipe_fops = {
> +	.open           = tracing_open_snapshot_pipe,
> +	.poll           = tracing_poll_pipe,
> +	.read           = tracing_read_pipe,
> +	.write          = tracing_write_snapshot_pipe,
> +	.splice_read    = tracing_splice_read_pipe,
> +	.release        = tracing_release_pipe,
> +	.llseek         = no_llseek,
> +};
> +
> +static const struct file_operations tracing_snapshot_ctrl_fops = {
> +	.open           = tracing_open_generic,
> +	.read           = tracing_snapshot_ctrl_read,
> +	.write          = tracing_snapshot_ctrl_write,
> +	.llseek         = generic_file_llseek,
> +};
> +#endif /* CONFIG_TRACER_MAX_TRACE */
> +
>  struct ftrace_buffer_info {
>  	struct trace_array	*tr;
>  	void			*spare;
> @@ -4819,6 +4960,13 @@ static __init int tracer_init_debugfs(void)
>  #ifdef CONFIG_TRACER_MAX_TRACE
>  	trace_create_file("tracing_max_latency", 0644, d_tracer,
>  			&tracing_max_latency, &tracing_max_lat_fops);
> +
> +	trace_create_file("snapshot_pipe", 0644, d_tracer,
> +			  (void *) TRACE_PIPE_ALL_CPU,
> +			  &tracing_snapshot_pipe_fops);
> +
> +	trace_create_file("snapshot_enabled", 0644, d_tracer,
> +			  NULL, &tracing_snapshot_ctrl_fops);
>  #endif
>  
>  	trace_create_file("tracing_thresh", 0644, d_tracer,
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 5aec220..602657a 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -254,6 +254,8 @@ struct tracer_flags {
>   * @print_line: callback that prints a trace
>   * @set_flag: signals one of your private flags changed (trace_options file)
>   * @flags: your private flags
> + * @use_max_tr: show whether the snapshot feature is available on this tracer
> + * @always_use_max_tr: show that max_tr is always used on this tracer
>   */
>  struct tracer {
>  	const char		*name;
> @@ -286,7 +288,13 @@ struct tracer {
>  	struct tracer		*next;
>  	struct tracer_flags	*flags;
>  	int			print_max;
> +	/* Dynamically toggled via "snapshot_enabled" debugfs file */
>  	int			use_max_tr;
> +	/*
> +	 * If this value is 1, this tracer always uses max_tr and "use_max_tr"
> +	 * can't be toggled.
> +	 */
> +	int			always_use_max_tr;

I already said how I dislike this. Leave use_max_tr alone, but add a
allocated_snapshot. Also, I hate the wasting of 4 bytes just to act like
a flag. We should probably make print_max, use_max_tr and
always_use_max_tr into 'bool's.

The print_max change should be a separate patch.

>  };
>  
> 
> diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
> index 99d20e9..37cdb75 100644
> --- a/kernel/trace/trace_irqsoff.c
> +++ b/kernel/trace/trace_irqsoff.c
> @@ -614,6 +614,7 @@ static struct tracer irqsoff_tracer __read_mostly =
>  	.open           = irqsoff_trace_open,
>  	.close          = irqsoff_trace_close,
>  	.use_max_tr	= 1,
> +	.always_use_max_tr	= 1,

Remove all these. Have the 'allocated_snapshot' get set when the tracer
is added, not here.

But on the whole, I like the idea of a snapshot (and this has actually
been on my todo list for some time, thanks for doing it for me ;-)

-- Steve


>  };
>  # define register_irqsoff(trace) register_tracer(&trace)
>  #else
> @@ -647,6 +648,7 @@ static struct tracer preemptoff_tracer __read_mostly =
>  	.open		= irqsoff_trace_open,
>  	.close		= irqsoff_trace_close,
>  	.use_max_tr	= 1,
> +	.always_use_max_tr	= 1,
>  };
>  # define register_preemptoff(trace) register_tracer(&trace)
>  #else
> @@ -682,6 +684,7 @@ static struct tracer preemptirqsoff_tracer __read_mostly =
>  	.open		= irqsoff_trace_open,
>  	.close		= irqsoff_trace_close,
>  	.use_max_tr	= 1,
> +	.always_use_max_tr	= 1,
>  };
>  
>  # define register_preemptirqsoff(trace) register_tracer(&trace)
> diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
> index ff791ea..b6e91a0 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -599,6 +599,7 @@ static struct tracer wakeup_tracer __read_mostly =
>  	.open		= wakeup_trace_open,
>  	.close		= wakeup_trace_close,
>  	.use_max_tr	= 1,
> +	.always_use_max_tr	= 1,
>  };
>  
>  static struct tracer wakeup_rt_tracer __read_mostly =
> @@ -620,6 +621,7 @@ static struct tracer wakeup_rt_tracer __read_mostly =
>  	.open		= wakeup_trace_open,
>  	.close		= wakeup_trace_close,
>  	.use_max_tr	= 1,
> +	.always_use_max_tr	= 1,
>  };
>  
>  __init static int init_wakeup_tracer(void)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/