linux-kernel - Re: [PATCH] ftrace: dump out ftrace buffers to console on panic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 30 Jul 2008 20:39:47 -0400 (EDT)
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Andrew Morton <akpm@...ux-foundation.org>
cc:	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] ftrace: dump out ftrace buffers to console on panic


On Wed, 30 Jul 2008, Andrew Morton wrote:

> On Wed, 30 Jul 2008 17:13:24 -0400 (EDT)
> Steven Rostedt <rostedt@...dmis.org> wrote:
>
>>
>> At OLS I had a lot of interest to be able to have the ftrace buffers
>> dumped on panic.  Usually one would expect to uses kexec and examine
>> the buffers after a new kernel is loaded. But sometimes the resources
>> do not permit kdump and kexec, so having an option to still see the
>> sequence of events up to the crash is very advantageous.
>>
>> This patch adds the option to have the ftrace buffers dumped to the
>> console in the latency_trace format on a panic. When the option is set,
>> the default entries per CPU buffer are lowered to 16384, since the writing
>> to the serial (if that is the console) may take an aweful long time
>> otherwise.
>
> The modern spelling is "awful" ;)

I need to get spell check working with alpine :-/

>
>> Signed-off-by: Steven Rostedt <srostedt@...hat.com>
>> ---
>>   include/linux/ftrace.h |    6 +
>>   kernel/trace/Kconfig   |   10 ++
>>   kernel/trace/trace.c   |  179 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 194 insertions(+), 1 deletion(-)
>>
>> Index: linux-tip.git/kernel/trace/trace.c
>> ===================================================================
>> --- linux-tip.git.orig/kernel/trace/trace.c	2008-07-29 10:47:53.000000000 -0400
>> +++ linux-tip.git/kernel/trace/trace.c	2008-07-30 16:46:15.000000000 -0400
>> @@ -14,6 +14,7 @@
>>   #include <linux/utsrelease.h>
>>   #include <linux/kallsyms.h>
>>   #include <linux/seq_file.h>
>> +#include <linux/notifier.h>
>
> Your email client space-stuffed the patch.

Again, I just switched from pine to alpine, and ^R is not pulling in 
patches correctly. I thought I fixed it, but I guess I did not :-(

>
>>   #include <linux/debugfs.h>
>>   #include <linux/pagemap.h>
>>   #include <linux/hardirq.h>
>> @@ -22,6 +23,7 @@
>>   #include <linux/ftrace.h>
>>   #include <linux/module.h>
>>   #include <linux/percpu.h>
>> +#include <linux/kdebug.h>
>>   #include <linux/ctype.h>
>>   #include <linux/init.h>
>>   #include <linux/poll.h>
>> @@ -104,7 +106,16 @@ int				ftrace_function_enabled;
>>    * for a buffer. Note, the number of entries is always rounded
>>    * to ENTRIES_PER_PAGE.
>>    */
>> -static unsigned long		trace_nr_entries = 65536UL;
>> +#ifdef CONFIG_FTRACE_DUMP_ON_OOPS
>> +/*
>> + * If dump on oops is defined, lower the number of entries.
>> + * Usually this goes over the serial, and is used for debugging.
>
> I suspect that (netconsole > serial) nowadays.

Hmm, I find that serial is more reliable for me. Also it is what I usually 
have when I debug a machine remotely.

>
>> + */
>> +# define TRACE_ENTRIES_DEFAULT	16384UL
>> +#else
>> +# define TRACE_ENTRIES_DEFAULT	65536UL
>> +#endif
>> +static unsigned long		trace_nr_entries = TRACE_ENTRIES_DEFAULT;
>>
>>   /* trace_types holds a link list of available tracers. */
>>   static struct tracer		*trace_types __read_mostly;
>> @@ -2930,6 +2941,166 @@ static __init void tracer_init_debugfs(v
>>   #endif
>>   }
>>
>> +#ifdef CONFIG_FTRACE_DUMP_ON_OOPS
>> +static int trace_panic_handler(struct notifier_block *this,
>> +			       unsigned long event, void *unused)
>> +{
>> +	ftrace_dump();
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block trace_panic_notifier = {
>> +	.notifier_call  = trace_panic_handler,
>> +	.next           = NULL,
>> +	.priority       = 150   /* priority: INT_MAX >= x >= 0 */
>> +};
>> +
>> +static int trace_die_handler(struct notifier_block *self,
>> +			     unsigned long val,
>> +			     void *data)
>> +{
>> +	switch (val) {
>> +	case DIE_OOPS:
>> +		ftrace_dump();
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +	return NOTIFY_OK;
>> +}
>> +
>> +static struct notifier_block trace_die_notifier = {
>> +	.notifier_call = trace_die_handler,
>> +	.priority = 200
>> +};
>> +
>> +static void
>> +trace_printk_seq(struct trace_seq *s)
>> +{
>> +	int len = s->len >= PAGE_SIZE ? PAGE_SIZE - 1 : s->len;
>
> That's an open-coded min().
>
> Beware that some architectures' PAGE_SIZE has unsigned long type and
> others have a different type (unsigned?) so a min() on PAGE_SIZE will
> warn on some architectures and not on others.  min_t will fix that.

Thanks. This is cut and paste from another part of the code. I'll fix that 
too.

>
>> +	s->buffer[len] = 0;
>>
>> +	printk("%s", s->buffer);
>
> If the null-termination of ->buffer[] was actually needed then we have
> a big problem here.

hehe, yeah, my first run filled up the page before printing, and I saw the 
result of that.  I changed the code later to print at each iteration, 
which is much less than a page size. It should pretty much fit on a line.
Since the dump is the only user, I could just print the buffer without the 
checks. I can put a warn on if the len is greater than some max, and 
return.

>
> I'd suggest a conversion to vsnprintf().

??

>
>> +	trace_seq_reset(s);
>> +}
>> +
>> +
>> +static struct trace_iterator dump_iter;
>> +static raw_spinlock_t ftrace_dump_lock =
>> +	(raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
>
> argh.  Please never do an uncommented raw_spinlock_t.  It's just
> impossible for your readers to work out ytf it was chosen.

OK. I'll add the comment. The reason for doing this is because it is a 
special case in oops and not for normal use. If an oops happened in the
lockdep code, I don't want it to recursive fault here.

>
>> +static int dump_running;
>> +
>> +void ftrace_dump(void)
>> +{
>> +	struct trace_iterator *iter = &dump_iter;
>> +	struct trace_array_cpu *data;
>> +	static cpumask_t mask;
>> +	unsigned long flags;
>> +	int cpu, ret;
>> +	int cnt = 0;
>> +
>> +	local_irq_save(flags);
>> +
>> +	/* only one dump at a time */
>> +	__raw_spin_lock(&ftrace_dump_lock);
>
> Should we have a __raw_spinlock_irqsave()?

I'd like one but one doesn't exist. Actually, it is good we don't have one 
because __raw_spin_lock should rarely be used.

Hmm, I think I should just change the whole thing back to normal 
spin_locks and just hope that lockdep never oops ;-)


>
>> +	if (dump_running)
>> +		ret = 1;
>> +	else {
>> +		dump_running = 1;
>> +		ret = 0;
>> +	}
>> +	__raw_spin_unlock(&ftrace_dump_lock);
>> +
>> +	if (ret)
>> +		goto out;
>> +
>> +	/* No turning back! */
>> +	ftrace_kill_atomic();
>> +
>> +	printk("Dumping ftrace buffer:\n");
>
> didst thou run checkpatch?

Damn, I thought I did, but obviously I didn't!

>
>> +	memset(iter, 0, sizeof(*iter));
>
> This is zeroing out dump_iter, which is already zeroed.  Is it needed?

Good point. No it isn't needed. I'll replace that with a comment.

>
>> +	iter->tr = &global_trace;
>> +	iter->trace = current_trace;
>
> The use of local variable `iter' instead of simply using direct access
> to dump_iter declarifies things.  Perhaps it generates significantly
> better code?  If not, I'd suggest doing it the simple way.
>
> dump_iter could be made local to this function.

I can make it local to this function. The reason for using iter, is that I 
cut and pasted other code and modified it from there.

>
>> +	/*
>> +	 * We need to stop all tracing on all CPUS to read the
>> +	 * the next buffer. This is a bit expensive, but is
>> +	 * not done often. We fill all what we can read,
>> +	 * and then release the locks again.
>> +	 */
>> +
>> +	cpus_clear(mask);
>> +
>> +	for_each_tracing_cpu(cpu) {
>> +		data = iter->tr->data[cpu];
>> +
>> +		if (!head_page(data) || !data->trace_idx)
>> +			continue;
>> +
>> +		atomic_inc(&data->disabled);
>> +		cpu_set(cpu, mask);
>> +	}
>> +
>> +	for_each_cpu_mask(cpu, mask) {
>> +		data = iter->tr->data[cpu];
>> +		__raw_spin_lock(&data->lock);
>> +
>> +		if (data->overrun > iter->last_overrun[cpu])
>> +			iter->overrun[cpu] +=
>> +				data->overrun - iter->last_overrun[cpu];
>> +		iter->last_overrun[cpu] = data->overrun;
>> +	}
>> +
>> +	while (!trace_empty(iter)) {
>> +
>> +		if (!cnt)
>> +			printk("---------------------------------\n");
>> +
>> +		cnt++;
>> +
>> +		/* reset all but tr, trace, and overruns */
>> +		memset(&iter->seq, 0,
>> +		       sizeof(struct trace_iterator) -
>> +		       offsetof(struct trace_iterator, seq));
>> +		iter->iter_flags |= TRACE_FILE_LAT_FMT;
>> +		iter->pos = -1;
>> +
>> +		if (find_next_entry_inc(iter) != NULL) {
>> +			print_trace_line(iter);
>> +			trace_consume(iter);
>> +		}
>> +
>> +		trace_printk_seq(&iter->seq);
>> +	}
>> +
>> +	if (!cnt)
>> +		printk("   (ftrace buffer empty)\n");
>> +	else
>> +		printk("---------------------------------\n");
>> +
>> +	for_each_cpu_mask(cpu, mask) {
>> +		data = iter->tr->data[cpu];
>> +		__raw_spin_unlock(&data->lock);
>> +	}
>> +
>> +	for_each_cpu_mask(cpu, mask) {
>> +		data = iter->tr->data[cpu];
>> +		atomic_dec(&data->disabled);
>> +	}
>
> Can we combine those two loops into one?

Yes but I kept it this way to match the semantics of the taking and 
incrementing of the disabled atomic variable.  The disabling is done first 
but with a check to see if the buffers even exist, and we update mask to 
match.  Then we run the loop on the match and take the locks.

I feel safer on keeping the grabing of spinlocks and disabling the 
buffers separate, as well as releasing them.

>
>> +	__raw_spin_lock(&ftrace_dump_lock);
>> +	dump_running = 0;
>> +	__raw_spin_unlock(&ftrace_dump_lock);
>> +
>> + out:
>> +	local_irq_restore(flags);
>> +}
>> +#endif /* CONFIG_FTRACE_DUMP_ON_OOPS */
>
> Do we really need a new config option for this?  Would it hurt too much
> to make it unconditionally available?

I don't mind making it unconditionally available. I only added the config 
option because I thought people would like to turn it off.

I'll resubmit with the above concerns fixed and without the config option.

Thanks Andrew, as always, you give constructive feedback.

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/