[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CALWz4iwdFEatcoZ80yVt9r9Uf6LdcbHatTQdM8fcM0yWqcH2cQ@mail.gmail.com>
Date: Tue, 7 Feb 2012 16:12:17 -0800
From: Ying Han <yinghan@...gle.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
Rik van Riel <riel@...hat.com>,
Hugh Dickins <hughd@...gle.com>, Ingo Molnar <mingo@...e.hu>,
"H. Peter Anvin" <hpa@...or.com>, tglx@...utronix.de
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
Greg Thelen <gthelen@...gle.com>, x86@...nel.org
Subject: Re: [PATCH V7] Eliminate task stack trace duplication
On Tue, Feb 7, 2012 at 2:58 PM, Ying Han <yinghan@...gle.com> wrote:
> The problem with small dmesg ring buffer like 512k is that only limited number
> of task traces will be logged. Sometimes we lose important information only
> because of too many duplicated stack traces. This problem occurs when dumping
> lots of stacks in a single operation, such as sysrq-T.
>
> This patch tries to reduce the duplication of task stack trace in the dump
> message by hashing the task stack. The hashtable is a 32k pre-allocated buffer
> during bootup. Each time if we find the identical task trace in the task stack,
> we dump only the pid of the task which has the task trace dumped. So it is easy
> to back track to the full stack with the pid.
>
> When we do the hashing, we eliminate garbage entries from stack traces. Those
> entries are still being printed in the dump to provide more debugging
> informations.
>
> [ 58.469730] kworker/0:0 S 0000000000000000 0 4 2 0x00000000
> [ 58.469735] ffff88082fcfde80 0000000000000046 ffff88082e9d8000 ffff88082fcfc010
> [ 58.469739] ffff88082fce9860 0000000000011440 ffff88082fcfdfd8 ffff88082fcfdfd8
> [ 58.469743] 0000000000011440 0000000000000000 ffff88082fcee180 ffff88082fce9860
> [ 58.469747] Call Trace:
> [ 58.469751] [<ffffffff8108525a>] worker_thread+0x24b/0x250
> [ 58.469754] [<ffffffff8108500f>] ? manage_workers+0x192/0x192
> [ 58.469757] [<ffffffff810885bd>] kthread+0x82/0x8a
> [ 58.469760] [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
> [ 58.469763] [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
> [ 58.469765] [<ffffffff8141aed0>] ? gs_change+0xb/0xb
> [ 58.469768] kworker/u:0 S 0000000000000004 0 5 2 0x00000000
> [ 58.469773] ffff88082fcffe80 0000000000000046 ffff880800000000 ffff88082fcfe010
> [ 58.469777] ffff88082fcea080 0000000000011440 ffff88082fcfffd8 ffff88082fcfffd8
> [ 58.469781] 0000000000011440 0000000000000000 ffff88082fd4e9a0 ffff88082fcea080
> [ 58.469785] Call Trace:
> [ 58.469786] <Same stack as pid 4>
> [ 58.470235] kworker/0:1 S 0000000000000000 0 13 2 0x00000000
> [ 58.470255] ffff88082fd3fe80 0000000000000046 ffff880800000000 ffff88082fd3e010
> [ 58.470279] ffff88082fcee180 0000000000011440 ffff88082fd3ffd8 ffff88082fd3ffd8
> [ 58.470301] 0000000000011440 0000000000000000 ffffffff8180b020 ffff88082fcee180
> [ 58.470325] Call Trace:
> [ 58.470332] <Same stack as pid 4>
>
> changelog v7..v6:
> 1. rebase on v3.3_rc2, the only change is moving changes from kernel/sched.c
> to kernel/sched/core.c
>
> changelog v6..v5:
> 1. clear saved stack trace before printing a set of stacks. this ensures the printed
> stack traces are not omitted messages.
> 2. add log level in printing duplicate stack.
> 3. remove the show_stack() API change, and non-x86 arch won't need further change.
> 4. add more inline documentations.
>
> changelog v5..v4:
> 1. removed changes to Kconfig file
> 2. changed hashtable to keep only hash value and length of stack
> 3. simplified hashtable lookup
>
> changelog v4..v3:
> 1. improve de-duplication by eliminating garbage entries from stack traces.
> with this change 793/825 stack traces were recognized as duplicates. in v3
> only 482/839 were duplicates.
>
> changelog v3..v2:
> 1. again better documentation on the patch description.
> 2. make the stack_hash_table to be allocated at compile time.
> 3. have better name of variable index
> 4. move save_dup_stack_trace() in kernel/stacktrace.c
>
> changelog v2..v1:
> 1. better documentation on the patch description
> 2. move the spinlock inside the hash lockup, so reducing the holding time.
>
> Note:
> 1. with pid namespace, we might have same pid number for different processes. i
> wonder how the stack trace (w/o dedup) handles the case, it uses tsk->pid as well
> as far as I checked.
> 2. the core functionality is in x86-specific code, this could be moved out to
> support other architectures.
> 3. Andrew made the suggestion of doing appending to stack_hash_table[].
>
> Signed-off-by: Ying Han <yinghan@...gle.com>
> Signed-off-by: Greg Thelen <gthelen@...gle.com>
> ---
> arch/x86/include/asm/stacktrace.h | 11 +++-
> arch/x86/kernel/dumpstack.c | 24 ++++++-
> arch/x86/kernel/dumpstack_32.c | 7 +-
> arch/x86/kernel/dumpstack_64.c | 7 +-
> arch/x86/kernel/stacktrace.c | 123 +++++++++++++++++++++++++++++++++++++
> include/linux/sched.h | 3 +
> include/linux/stacktrace.h | 4 +
> kernel/sched/core.c | 32 +++++++++-
> kernel/stacktrace.c | 15 +++++
> 9 files changed, 211 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
> index 70bbe39..32557fe 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -81,13 +81,20 @@ stack_frame(struct task_struct *task, struct pt_regs *regs)
> }
> #endif
>
> +/*
> + * The parameter dup_stack_pid is used for task stack deduplication.
> + * The non-zero value of dup_stack_pid indicates the pid of the
> + * task with the same stack trace.
> + */
> extern void
> show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
> - unsigned long *stack, unsigned long bp, char *log_lvl);
> + unsigned long *stack, unsigned long bp, char *log_lvl,
> + pid_t dup_stack_pid);
>
> extern void
> show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
> - unsigned long *sp, unsigned long bp, char *log_lvl);
> + unsigned long *sp, unsigned long bp, char *log_lvl,
> + pid_t dup_stack_pid);
>
> extern unsigned int code_bytes;
>
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index 1aae78f..ade9fda 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -159,21 +159,37 @@ static const struct stacktrace_ops print_trace_ops = {
>
> void
> show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
> - unsigned long *stack, unsigned long bp, char *log_lvl)
> + unsigned long *stack, unsigned long bp, char *log_lvl,
> + pid_t dup_stack_pid)
> {
> printk("%sCall Trace:\n", log_lvl);
> - dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
> + if (dup_stack_pid)
> + printk("%s<Same stack as pid %d>", log_lvl, dup_stack_pid);
> + else
> + dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
> }
>
> void show_trace(struct task_struct *task, struct pt_regs *regs,
> unsigned long *stack, unsigned long bp)
> {
> - show_trace_log_lvl(task, regs, stack, bp, "");
> + show_trace_log_lvl(task, regs, stack, bp, "", 0);
> }
>
> void show_stack(struct task_struct *task, unsigned long *sp)
> {
> - show_stack_log_lvl(task, NULL, sp, 0, "");
> + show_stack_log_lvl(task, NULL, sp, 0, "", 0);
> +}
> +
> +/*
> + * Similar to show_stack except accepting the dup_stack_pid parameter.
> + * The parameter indicates whether or not the caller side tries to do
> + * a stack dedup, and the non-zero value indicates the pid of the
> + * task with the same stack trace.
> + */
> +void show_stack_dedup(struct task_struct *task, unsigned long *sp,
> + pid_t dup_stack_pid)
> +{
> + show_stack_log_lvl(task, NULL, sp, 0, "", dup_stack_pid);
> }
>
> /*
> diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
> index c99f9ed..b929c8d 100644
> --- a/arch/x86/kernel/dumpstack_32.c
> +++ b/arch/x86/kernel/dumpstack_32.c
> @@ -56,7 +56,8 @@ EXPORT_SYMBOL(dump_trace);
>
> void
> show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
> - unsigned long *sp, unsigned long bp, char *log_lvl)
> + unsigned long *sp, unsigned long bp, char *log_lvl,
> + pid_t dup_stack_pid)
> {
> unsigned long *stack;
> int i;
> @@ -78,7 +79,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
> touch_nmi_watchdog();
> }
> printk(KERN_CONT "\n");
> - show_trace_log_lvl(task, regs, sp, bp, log_lvl);
> + show_trace_log_lvl(task, regs, sp, bp, log_lvl, dup_stack_pid);
> }
>
>
> @@ -103,7 +104,7 @@ void show_registers(struct pt_regs *regs)
> u8 *ip;
>
> printk(KERN_EMERG "Stack:\n");
> - show_stack_log_lvl(NULL, regs, ®s->sp, 0, KERN_EMERG);
> + show_stack_log_lvl(NULL, regs, ®s->sp, 0, KERN_EMERG, 0);
>
> printk(KERN_EMERG "Code: ");
>
> diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
> index 6d728d9..cd56590 100644
> --- a/arch/x86/kernel/dumpstack_64.c
> +++ b/arch/x86/kernel/dumpstack_64.c
> @@ -198,7 +198,8 @@ EXPORT_SYMBOL(dump_trace);
>
> void
> show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
> - unsigned long *sp, unsigned long bp, char *log_lvl)
> + unsigned long *sp, unsigned long bp, char *log_lvl,
> + pid_t dup_stack_pid)
> {
> unsigned long *irq_stack_end;
> unsigned long *irq_stack;
> @@ -242,7 +243,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
> preempt_enable();
>
> printk(KERN_CONT "\n");
> - show_trace_log_lvl(task, regs, sp, bp, log_lvl);
> + show_trace_log_lvl(task, regs, sp, bp, log_lvl, dup_stack_pid);
> }
>
> void show_registers(struct pt_regs *regs)
> @@ -271,7 +272,7 @@ void show_registers(struct pt_regs *regs)
>
> printk(KERN_EMERG "Stack:\n");
> show_stack_log_lvl(NULL, regs, (unsigned long *)sp,
> - 0, KERN_EMERG);
> + 0, KERN_EMERG, 0);
>
> printk(KERN_EMERG "Code: ");
>
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index fdd0c64..6bee992 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -7,6 +7,7 @@
> #include <linux/stacktrace.h>
> #include <linux/module.h>
> #include <linux/uaccess.h>
> +#include <linux/jhash.h>
> #include <asm/stacktrace.h>
>
> static int save_stack_stack(void *data, char *name)
> @@ -81,6 +82,128 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
> }
> EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
>
> +/*
> + * The implementation of stack trace dedup.
> + *
> + * It tries to reduce the duplication of task stack trace in the dump by hashing
> + * the stack trace. Each time if an identical trace is found in the stack, we
> + * dump only the pid of previous task. So it is easy to back track to the full
> + * stack with the pid.
> + *
> + * Note this chould be moved out of x86-specific code for all architectures
> + * use.
> + */
> +
> +/*
> + * DEDUP_STACK_HASH: pre-allocated buffer size of the hashtable.
> + * DEDUP_STACK_ENTRIES: number of task stack entries in hashtable.
> + * DEDUP_HASH_MAX_ITERATIONS: in hashtable lookup, retry serveral entries if
> + * there is a collision.
> + */
> +#define DEDUP_STACK_HASH 32768
> +#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack))
> +#define DEDUP_HASH_MAX_ITERATIONS 10
> +
> +/*
> + * The data structure of each hashtable entry
> + */
> +struct task_stack {
> + /* the pid of the task of the stack trace */
> + pid_t pid;
> +
> + /* the length of the stack entries */
> + int len;
> +
> + /* the hash value of the stack trace*/
> + unsigned long hash;
> +};
> +
> +static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES];
> +static struct task_stack cur_stack;
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
> +
> +/*
> + * The stack hashtable uses linear probing to resolve collisions.
> + * We consider two stacks to be the same if their hash values and lengths
> + * are equal.
> + */
> +static unsigned int stack_trace_lookup(void)
> +{
> + int j;
> + int index;
> + unsigned int ret = 0;
> + struct task_stack *stack;
> +
> + index = cur_stack.hash % DEDUP_STACK_ENTRIES;
> +
> + for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) {
> + stack = stack_hash_table + (index + j) % DEDUP_STACK_ENTRIES;
> + if (stack->hash == 0) {
> + *stack = cur_stack;
> + ret = 0;
> + break;
> + } else {
> + if (stack->hash == cur_stack.hash &&
> + stack->len == cur_stack.len) {
> + ret = stack->pid;
> + break;
> + }
> + }
> + }
> + if (j == DEDUP_HASH_MAX_ITERATIONS)
> + stack_hash_table[index] = cur_stack;
> +
> + memset(&cur_stack, 0, sizeof(cur_stack));
> +
> + return ret;
> +}
> +
> +static int save_dup_stack_stack(void *data, char *name)
> +{
> + return 0;
> +}
> +
> +static void save_dup_stack_address(void *data, unsigned long addr, int reliable)
> +{
> + /*
> + * To improve de-duplication, we'll only record reliable entries
> + * in the stack trace.
> + */
> + if (!reliable)
> + return;
> + cur_stack.hash = jhash(&addr, sizeof(addr), cur_stack.hash);
> + cur_stack.len++;
> +}
> +
> +static const struct stacktrace_ops save_dup_stack_ops = {
> + .stack = save_dup_stack_stack,
> + .address = save_dup_stack_address,
> + .walk_stack = print_context_stack,
> +};
> +
> +/*
> + * Clear previously saved stack traces to ensure that later printed stacks do
> + * not reference previously printed stacks.
> + */
> +void clear_dup_stack_traces(void)
> +{
> + memset(stack_hash_table, 0, sizeof(stack_hash_table));
> +}
> +
> +unsigned int save_dup_stack_trace(struct task_struct *tsk)
> +{
> + unsigned int ret = 0;
> + unsigned int dummy = 0;
> +
> + spin_lock(&stack_hash_lock);
> + dump_trace(tsk, NULL, NULL, 0, &save_dup_stack_ops, &dummy);
> + cur_stack.pid = tsk->pid;
> + ret = stack_trace_lookup();
> + spin_unlock(&stack_hash_lock);
> +
> + return ret;
> +}
> +
> /* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
>
> struct stack_frame_user {
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2234985..0f8af97 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -299,6 +299,9 @@ extern void show_regs(struct pt_regs *);
> */
> extern void show_stack(struct task_struct *task, unsigned long *sp);
>
> +extern void show_stack_dedup(struct task_struct *task, unsigned long *sp,
> + pid_t dup_stack_pid);
> +
> void io_schedule(void);
> long io_schedule_timeout(long timeout);
>
> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
> index 115b570..c137416 100644
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -21,6 +21,8 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
>
> extern void print_stack_trace(struct stack_trace *trace, int spaces);
>
> +extern void clear_dup_stack_traces(void);
> +extern unsigned int save_dup_stack_trace(struct task_struct *tsk);
> #ifdef CONFIG_USER_STACKTRACE_SUPPORT
> extern void save_stack_trace_user(struct stack_trace *trace);
> #else
> @@ -32,6 +34,8 @@ extern void save_stack_trace_user(struct stack_trace *trace);
> # define save_stack_trace_tsk(tsk, trace) do { } while (0)
> # define save_stack_trace_user(trace) do { } while (0)
> # define print_stack_trace(trace, spaces) do { } while (0)
> +# define clear_dup_stack_traces() do { } while (0)
> +# define save_dup_stack_trace(tsk) do { } while (0)
> #endif
>
> #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index df00cb0..b2b9f7d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -71,6 +71,7 @@
> #include <linux/ftrace.h>
> #include <linux/slab.h>
> #include <linux/init_task.h>
> +#include <linux/stacktrace.h>
>
> #include <asm/tlb.h>
> #include <asm/irq_regs.h>
> @@ -4763,10 +4764,11 @@ out_unlock:
>
> static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
>
> -void sched_show_task(struct task_struct *p)
> +void _sched_show_task(struct task_struct *p, int dedup)
> {
> unsigned long free = 0;
> unsigned state;
> + pid_t dup_stack_pid = 0;
>
> state = p->state ? __ffs(p->state) + 1 : 0;
> printk(KERN_INFO "%-15.15s %c", p->comm,
> @@ -4789,13 +4791,37 @@ void sched_show_task(struct task_struct *p)
> task_pid_nr(p), task_pid_nr(rcu_dereference(p->real_parent)),
> (unsigned long)task_thread_info(p)->flags);
>
> - show_stack(p, NULL);
> + if (dedup) {
> + dup_stack_pid = save_dup_stack_trace(p);
> + show_stack_dedup(p, NULL, dup_stack_pid);
> + } else
> + show_stack(p, NULL);
> +}
> +
> +void sched_show_task(struct task_struct *p)
> +{
> + _sched_show_task(p, 0);
> +}
> +
> +/*
> + * Eliminate task stack trace duplication in multi-task stackdump.
> + * Note only x86-specific code now implements the feature.
> + */
> +void sched_show_task_dedup(struct task_struct *p)
> +{
> + _sched_show_task(p, 1);
> }
>
> void show_state_filter(unsigned long state_filter)
> {
> struct task_struct *g, *p;
>
> + /*
> + * Prevent below printed stack traces from referring to previously
> + * printed ones.
> + */
> + clear_dup_stack_traces();
> +
> #if BITS_PER_LONG == 32
> printk(KERN_INFO
> " task PC stack pid father\n");
> @@ -4811,7 +4837,7 @@ void show_state_filter(unsigned long state_filter)
> */
> touch_nmi_watchdog();
> if (!state_filter || (p->state & state_filter))
> - sched_show_task(p);
> + sched_show_task_dedup(p);
> } while_each_thread(g, p);
>
> touch_all_softlockup_watchdogs();
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index 00fe55c..85afece 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -41,3 +41,18 @@ save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
> {
> WARN_ONCE(1, KERN_INFO "save_stack_trace_regs() not implemented yet.\n");
> }
> +
> +/*
> + * Architectures that do not implement the task stack dedup will fallback to
> + * the default functionality.
> + */
> +__weak void
> +clear_dup_stack_traces(void)
> +{
> +}
> +
> +__weak unsigned int
> +save_dup_stack_trace(struct task_struct *tsk)
> +{
> + return 0;
> +}
> --
> 1.7.7.3
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists