Message-ID: <4F8FACFF.9070107@gmail.com>
Date: Thu, 19 Apr 2012 14:13:19 +0800
From: Cong Wang <xiyou.wangcong@...il.com>
To: Yanmin Zhang <yanmin_zhang@...ux.intel.com>
CC: "Tu, Xiaobing" <xiaobing.tu@...el.com>,
Lin Ming <mlin@...pku.edu.cn>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"mingo@...e.hu" <mingo@...e.hu>,
"rusty@...tcorp.com.au" <rusty@...tcorp.com.au>,
"a.p.zijlstra@...llo.nl" <a.p.zijlstra@...llo.nl>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>,
"Zuo, Jiao" <jiao.zuo@...el.com>
Subject: Re: [RFC 1/2] kernel patch for dump user space stack tool
On 04/19/2012 01:17 PM, Yanmin Zhang wrote:
> On Thu, 2012-04-19 at 11:50 +0800, Cong Wang wrote:
>> On 04/17/2012 10:37 PM, Tu, Xiaobing wrote:
>>> Resending the patch because the log was too long on a single line.
>>>
>>> From: xiaobing tu<xiaobing.tu@...el.com>
>>>
>>> Here is the kernel patch for this tool. The idea is to output the user space stack call chain from
>>> /proc/xxx/stack. Currently, /proc/xxx/stack only outputs the kernel stack call chain. We extend
>>> it to output the user space call chain in hex format.
>>>
>>
>> Can you teach me why we still need this as we have pstack?
> Cong,
>
> Sorry for replying so late. Xiaobing told me you sent him email and I
> didn't receive the 1st one you sent out.
Judging by the length of your reply compared with the description of the
patch, a lot of information is missing from your patch description.
>
> I tried pstack and it does work. It means developers in the world wanted
> the tool long long ago.
>
> Although I have not checked the source code of pstack (sorry, I'm busy debugging
> many critical issues), I think pstack is based on the ptrace interface, which means:
> 1) It needs to trap into the system many times to collect the call frames of one
> task.
> 2) It needs to send a signal to the ptraced process to stop it. Such behavior
> might have some impact if the ptraced process also processes many signals.
> 3) The data parsing to get symbols might not be split from data collection.
> I mean, it collects the call frames of one process, then parses them; then collects the 2nd
> task's. If there are many processes, it can't collect all the data at the monitored
> point in time.
Yet another one who wants to "fix" ptrace. ;-)
>
> Why did we develop the tool? The original requirement comes from real work.
> We are enabling Android on Medfield. One typical error on Android is ANR.
> When a process can't respond within 5 seconds, Android reports an ANR error
> and dumps the Java call stack. However, it can't dump the userspace libs (such as
> bionic, written in C or C++). In addition, Android only dumps the stack of
> the non-responding process. It doesn't dump the stacks of others. As binder is the basic
> framework in Android, processes communicate through binder in a client/server model.
> When one process is not responding quickly, maybe another process is blocking it. We
> need to dump that process's status.
>
> Many teams complained that it's hard to debug such ANR issues, especially the ones which
> are triggered during MTBF testing. Sometimes, an ANR happens after MTBF testing has run
> for one week. Developers ask us to implement such a tool over and over again.
>
> Besides ANR, sometimes the system might not respond to any user operation. Usually,
> the kernel or firmware would reset the system. At that time, we also need to get the call
> chains of all the user space processes before the system is reset.
I am not familiar with Android at all, so a quick question: if this is
only for Android, why do you introduce it for everyone? IOW, why not put
it behind a Kconfig option?
BTW, I am sure you need to put the above paragraphs into your patch
description, to make it clear why the patch is needed.
>
> With our tool,
> 1) We can collect the hex-format call chain data and /proc/XXX/maps
> of all the processes quickly, then parse them either after rebooting or
> after the issue is reported. It can capture the scene right at the point in time
> when the error happens. Our experiments show the tool can collect the data
> of all processes within 200ms.
> 2) The new tool won't stop the processes and has less impact on them.
> Considering a scenario of performance bottleneck investigation, statistics collection
> shouldn't have a big impact on running processes.
> 3) It supports both i386 and x86-64. I tried pstack and it doesn't work
> on x86-64.
> 4) It follows the /proc/XXX/stack interface and is easy to use.
>
> Besides this tool, we are considering extending it to collect the user space
> call chain of the current process from the kernel when the kernel detects some other
> abnormal behavior.
>
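For reference, the offline parsing step described above (hex call chain
plus /proc/XXX/maps) might look roughly like the sketch below. This is
only an illustration: the output format of the proposed user space part
of /proc/XXX/stack is not shown in this thread, so the sketch assumes one
hex address per line, and the names parse_maps, parse_chain and resolve
are hypothetical. Only the /proc/XXX/maps line layout is the standard one.

```python
import bisect
from typing import List, NamedTuple, Optional, Tuple


class Mapping(NamedTuple):
    start: int   # first address of the mapping
    end: int     # first address past the mapping (exclusive)
    perms: str
    offset: int  # offset into the backing file
    path: str    # empty string for anonymous mappings


def parse_maps(text: str) -> List[Mapping]:
    """Parse the text of /proc/<pid>/maps into a list sorted by start."""
    mappings = []
    for line in text.splitlines():
        # Layout: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16),
                                fields[1], int(fields[2], 16), path))
    mappings.sort(key=lambda m: m.start)
    return mappings


def parse_chain(text: str) -> List[int]:
    """ASSUMPTION: one hex address per line; other lines are skipped."""
    addrs = []
    for line in text.splitlines():
        token = line.strip()
        if not token:
            continue
        try:
            addrs.append(int(token, 16))
        except ValueError:
            continue
    return addrs


def resolve(addr: int, mappings: List[Mapping]) -> Optional[Tuple[str, int]]:
    """Return (path, file offset) for addr, or None if it is unmapped."""
    starts = [m.start for m in mappings]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    m = mappings[i]
    if addr >= m.end:
        return None
    return (m.path or "[anon]", addr - m.start + m.offset)
```

The (path, offset) pairs could then be handed to a symbolizer after the
fact, which is what keeps data collection separate from symbol lookup.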
In my previous reply, I ran 'pstack' on my x86-64 machine, so I don't
understand why you say it doesn't work on x86-64. I guess pstack
supports more than just x86, as ptrace is available on other architectures too.
Thanks.