linux-kernel - Re: [RFC 1/2] kernel patch for dump user space stack tool

Message-ID: <4F8FACFF.9070107@gmail.com>
Date:	Thu, 19 Apr 2012 14:13:19 +0800
From:	Cong Wang <xiyou.wangcong@...il.com>
To:	Yanmin Zhang <yanmin_zhang@...ux.intel.com>
CC:	"Tu, Xiaobing" <xiaobing.tu@...el.com>,
	Lin Ming <mlin@...pku.edu.cn>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"mingo@...e.hu" <mingo@...e.hu>,
	"rusty@...tcorp.com.au" <rusty@...tcorp.com.au>,
	"a.p.zijlstra@...llo.nl" <a.p.zijlstra@...llo.nl>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"rostedt@...dmis.org" <rostedt@...dmis.org>,
	"Zuo, Jiao" <jiao.zuo@...el.com>
Subject: Re: [RFC 1/2] kernel patch for dump user space stack tool

On 04/19/2012 01:17 PM, Yanmin Zhang wrote:
> On Thu, 2012-04-19 at 11:50 +0800, Cong Wang wrote:
>> On 04/17/2012 10:37 PM, Tu, Xiaobing wrote:
>>> Resending the patch because the log was too long on a single line.
>>>
>>> From: xiaobing tu<xiaobing.tu@...el.com>
>>>
>>> Here is the kernel patch for this tool. The idea is to output the user space stack call chain from
>>> /proc/xxx/stack. Currently, /proc/xxx/stack only outputs the kernel stack call chain. We extend
>>> it to output the user space call chain in hex format.
>>>
>>
>> Can you teach me why we still need this as we have pstack?
> Cong,
>
> Sorry for replying so late. Xiaobing told me you sent him email and I
> didn't receive the 1st one you sent out.


Judging by the length of your reply compared with the patch description,
you left a lot of information out of the patch description.

>
> I tried pstack and it does work. It means developers in the world wanted
> the tool long long ago.
>
> Although not checking the source codes of pstack (sorry, I'm busy in debugging
> many critical issues), I think pstack is based on ptrace interface, which means:
> 1) It needs to trap into the kernel many times to collect the call frames of one
> task.
> 2) It needs to send a signal to the ptraced process to stop it. Such behavior
> might have some impact if the ptraced process also processes many signals.
> 3) The data parsing to get symbols might not be split from data collection.
> I mean, it collects call frames of one process, then parses it; then collects the 2nd
> task's. If there are many processes, it couldn't collect the data just at the monitor
> time point.


Yet another one who wants to "fix" ptrace. ;-)

>
> Why do we work out the tools? The original requirement is from real work.
> We are enabling Android on Medfield. One typical error of Android is ANR.
> When a process couldn't respond in 5 seconds, Android reports an ANR error,
> and dumps the Java call stack. However, it can't dump the stacks of userspace libraries (such as
> bionic, written in C or C++). In addition, Android just dumps the stack of
> the non-responding process. It doesn't dump stack of others. As binder is basic
> framework in Android, processes communicate by binder in the model of client/server.
> When one process is not responding quickly, maybe another process blocks it. We
> need dump that process status.
>
> Many teams complained it's hard to debug such ANR issues, especially the ones which
> are triggered at MTBF testing. Sometimes, an ANR happens after MTBF testing runs
> for one week. Developers ask us to implement such tool over and over again.
>
> Besides ANR, sometimes, system might not respond to any user operation. Usually,
> kernel or firmware would reset system. At that time, we also need get the call
> chains of all the user space processes before system is reset.


I am not familiar with Android at all, so a quick question: if this is
only for Android, why introduce it for everyone? IOW, why not put it
behind a Kconfig option?

BTW, I am sure you need to put the above paragraphs into your patch 
description, to make it clear why the patch is needed.

>
> With our tool,
> 1) We could collect the HEX-format call chain data and /proc/XXX/maps
> of all the processes quickly, then parse them either after rebooting, or
> after the issue is reported. It could catch the scene just at the time point
> when the error happens. Our experiments show the tool could collect the data
> of all processes within 200ms.
> 2) The new tool won't stop the processes and has less impact on them.
> Considering a scenario of performance bottleneck investigation, statistics collection
> shouldn't have big impact on running processes.
> 3) It could support both i386 and x86-64. I tried pstack and it doesn't work
> with x86-64.
> 4) It follows /proc/XXX/stack interface and it's easy to use it.
>
> Besides this tool, we are considering to extend it to collect user space
> call chain of current process from kernel when kernel detects some other
> abnormal behavior.
>
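The "collect quickly, parse later" workflow described above can be sketched roughly as follows. This is only an illustration, not the tool from the patch: the function names `snapshot`, `parse_maps` and `resolve` are hypothetical, and the sketch assumes the user space call chain can be reduced to plain integer addresses. The patch's actual hex output format is not shown in this thread. Collection only copies raw text out of /proc; mapping an address to a file and offset happens afterwards, against the saved maps text.

```python
import os
from bisect import bisect_right


def snapshot(proc_root="/proc", files=("stack", "maps")):
    """Copy the raw text of the selected files for every PID; no parsing here."""
    snap = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        entry = {}
        for fname in files:
            try:
                with open(os.path.join(proc_root, name, fname)) as fh:
                    entry[fname] = fh.read()
            except OSError:
                # The task exited or we lack permission; keep what we can.
                entry[fname] = None
        snap[int(name)] = entry
    return snap


def parse_maps(maps_text):
    """Return sorted (start, end, file_offset, path) tuples from maps text."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, int(fields[2], 16), path))
    regions.sort()
    return regions


def resolve(regions, addr):
    """Map an address to (path, offset into the file), or None if unmapped."""
    starts = [r[0] for r in regions]
    i = bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, end, file_off, path = regions[i]
    if addr >= end:
        return None  # falls in a gap after the nearest lower mapping
    return path, addr - start + file_off
```

A later offline pass would call `parse_maps(snap[pid]["maps"])` once per process and then `resolve(regions, addr)` for each collected address; turning the resulting (path, offset) pair into a symbol name would need a separate tool and is left out here.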

In my previous reply, I ran 'pstack' on my x86-64 machine, so I don't
understand why you said it doesn't work with x86-64. I guess pstack
supports more than just x86, as ptrace is available on other architectures too.

Thanks.