lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <553B9EC0.9060001@huawei.com>
Date:	Sat, 25 Apr 2015 22:03:44 +0800
From:	Yunlong Song <yunlong.song@...wei.com>
To:	<a.p.zijlstra@...llo.nl>, <paulus@...ba.org>, <mingo@...hat.com>,
	"Arnaldo Carvalho de Melo" <acme@...nel.org>
CC:	<linux-kernel@...r.kernel.org>, <wangnan0@...wei.com>,
	<dsahern@...il.com>
Subject: Re: [Question] How does perf still record the stack of a specified
 pid even when that process is interrupted and CPU is scheduled to other process

On 2015/4/24 21:49, Yunlong Song wrote:
> On 2015/4/24 21:31, Yunlong Song wrote:
>> [Profiling Background]
>> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone,
>> we find a case that ext4 is better than f2fs in random write under the test of
>> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two
>> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of
>> the overall sys_write (262144/64=4096). We want to find out the call stack during
>> this specific 1% sys_write. Our idea is to record the stack in a certain time period
>> and since the specific 1% case takes up 60% time, the total number of records of its
>> stack should also takes up 60% of the total records, then we can recognize those stacks
>> and figure out what the I/O stack of f2fs is doing in the 1% case.
>>
>> [Profiling Problem]
>>
>> Although perf can record the events (with call stack) of a specified pid, e.g. using
>> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted
>> and the CPU is scheduled to other process. As a result, perf will not record any events
>> of iozone until iozone's context is recovered and the CPU is scheduled to continue
>> processing the sys_write of iozone. 
> 
>> This obeys our initial idea which is described in [Profiling Background],
> 
> This "disobeys" our initial idea which is described in [Profiling Background],
> 
>  since we cannot recognize the call stacks of the specific 1% case
>> by using the ratio of the record number.
>>
>> [Alternative Solution without Perf]
>> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone,
>> no matter whether iozone is interrupted or not. However, we have not taken a deep sight
>> into this, since we want to use perf to do this kind of thing.
>>
>> [Question about Perf]
>> So we have a question that "How does perf still record the stack of a specified pid even
>> when that process is interrupted and CPU is scheduled to other process?"
>>
> 
> 

Here is another example:

$ cat test.c
void func_loop(void)
{
        int i, sum = 0;
        for (i = 0; i < 100000; i++)
                sum += 1;
}

void func_sleep(void)
{
        sleep(1);
}

int main(void)
{
        while (1) {
                func_loop();
                func_sleep();
        }
        return 0;
}

$ gcc test.c -o test -O0
$ ./test
$ ps aux | grep test
Yunlong+ 133798  0.0  0.0   4056   348 pts/7    S+   21:52   0:00 ./test
$ perf record -g -p 133798 sleep 5
$ perf report -g --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
# Samples: 16  of event 'cycles'
# Event count (approx.): 2439331
#
# Children      Self  Command  Shared Object      Symbol
# ........  ........  .......  .................  ...............................
#
    79.46%    76.48%  test     test               [.] func_loop
               |
               ---func_loop
                  |
                  |--97.07%-- main
                  |          __libc_start_main
                   --2.93%-- [...]

    77.14%     0.00%  test     libc-2.19.so       [.] __libc_start_main
               |
               ---__libc_start_main

    77.14%     0.00%  test     test               [.] main
               |
               ---main
                  __libc_start_main

    20.54%     0.00%  test     [unknown]          [.] 0000000000000000
               |
               ---0

    20.54%     0.00%  test     libc-2.19.so       [.] __GI___libc_nanosleep
               |
               ---__GI___libc_nanosleep
                  0

    20.54%     0.00%  test     [unknown]          [k] 0x00007f2fa07ef6b0
               |
               ---0x7f2fa07ef6b0
                  __GI___libc_nanosleep
                  0

    20.54%     0.00%  test     [kernel.kallsyms]  [k] system_call_fastpath
               |
               ---system_call_fastpath
                  0x7f2fa07ef6b0
                  __GI___libc_nanosleep

As shown above, the func_loop is regarded as the main operation and its call stack
is showed on the top of perf report. However, we know that func_sleep is really the
main operation from the view of time duration indeed. And if we want to its actual
call stack in the real time, we can see the result as following:

$ while true ; do cat /proc/133798/stack  ; sleep 0.1 ; done
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff8107aed3>] hrtimer_nanosleep+0xa3/0x160
[<ffffffff8107afed>] SyS_nanosleep+0x5d/0x80
[<ffffffff81519329>] system_call_fastpath+0x16/0x1b
[<00007f2fa07ef6b0>] 0x7f2fa07ef6b0
[<ffffffffffffffff>] 0xffffffffffffffff

As shown above, func_sleep is regarded as the main operation from the view of time and
its call stack is printed in real time.

-- 
Thanks,
Yunlong Song

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ