Date:	Wed, 17 Aug 2016 12:48:54 -0400
From:	Robert Foss <robert.foss@...labora.com>
To:	Michal Hocko <mhocko@...nel.org>, Jann Horn <jann@...jh.net>,
	sonnyrao@...omium.org
Cc:	corbet@....net, akpm@...ux-foundation.org, vbabka@...e.cz,
	koct9i@...il.com, hughd@...gle.com, n-horiguchi@...jp.nec.com,
	minchan@...nel.org, john.stultz@...aro.org,
	ross.zwisler@...ux.intel.com, jmarchan@...hat.com,
	hannes@...xchg.org, keescook@...omium.org, viro@...iv.linux.org.uk,
	gorcunov@...nvz.org, plaguedbypenguins@...il.com,
	rientjes@...gle.com, eric.engestrom@...tec.com, jdanis@...gle.com,
	calvinowens@...com, adobriyan@...il.com,
	kirill.shutemov@...ux.intel.com, ldufour@...ux.vnet.ibm.com,
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
	Ben Zhang <benzh@...omium.org>,
	Bryan Freed <bfreed@...omium.org>,
	Filipe Brandenburger <filbranden@...omium.org>,
	Mateusz Guzik <mguzik@...hat.com>
Subject: Re: [PATCH v2 0/3] Implement /proc/<pid>/totmaps



On 2016-08-17 09:03 AM, Michal Hocko wrote:
> On Wed 17-08-16 11:31:25, Jann Horn wrote:
>> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
>>> On Tue 16-08-16 12:46:51, Robert Foss wrote:
>>> [...]
>>>> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
>>>> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
>>>> /proc/5025/smaps }"
>>>> [...]
>>>> 	Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
>>>> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
>>>> }"
>>>> 	User time (seconds): 0.37
>>>> 	System time (seconds): 0.45
>>>> 	Percent of CPU this job got: 92%
>>>> 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
>>>
>>> This is really unexpected. Where is the user time spent? Anyway, rather
>>> than measuring some random processes I've tried to measure something
>>> resembling the worst case. So I've created a simple program to mmap as
>>> much as possible:
>>>
>>> #include <sys/mman.h>
>>> #include <sys/types.h>
>>> #include <unistd.h>
>>> #include <stdio.h>
>>> int main()
>>> {
>>> 	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>>> 		;
>>>
>>> 	printf("pid:%d\n", getpid());
>>> 	pause();
>>> 	return 0;
>>> }
>>
>> Ah, nice, that's a reasonable test program. :)
>>
>>
>>> So with a reasonable user space the parsing is really not all that time
>>> consuming wrt. smaps handling. That being said, I am still very skeptical
>>> about a dedicated proc file which accomplishes what userspace can do
>>> in a trivial way.
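
[ For illustration only -- this is not code from this thread: the kind of
trivial userspace aggregation referred to above can also be a few lines of
plain C instead of an awk one-liner, assuming the usual "Rss:"/"Pss:" line
format of smaps: ]

/* sum_smaps.c - illustrative only: sum Rss and Pss from a smaps file */
#include <stdio.h>

int main(int argc, char **argv)
{
	char line[256];
	unsigned long long rss = 0, pss = 0, val;
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /proc/<pid>/smaps\n", argv[0]);
		return 1;
	}

	f = fopen(argv[1], "r");
	if (!f) {
		perror("fopen");
		return 1;
	}

	/* every mapping contributes one "Rss: <n> kB" and one "Pss: <n> kB" line */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Rss: %llu", &val) == 1)
			rss += val;
		else if (sscanf(line, "Pss: %llu", &val) == 1)
			pss += val;
	}
	fclose(f);

	printf("rss:%llu pss:%llu\n", rss, pss);
	return 0;
}

$ gcc -Wall -o sum_smaps sum_smaps.c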
>>
>> Now, since your numbers showed that all the time is spent in the kernel,
>> also create this test program to just read that file over and over again:
>>
>> $ cat justreadloop.c
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sched.h>
>> #include <unistd.h>
>> #include <err.h>
>> #include <stdio.h>
>>
>> char buf[1000000];
>>
>> int main(int argc, char **argv) {
>>   printf("pid:%d\n", getpid());
>>   while (1) {
>>     int fd = open(argv[1], O_RDONLY);
>>     if (fd < 0) continue;
>>     if (read(fd, buf, sizeof(buf)) < 0)
>>       err(1, "read");
>>     close(fd);
>>   }
>> }
>> $ gcc -Wall -o justreadloop justreadloop.c
>> $
>>
>> Now launch your test:
>>
>> $ ./mapstuff
>> pid:29397
>>
>> point justreadloop at it:
>>
>> $ ./justreadloop /proc/29397/smaps
>> pid:32567
>>
>> ... and then check the performance stats of justreadloop:
>>
>> # perf top -p 32567
>>
>> This is what I see:
>>
>> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
>> Overhead  Shared Object     Symbol
>>   30,43%  [kernel]          [k] format_decode
>>    9,12%  [kernel]          [k] number
>>    7,66%  [kernel]          [k] vsnprintf
>>    7,06%  [kernel]          [k] __lock_acquire
>>    3,23%  [kernel]          [k] lock_release
>>    2,85%  [kernel]          [k] debug_lockdep_rcu_enabled
>>    2,25%  [kernel]          [k] skip_atoi
>>    2,13%  [kernel]          [k] lock_acquire
>>    2,05%  [kernel]          [k] show_smap
>
> This is a lot! I would expect the rmap walk to consume more, but it doesn't
> even show up among the top consumers.
>
>> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> time spent on evaluating format strings. The new interface
>> wouldn't have to spend that much time on format strings because there
>> isn't so much text to format.
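
[ For illustration only -- a rough userspace analogue, not the kernel code:
pushing each value through a printf-style format string costs noticeably
more than emitting the literal text plus a hand-converted decimal, which
is the direction the seq_puts/seq_put_decimal_ull rework below takes.
Exact numbers will of course depend on the libc and compiler: ]

/* fmt_cost.c - illustrative microbenchmark, not kernel code: compare
 * snprintf() of a "Name: value kB" line with a plain copy plus manual
 * decimal conversion. */
#include <stdio.h>
#include <string.h>
#include <time.h>

char out[64];

static void emit_printf(unsigned long v)
{
	snprintf(out, sizeof(out), "Rss:            %8lu kB\n", v);
}

static void emit_manual(unsigned long v)
{
	char tmp[24];
	int i = sizeof(tmp) - 1;

	/* convert the value to decimal by hand, least significant digit first */
	tmp[i] = '\0';
	do {
		tmp[--i] = '0' + (v % 10);
		v /= 10;
	} while (v);

	strcpy(out, "Rss:            ");
	strcat(out, &tmp[i]);
	strcat(out, " kB\n");
}

static double bench(void (*fn)(unsigned long))
{
	struct timespec a, b;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < 10000000UL; i++)
		fn(i);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	printf("snprintf: %.2fs\n", bench(emit_printf));
	printf("manual:   %.2fs\n", bench(emit_manual));
	return 0;
}

$ gcc -Wall -O2 -o fmt_cost fmt_cost.c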
>
> well, this is true of course, but I would much rather try to reduce the
> overhead of the smaps file than add a new file. The following should help
> already. I've measured a ~7% cut in system time. I guess there is still some
> room for improvement, but I have to say I'm far from being convinced about
> a new proc file just because we suck at dumping information to
> userspace. If this were something like /proc/<pid>/stat, which is
> essentially read all the time, then it would be a different question, but
> are the rss and pss going to be read all that often? If yes, why? These are
> the questions which should be answered before we even start considering the
> implementation.

@Sonny Rao: Maybe you can comment on how often this information is needed,
for how many processes, and for which reasons it is useful.

> ---
> From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@...e.com>
> Date: Wed, 17 Aug 2016 14:00:13 +0200
> Subject: [PATCH] proc, smaps: reduce printing overhead
>
> seq_printf (used by show_smap) can be pretty expensive when dumping a
> lot of numbers.  Say we would like to get Rss and Pss from a particular
> process.  In order to measure a pathological case let's generate as many
> mappings as possible:
>
> $ cat max_mmap.c
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdio.h>
>
> int main()
> {
> 	while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
> 		;
>
> 	printf("pid:%d\n", getpid());
> 	pause();
> 	return 0;
> }
>
> $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
>
> would do the trick. The whole runtime is spent in kernel space, which is not
> that unexpected because smaps is not the cheapest one (we have to
> do the rmap walk etc.).
>
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
>         User time (seconds): 0.01
>         System time (seconds): 0.44
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47
>
> But the perf says:
>     22.55%  awk      [kernel.kallsyms]  [k] format_decode
>     14.65%  awk      [kernel.kallsyms]  [k] vsnprintf
>      6.40%  awk      [kernel.kallsyms]  [k] number
>      2.53%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      2.53%  awk      [kernel.kallsyms]  [k] show_smap
>      1.81%  awk      [kernel.kallsyms]  [k] lock_acquire
>
> we are spending most of the time actually generating the output, which is
> quite lame. Let's replace seq_printf with seq_puts and seq_put_decimal_ull.
> This gives us:
>         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
>         User time (seconds): 0.00
>         System time (seconds): 0.41
>         Percent of CPU this job got: 99%
>         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42
>
> which is a ~7% improvement. Perf says:
>     28.87%  awk      [kernel.kallsyms]  [k] seq_puts
>      5.30%  awk      [kernel.kallsyms]  [k] vsnprintf
>      4.54%  awk      [kernel.kallsyms]  [k] format_decode
>      3.73%  awk      [kernel.kallsyms]  [k] show_smap
>      2.56%  awk      [kernel.kallsyms]  [k] shmem_mapping
>      1.92%  awk      [kernel.kallsyms]  [k] number
>      1.80%  awk      [kernel.kallsyms]  [k] lock_acquire
>      1.75%  awk      [kernel.kallsyms]  [k] print_name_value_kb
>
> Reported-by: Jann Horn <jann@...jh.net>
> Signed-off-by: Michal Hocko <mhocko@...e.com>
> ---
>  fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
>  1 file changed, 25 insertions(+), 38 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 187d84ef9de9..41c24c0811da 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
>  {
>  }
>
> +static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
> +{
> +	seq_puts(m, name);
> +	seq_put_decimal_ull(m, 0, val);
> +	seq_puts(m, " kB\n");
> +}
> +
>  static int show_smap(struct seq_file *m, void *v, int is_pid)
>  {
>  	struct vm_area_struct *vma = v;
> @@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>
>  	show_map_vma(m, vma, is_pid);
>
> -	seq_printf(m,
> -		   "Size:           %8lu kB\n"
> -		   "Rss:            %8lu kB\n"
> -		   "Pss:            %8lu kB\n"
> -		   "Shared_Clean:   %8lu kB\n"
> -		   "Shared_Dirty:   %8lu kB\n"
> -		   "Private_Clean:  %8lu kB\n"
> -		   "Private_Dirty:  %8lu kB\n"
> -		   "Referenced:     %8lu kB\n"
> -		   "Anonymous:      %8lu kB\n"
> -		   "AnonHugePages:  %8lu kB\n"
> -		   "ShmemPmdMapped: %8lu kB\n"
> -		   "Shared_Hugetlb: %8lu kB\n"
> -		   "Private_Hugetlb: %7lu kB\n"
> -		   "Swap:           %8lu kB\n"
> -		   "SwapPss:        %8lu kB\n"
> -		   "KernelPageSize: %8lu kB\n"
> -		   "MMUPageSize:    %8lu kB\n"
> -		   "Locked:         %8lu kB\n",
> -		   (vma->vm_end - vma->vm_start) >> 10,
> -		   mss.resident >> 10,
> -		   (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
> -		   mss.shared_clean  >> 10,
> -		   mss.shared_dirty  >> 10,
> -		   mss.private_clean >> 10,
> -		   mss.private_dirty >> 10,
> -		   mss.referenced >> 10,
> -		   mss.anonymous >> 10,
> -		   mss.anonymous_thp >> 10,
> -		   mss.shmem_thp >> 10,
> -		   mss.shared_hugetlb >> 10,
> -		   mss.private_hugetlb >> 10,
> -		   mss.swap >> 10,
> -		   (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
> -		   vma_kernel_pagesize(vma) >> 10,
> -		   vma_mmu_pagesize(vma) >> 10,
> -		   (vma->vm_flags & VM_LOCKED) ?
> +	print_name_value_kb(m, "Size:           ", (vma->vm_end - vma->vm_start) >> 10);
> +	print_name_value_kb(m, "Rss:            ", mss.resident >> 10);
> +	print_name_value_kb(m, "Pss:            ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
> +	print_name_value_kb(m, "Shared_Clean:   ", mss.shared_clean  >> 10);
> +	print_name_value_kb(m, "Shared_Dirty:   ", mss.shared_dirty  >> 10);
> +	print_name_value_kb(m, "Private_Clean:  ", mss.private_clean >> 10);
> +	print_name_value_kb(m, "Private_Dirty:  ", mss.private_dirty >> 10);
> +	print_name_value_kb(m, "Referenced:     ", mss.referenced >> 10);
> +	print_name_value_kb(m, "Anonymous:      ", mss.anonymous >> 10);
> +	print_name_value_kb(m, "AnonHugePages:  ", mss.anonymous_thp >> 10);
> +	print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
> +	print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
> +	print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
> +	print_name_value_kb(m, "Swap:           ", mss.swap >> 10);
> +	print_name_value_kb(m, "SwapPss:        ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
> +	print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
> +	print_name_value_kb(m, "MMUPageSize:    ", vma_mmu_pagesize(vma) >> 10);
> +	print_name_value_kb(m, "Locked:         ", (vma->vm_flags & VM_LOCKED) ?
>  			(unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
> -
>  	arch_show_smap(m, vma);
>  	show_smap_vma_flags(m, vma);
>  	m_cache_vma(m, vma);
>
