Message-ID: <1ad88988-2a15-753a-bd06-062a788d3b77@collabora.com>
Date: Wed, 17 Aug 2016 12:48:54 -0400
From: Robert Foss <robert.foss@...labora.com>
To: Michal Hocko <mhocko@...nel.org>, Jann Horn <jann@...jh.net>,
sonnyrao@...omium.org
Cc: corbet@....net, akpm@...ux-foundation.org, vbabka@...e.cz,
koct9i@...il.com, hughd@...gle.com, n-horiguchi@...jp.nec.com,
minchan@...nel.org, john.stultz@...aro.org,
ross.zwisler@...ux.intel.com, jmarchan@...hat.com,
hannes@...xchg.org, keescook@...omium.org, viro@...iv.linux.org.uk,
gorcunov@...nvz.org, plaguedbypenguins@...il.com,
rientjes@...gle.com, eric.engestrom@...tec.com, jdanis@...gle.com,
calvinowens@...com, adobriyan@...il.com,
kirill.shutemov@...ux.intel.com, ldufour@...ux.vnet.ibm.com,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
Ben Zhang <benzh@...omium.org>,
Bryan Freed <bfreed@...omium.org>,
Filipe Brandenburger <filbranden@...omium.org>,
Mateusz Guzik <mguzik@...hat.com>
Subject: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps
On 2016-08-17 09:03 AM, Michal Hocko wrote:
> On Wed 17-08-16 11:31:25, Jann Horn wrote:
>> On Wed, Aug 17, 2016 at 10:22:00AM +0200, Michal Hocko wrote:
>>> On Tue 16-08-16 12:46:51, Robert Foss wrote:
>>> [...]
>>>> $ /usr/bin/time -v -p zsh -c "repeat 25 { awk '/^Rss/{rss+=\$2}
>>>> /^Pss/{pss+=\$2} END {printf \"rss:%d pss:%d\n\", rss, pss}\'
>>>> /proc/5025/smaps }"
>>>> [...]
>>>> Command being timed: "zsh -c repeat 25 { awk '/^Rss/{rss+=$2}
>>>> /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}\' /proc/5025/smaps
>>>> }"
>>>> User time (seconds): 0.37
>>>> System time (seconds): 0.45
>>>> Percent of CPU this job got: 92%
>>>> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.89
>>>
>>> This is really unexpected. Where is the user time spent? Anyway, rather
>>> than measuring some random processes I've tried to measure something
>>> resembling the worst case. So I've created a simple program to mmap as
>>> much as possible:
>>>
>>> #include <sys/mman.h>
>>> #include <sys/types.h>
>>> #include <unistd.h>
>>> #include <stdio.h>
>>> int main()
>>> {
>>>         while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>>>                 ;
>>>
>>>         printf("pid:%d\n", getpid());
>>>         pause();
>>>         return 0;
>>> }
>>
>> Ah, nice, that's a reasonable test program. :)
>>
>>
>>> So with a reasonable user space the parsing is really not all that time
>>> consuming wrt. smaps handling. That being said I am still very skeptical
>>> about a dedicated proc file which accomplishes what userspace can do
>>> in a trivial way.
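
For reference, a minimal C sketch of that userspace summation: it walks
/proc/<pid>/smaps and adds up the Rss: and Pss: lines, the same thing the
awk one-liner above does. The file name and the (minimal) error handling
are only illustrative.

/* smaps_sum.c: sum Rss and Pss for one pid, e.g. ./smaps_sum 5025 */
#include <stdio.h>

int main(int argc, char **argv)
{
        char path[64], line[256];
        unsigned long long rss = 0, pss = 0, val;
        FILE *f;

        if (argc < 2)
                return 1;
        snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
        f = fopen(path, "r");
        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                /* "Pss:" does not match "SwapPss:", just like awk's /^Pss/ */
                if (sscanf(line, "Rss: %llu kB", &val) == 1)
                        rss += val;
                else if (sscanf(line, "Pss: %llu kB", &val) == 1)
                        pss += val;
        }
        fclose(f);
        printf("rss:%llu pss:%llu\n", rss, pss);
        return 0;
}
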
>>
>> Now, since your numbers showed that all the time is spent in the kernel,
>> also create this test program to just read that file over and over again:
>>
>> $ cat justreadloop.c
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <sched.h>
>> #include <unistd.h>
>> #include <err.h>
>> #include <stdio.h>
>>
>> char buf[1000000];
>>
>> int main(int argc, char **argv) {
>> printf("pid:%d\n", getpid());
>> while (1) {
>> int fd = open(argv[1], O_RDONLY);
>> if (fd < 0) continue;
>> if (read(fd, buf, sizeof(buf)) < 0)
>> err(1, "read");
>> close(fd);
>> }
>> }
>> $ gcc -Wall -o justreadloop justreadloop.c
>> $
>>
>> Now launch your test:
>>
>> $ ./mapstuff
>> pid:29397
>>
>> point justreadloop at it:
>>
>> $ ./justreadloop /proc/29397/smaps
>> pid:32567
>>
>> ... and then check the performance stats of justreadloop:
>>
>> # perf top -p 32567
>>
>> This is what I see:
>>
>> Samples: 232K of event 'cycles:ppp', Event count (approx.): 60448424325
>> Overhead Shared Object Symbol
>> 30,43% [kernel] [k] format_decode
>> 9,12% [kernel] [k] number
>> 7,66% [kernel] [k] vsnprintf
>> 7,06% [kernel] [k] __lock_acquire
>> 3,23% [kernel] [k] lock_release
>> 2,85% [kernel] [k] debug_lockdep_rcu_enabled
>> 2,25% [kernel] [k] skip_atoi
>> 2,13% [kernel] [k] lock_acquire
>> 2,05% [kernel] [k] show_smap
>
> This is a lot! I would expect the rmap walk to consume more but it
> doesn't even show up in the top consumers.
>
>> That's at least 30.43% + 9.12% + 7.66% = 47.21% of the task's kernel
>> time spent on evaluating format strings. The new interface
>> wouldn't have to spend that much time on format strings because there
>> isn't so much text to format.
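
As a rough userspace analogy for where that formatting time goes (this is
not the kernel code path, and the put_kb() helper below is made up for the
sketch), one can compare snprintf() with a "%8lu kB" format string against
emitting the digits by hand. That is roughly the difference between
seq_printf and the seq_puts/seq_put_decimal_ull approach in the patch below.

/* fmt_cost.c: format-string parsing vs. direct digit emission */
#include <stdio.h>
#include <time.h>

/* hypothetical helper: write "<name><val> kB\n" into buf, no format string */
static int put_kb(char *buf, const char *name, unsigned long val)
{
  char tmp[24];
  int i = 0, n = 0;

  while (*name)
    buf[n++] = *name++;
  do {
    tmp[i++] = '0' + val % 10;
    val /= 10;
  } while (val);
  while (i)
    buf[n++] = tmp[--i];
  buf[n++] = ' ';
  buf[n++] = 'k';
  buf[n++] = 'B';
  buf[n++] = '\n';
  return n;
}

int main(void)
{
  char buf[64];
  unsigned long i, sink = 0;
  clock_t t;

  t = clock();
  for (i = 0; i < 10000000UL; i++)
    sink += snprintf(buf, sizeof(buf), "Rss: %8lu kB\n", i);
  printf("snprintf: %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

  t = clock();
  for (i = 0; i < 10000000UL; i++)
    sink += put_kb(buf, "Rss: ", i);
  printf("manual:   %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

  /* use sink so the loops are not optimized away */
  return (int)(sink & 1);
}
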
>
> well, this is true of course but I would much rather try to reduce the
> overhead of the smaps file than add a new file. The following should help
> already. I've measured a ~7% systime cut. I guess there is still some
> room for improvement but I have to say I'm far from being convinced about
> a new proc file just because we suck at dumping information to
> userspace. If this was something like /proc/<pid>/stat, which is
> essentially read all the time, then it would be a different question, but
> are rss and pss going to be read all that often? If yes, why? These are
> the questions which should be answered before we even start considering
> the implementation.
@Sonny Rao: Maybe you can comment on how often and for how many processes
this information is needed, and for which reasons it is useful.
> ---
> From 2a6883a7278ff8979808cb8e2dbcefe5ea3bf672 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@...e.com>
> Date: Wed, 17 Aug 2016 14:00:13 +0200
> Subject: [PATCH] proc, smaps: reduce printing overhead
>
> seq_printf (used by show_smap) can be pretty expensive when dumping a
> lot of numbers. Say we would like to get Rss and Pss from a particular
> process. In order to measure a pathological case let's generate as many
> mappings as possible:
>
> $ cat max_mmap.c
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdio.h>
>
> int main()
> {
>         while (mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED|MAP_POPULATE, -1, 0) != MAP_FAILED)
>                 ;
>
>         printf("pid:%d\n", getpid());
>         pause();
>         return 0;
> }
>
> $ awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss}' /proc/$pid/smaps
>
> would do the trick. The whole runtime is spent in kernel space, which is
> not all that unexpected because smaps is not the cheapest file to generate
> (we have to do the rmap walk etc.).
>
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3050/smaps"
> User time (seconds): 0.01
> System time (seconds): 0.44
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.47
>
> But the perf says:
> 22.55% awk [kernel.kallsyms] [k] format_decode
> 14.65% awk [kernel.kallsyms] [k] vsnprintf
> 6.40% awk [kernel.kallsyms] [k] number
> 2.53% awk [kernel.kallsyms] [k] shmem_mapping
> 2.53% awk [kernel.kallsyms] [k] show_smap
> 1.81% awk [kernel.kallsyms] [k] lock_acquire
>
> we are spending most of the time actually generating the output, which is
> quite lame. Let's replace seq_printf with seq_puts and seq_put_decimal_ull.
> This will give us:
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf "rss:%d pss:%d\n", rss, pss} /proc/3067/smaps"
> User time (seconds): 0.00
> System time (seconds): 0.41
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.42
>
> which gives us a ~7% improvement. Perf says:
> 28.87% awk [kernel.kallsyms] [k] seq_puts
> 5.30% awk [kernel.kallsyms] [k] vsnprintf
> 4.54% awk [kernel.kallsyms] [k] format_decode
> 3.73% awk [kernel.kallsyms] [k] show_smap
> 2.56% awk [kernel.kallsyms] [k] shmem_mapping
> 1.92% awk [kernel.kallsyms] [k] number
> 1.80% awk [kernel.kallsyms] [k] lock_acquire
> 1.75% awk [kernel.kallsyms] [k] print_name_value_kb
>
> Reported-by: Jann Horn <jann@...jh.net>
> Signed-off-by: Michal Hocko <mhocko@...e.com>
> ---
> fs/proc/task_mmu.c | 63 ++++++++++++++++++++++--------------------------------
> 1 file changed, 25 insertions(+), 38 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 187d84ef9de9..41c24c0811da 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -721,6 +721,13 @@ void __weak arch_show_smap(struct seq_file *m, struct vm_area_struct *vma)
> {
> }
>
> +static void print_name_value_kb(struct seq_file *m, const char *name, unsigned long val)
> +{
> + seq_puts(m, name);
> + seq_put_decimal_ull(m, 0, val);
> + seq_puts(m, " kB\n");
> +}
> +
> static int show_smap(struct seq_file *m, void *v, int is_pid)
> {
> struct vm_area_struct *vma = v;
> @@ -765,45 +772,25 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>
> show_map_vma(m, vma, is_pid);
>
> - seq_printf(m,
> - "Size: %8lu kB\n"
> - "Rss: %8lu kB\n"
> - "Pss: %8lu kB\n"
> - "Shared_Clean: %8lu kB\n"
> - "Shared_Dirty: %8lu kB\n"
> - "Private_Clean: %8lu kB\n"
> - "Private_Dirty: %8lu kB\n"
> - "Referenced: %8lu kB\n"
> - "Anonymous: %8lu kB\n"
> - "AnonHugePages: %8lu kB\n"
> - "ShmemPmdMapped: %8lu kB\n"
> - "Shared_Hugetlb: %8lu kB\n"
> - "Private_Hugetlb: %7lu kB\n"
> - "Swap: %8lu kB\n"
> - "SwapPss: %8lu kB\n"
> - "KernelPageSize: %8lu kB\n"
> - "MMUPageSize: %8lu kB\n"
> - "Locked: %8lu kB\n",
> - (vma->vm_end - vma->vm_start) >> 10,
> - mss.resident >> 10,
> - (unsigned long)(mss.pss >> (10 + PSS_SHIFT)),
> - mss.shared_clean >> 10,
> - mss.shared_dirty >> 10,
> - mss.private_clean >> 10,
> - mss.private_dirty >> 10,
> - mss.referenced >> 10,
> - mss.anonymous >> 10,
> - mss.anonymous_thp >> 10,
> - mss.shmem_thp >> 10,
> - mss.shared_hugetlb >> 10,
> - mss.private_hugetlb >> 10,
> - mss.swap >> 10,
> - (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)),
> - vma_kernel_pagesize(vma) >> 10,
> - vma_mmu_pagesize(vma) >> 10,
> - (vma->vm_flags & VM_LOCKED) ?
> + print_name_value_kb(m, "Size: ", (vma->vm_end - vma->vm_start) >> 10);
> + print_name_value_kb(m, "Rss: ", mss.resident >> 10);
> + print_name_value_kb(m, "Pss: ", (unsigned long)(mss.pss >> (10 + PSS_SHIFT)));
> + print_name_value_kb(m, "Shared_Clean: ", mss.shared_clean >> 10);
> + print_name_value_kb(m, "Shared_Dirty: ", mss.shared_dirty >> 10);
> + print_name_value_kb(m, "Private_Clean: ", mss.private_clean >> 10);
> + print_name_value_kb(m, "Private_Dirty: ", mss.private_dirty >> 10);
> + print_name_value_kb(m, "Referenced: ", mss.referenced >> 10);
> + print_name_value_kb(m, "Anonymous: ", mss.anonymous >> 10);
> + print_name_value_kb(m, "AnonHugePages: ", mss.anonymous_thp >> 10);
> + print_name_value_kb(m, "ShmemPmdMapped: ", mss.shmem_thp >> 10);
> + print_name_value_kb(m, "Shared_Hugetlb: ", mss.shared_hugetlb >> 10);
> + print_name_value_kb(m, "Private_Hugetlb: ", mss.private_hugetlb >> 10);
> + print_name_value_kb(m, "Swap: ", mss.swap >> 10);
> + print_name_value_kb(m, "SwapPss: ", (unsigned long)(mss.swap_pss >> (10 + PSS_SHIFT)));
> + print_name_value_kb(m, "KernelPageSize: ", vma_kernel_pagesize(vma) >> 10);
> + print_name_value_kb(m, "MMUPageSize: ", vma_mmu_pagesize(vma) >> 10);
> + print_name_value_kb(m, "Locked: ", (vma->vm_flags & VM_LOCKED) ?
> (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);
> -
> arch_show_smap(m, vma);
> show_smap_vma_flags(m, vma);
> m_cache_vma(m, vma);
>