Message-ID: <CAAmzW4OW2h5T-uf-neG=zWAo8Ozw7zK_79zx0ZqTwZWX3Dy2fg@mail.gmail.com>
Date: Tue, 24 Mar 2015 02:23:05 +0900
From: Joonsoo Kim <js1304@...il.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
Ingo Molnar <mingo@...nel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Jiri Olsa <jolsa@...hat.com>,
LKML <linux-kernel@...r.kernel.org>,
David Ahern <dsahern@...il.com>,
Minchan Kim <minchan@...nel.org>
Subject: Re: [PATCHSET 0/5] perf kmem: Implement page allocation analysis (v3)
Hello, Namhyung.
2015-03-23 15:30 GMT+09:00 Namhyung Kim <namhyung@...nel.org>:
> Hello,
>
> Currently perf kmem command only analyzes SLAB memory allocation. And
> I'd like to introduce page allocation analysis also. Users can use
> --slab and/or --page option to select it. If none of these options
> are used, it does slab allocation analysis for backward compatibility.
>
> * changes in v3)
> - add live page statistics
>
> * changes in v2)
> - Use thousand grouping for big numbers - i.e. 12345 -> 12,345 (Ingo)
> - Improve output stat readability (Ingo)
> - Remove alloc size column as it can be calculated from hits and order
>
> Patch 1 is to support thousand grouping on stat output. Patch 2
> implements basic support for page allocation analysis, patch 3 deals
> with the callsite and finally patch 4 implements sorting.
>
> In this patchset, I used two kmem events: kmem:mm_page_alloc and
> kmem:mm_page_free for analysis as they can track almost all of memory
> allocation/free path AFAIK. However, unlike slab tracepoint events,
> those page allocation events don't provide callsite info directly. So
> I recorded callchains and extracted callsites like below:
>
> Normal page allocation callchains look like this:
>
> 360a7e __alloc_pages_nodemask
> 3a711c alloc_pages_current
> 357bc7 __page_cache_alloc <-- callsite
> 357cf6 pagecache_get_page
> 48b0a prepare_pages
> 494d3 __btrfs_buffered_write
> 49cdf btrfs_file_write_iter
> 3ceb6e new_sync_write
> 3cf447 vfs_write
> 3cff99 sys_write
> 7556e9 system_call
> f880 __write_nocancel
> 33eb9 cmd_record
> 4b38e cmd_kmem
> 7aa23 run_builtin
> 27a9a main
> 20800 __libc_start_main
>
> But the first two are internal page allocation functions, so they should
> be skipped. To determine such allocation functions, I used the following regex:
>
> ^_?_?(alloc|get_free|get_zeroed)_pages?
>
> This gave me the following list of functions (you can see this with -v):
>
> alloc func: __get_free_pages
> alloc func: get_zeroed_page
> alloc func: alloc_pages_exact
> alloc func: __alloc_pages_direct_compact
> alloc func: __alloc_pages_nodemask
> alloc func: alloc_page_interleave
> alloc func: alloc_pages_current
> alloc func: alloc_pages_vma
> alloc func: alloc_page_buffers
> alloc func: alloc_pages_exact_nid
>
> After skipping those functions, it got '__page_cache_alloc'.
It'd be better to have an option for storing more of the call stack.
A single callsite isn't sufficient to identify the real caller
for some functions. For example, new_slab(), one of the callsites in
your example, doesn't tell which subsystem tried to allocate a slab
object and fell through to the page allocator.
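
A minimal sketch of that idea in Python, not the perf implementation: it
skips the leading allocator frames using the regex quoted above and keeps a
configurable number of frames after them. The `find_callsites` name and the
`depth` parameter are hypothetical, chosen only for illustration.

```python
import re

# Regex from the cover letter: matches internal page allocation functions.
ALLOC_FUNC = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsites(callchain, depth=1):
    """Skip leading allocator frames, then return up to `depth` callers.

    `callchain` is a list of symbol names, innermost frame first.
    `depth` is a hypothetical knob for keeping more of the call stack.
    """
    for i, symbol in enumerate(callchain):
        if not ALLOC_FUNC.match(symbol):
            return callchain[i:i + depth]
    return []  # every frame looked like an allocator function


# Top of the callchain quoted in the cover letter.
chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
    "__btrfs_buffered_write",
]

print(find_callsites(chain))           # single callsite
print(find_callsites(chain, depth=3))  # callsite plus two more callers
```

With depth 1 this yields only __page_cache_alloc, as in the cover letter;
with depth 3 it also keeps pagecache_get_page and prepare_pages, which is
the kind of extra context that would help for a callsite like new_slab().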
> Other information such as allocation order, migration type and gfp
> flags are provided by tracepoint events.
>
> Basically the output will be sorted by total allocation bytes, but you
> can change it by using -s/--sort option. The following sort keys are
> added to support page analysis: page, order, mtype, gfp. Existing
> 'callsite', 'bytes' and 'hit' sort keys also can be used.
>
> An example follows:
>
> # perf kmem record --slab --page sleep 1
> [ perf record: Woken up 0 times to write data ]
> [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
>
> # perf kmem stat --page --caller -l 10 -s order,hit
>
> --------------------------------------------------------------------------------------------
> Total alloc (KB) | Hits | Order | Migration type | GFP flags | Callsite
> --------------------------------------------------------------------------------------------
> 64 | 4 | 2 | RECLAIMABLE | 00285250 | new_slab
> 50,144 | 12,536 | 0 | MOVABLE | 0102005a | __page_cache_alloc
> 52 | 13 | 0 | UNMOVABLE | 002084d0 | pte_alloc_one
> 40 | 10 | 0 | MOVABLE | 000280da | handle_mm_fault
> 28 | 7 | 0 | UNMOVABLE | 000000d0 | __pollwait
> 20 | 5 | 0 | MOVABLE | 000200da | do_wp_page
> 20 | 5 | 0 | MOVABLE | 000200da | do_cow_fault
> 16 | 4 | 0 | UNMOVABLE | 00000200 | __tlb_remove_page
> 16 | 4 | 0 | UNMOVABLE | 000084d0 | __pmd_alloc
> 8 | 2 | 0 | UNMOVABLE | 000084d0 | __pud_alloc
> ... | ... | ... | ... | ... | ...
> --------------------------------------------------------------------------------------------
How about printing the GFP flags more intuitively, for example,
GFP_NOFS|GFP_ZERO? The mm_page_alloc tracepoint already prints
its output in this format.
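
To illustrate, here is a rough Python sketch of such decoding. The bit
values and names in the table are assumptions based on include/linux/gfp.h
and the mm_page_alloc tracepoint format of kernels from around this time;
the table is incomplete and the values differ between kernel versions.
`format_gfp` is a hypothetical helper, not a perf function.

```python
# Assumed (mask, name) pairs; compound masks come first so that they are
# preferred over their individual bits. Incomplete and version dependent.
GFP_NAMES = [
    (0x200DA, "GFP_HIGHUSER_MOVABLE"),
    (0x000D0, "GFP_KERNEL"),
    (0x00050, "GFP_NOFS"),
    (0x00010, "GFP_WAIT"),
    (0x00020, "GFP_HIGH"),
    (0x00040, "GFP_IO"),
    (0x00080, "GFP_FS"),
    (0x00100, "GFP_COLD"),
    (0x00200, "GFP_NOWARN"),
    (0x00400, "GFP_REPEAT"),
    (0x08000, "GFP_ZERO"),
    (0x20000, "GFP_HARDWALL"),
    (0x80000, "GFP_RECLAIMABLE"),
    (0x00008, "GFP_MOVABLE"),
    (0x00002, "GFP_HIGHMEM"),
]


def format_gfp(flags):
    """Render a GFP bitmask as 'NAME|NAME|...'.

    Each entry is emitted only if all of its bits are still set; those
    bits are then cleared. Bits not covered by the table are appended
    in hex so that no information is lost.
    """
    names = []
    for mask, name in GFP_NAMES:
        if flags & mask == mask:
            names.append(name)
            flags &= ~mask
    if flags:
        names.append(hex(flags))
    return "|".join(names) if names else "0"


# Values taken from the "GFP flags" column of the example output.
for value in (0x000000D0, 0x000200DA, 0x000084D0):
    print(f"{value:08x} -> {format_gfp(value)}")
```

Under these assumptions, 000000d0 from the table above would be shown as
GFP_KERNEL and 000200da as GFP_HIGHUSER_MOVABLE.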
Thanks.