Message-ID: <20150312104119.GA5978@gmail.com>
Date:	Thu, 12 Mar 2015 11:41:19 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Namhyung Kim <namhyung@...nel.org>
Cc:	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Jiri Olsa <jolsa@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>,
	David Ahern <dsahern@...il.com>,
	Minchan Kim <minchan@...nel.org>,
	Joonsoo Kim <js1304@...il.com>
Subject: Re: [RFC/PATCHSET 0/6] perf kmem: Implement page allocation analysis
 (v1)
* Namhyung Kim <namhyung@...nel.org> wrote:
> Hello,
> 
> Currently the perf kmem command only analyzes SLAB memory allocation,
> and I'd like to introduce page allocation analysis as well.  Users can
> use the --slab and/or --page options to select it.  If neither option
> is used, it does slab allocation analysis for backward compatibility.
> 
> Patches 1-3 are bugfix and cleanup patches.  Patch 4 implements basic
> support for page allocation analysis, patch 5 deals with the callsite
> and finally patch 6 implements sorting.
> 
> In this patchset, I used two kmem events for analysis:
> kmem:mm_page_alloc and kmem:mm_page_free, as they can track every
> memory allocation/free path AFAIK.  However, unlike the slab tracepoint
> events, those page allocation events don't provide callsite info
> directly.  So I recorded callchains and extracted callsites as below:
Really cool features!

I have a couple of observations about the output typography:
> Normal page allocation callchains look like this:
> 
>   360a7e __alloc_pages_nodemask
>   3a711c alloc_pages_current
>   357bc7 __page_cache_alloc   <-- callsite
>   357cf6 pagecache_get_page
>    48b0a prepare_pages
>    494d3 __btrfs_buffered_write
>    49cdf btrfs_file_write_iter
>   3ceb6e new_sync_write
>   3cf447 vfs_write
>   3cff99 sys_write
>   7556e9 system_call
>     f880 __write_nocancel
>    33eb9 cmd_record
>    4b38e cmd_kmem
>    7aa23 run_builtin
>    27a9a main
>    20800 __libc_start_main
> 
> But the first two are internal page allocation functions, so they should
> be skipped.  To identify such allocation functions, I used the following regex:
> 
>   ^_?_?(alloc|get_free|get_zeroed)_pages?
> 
> This gave me the following list of functions (you can see this with -v):
> 
>   alloc func: __get_free_pages
>   alloc func: get_zeroed_page
>   alloc func: alloc_pages_exact
>   alloc func: __alloc_pages_direct_compact
>   alloc func: __alloc_pages_nodemask
>   alloc func: alloc_page_interleave
>   alloc func: alloc_pages_current
>   alloc func: alloc_pages_vma
>   alloc func: alloc_page_buffers
>   alloc func: alloc_pages_exact_nid
> 
> After skipping those functions, the callsite is '__page_cache_alloc'.
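
To illustrate the skipping logic described above, here is a minimal Python sketch. It is not perf's implementation: the helper name find_callsite and the list-of-symbols input are assumptions made for illustration only. The regex is the one quoted above, and the callchain is the top of the example shown earlier.

```python
import re

# The regex quoted above for internal page allocation functions.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsite(callchain):
    """Return the first symbol that is not an internal allocation function.

    `callchain` is a list of symbol names, innermost frame first
    (hypothetical input format, chosen for this sketch).
    """
    for sym in callchain:
        if not ALLOC_FUNC_RE.match(sym):
            return sym
    return None  # every frame looked like an allocator function


# Innermost frames of the example callchain from this thread.
chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
]

print(find_callsite(chain))
```

The first two frames match the regex and are skipped, so the sketch reports '__page_cache_alloc', in line with the result stated above.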
> 
> Other information such as allocation order, migration type and gfp
> flags are provided by tracepoint events.
> 
> By default the output is sorted by total allocated bytes, but you
> can change this with the -s/--sort option.  The following sort keys are
> added to support page analysis: page, order, mtype, gfp.  The existing
> 'callsite', 'bytes' and 'hit' sort keys can also be used.
> 
> An example follows:
> 
>   # perf kmem record --slab --page sleep 1
>   [ perf record: Woken up 0 times to write data ]
>   [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
> 
>   # perf kmem stat --page --caller -l 10 -s order,hit
> 
>   --------------------------------------------------------------------------------------------
>    Total_alloc/Per | Hit      | Order | Migrate type | GFP flag | Callsite
s/Per/Size
s/Hit/Hits
s/Migrate type/Migration type
s/GFP flag/GFP flags
?
>   --------------------------------------------------------------------------------------------
>        65536/16384 |        4 |     2 |  RECLAIMABLE | 00285250 | new_slab
>     51347456/4096  |    12536 |     0 |      MOVABLE | 0102005a | __page_cache_alloc
>        53248/4096  |       13 |     0 |    UNMOVABLE | 002084d0 | pte_alloc_one
>        40960/4096  |       10 |     0 |      MOVABLE | 000280da | handle_mm_fault
>        28672/4096  |        7 |     0 |    UNMOVABLE | 000000d0 | __pollwait
>        20480/4096  |        5 |     0 |      MOVABLE | 000200da | do_wp_page
>        20480/4096  |        5 |     0 |      MOVABLE | 000200da | do_cow_fault
>        16384/4096  |        4 |     0 |    UNMOVABLE | 00000200 | __tlb_remove_page
>        16384/4096  |        4 |     0 |    UNMOVABLE | 000084d0 | __pmd_alloc
>         8192/4096  |        2 |     0 |    UNMOVABLE | 000084d0 | __pud_alloc
>    ...             | ...      | ...   | ...          | ...      | ...
>   --------------------------------------------------------------------------------------------
> 
>   SUMMARY (page allocator)
>   ========================
>   Total alloc requested: 12593
>   Total alloc failure  : 0
>   Total bytes allocated: 51630080
>   Total free  requested: 115
>   Total free  unmatched: 67
>   Total bytes freed    : 471040
I'd suggest the following changes to the format:
  - Collapse stats into 3 groups: 'allocated+freed', 'allocated only', 
    'freed only', depending on how much of their lifetime we've 
    managed to trace. These groups are really distinct and it makes 
    little sense to mix up their stats.
  - Add commas to the numbers, to make it easier to read and compare 
    larger numbers.
  - Right-align the numbers, to make them easy to compare when they
    are placed under each other.
  - Merge the 'count' and 'bytes' stats into a single line, so that 
    the output is more compact and easier to navigate, and only 
    numbers of a comparable type are placed under each other.
I.e. something like this (mockup) output:
   SUMMARY (page allocator)
   ========================
   Pages allocated+freed:       12,593   [     51,630,080 bytes ]
   Pages allocated-only:         2,342   [      1,235,010 bytes ]
   Pages freed-only:                67   [        135,311 bytes ]
   Page allocation failures :        0
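
A rough Python sketch of how such lines could be produced, purely to illustrate the comma grouping and right alignment (the function name format_summary and the column widths are assumptions, and the numbers are the mockup values from above, not real measurements):

```python
def format_summary(groups, failures):
    """Render the page allocator summary with grouped, right-aligned numbers.

    `groups` is a list of (label, (pages, bytes)) tuples; the widths used
    below are arbitrary choices for this sketch.
    """
    lines = ["SUMMARY (page allocator)", "========================"]
    for label, (pages, nbytes) in groups:
        # ',' in the format spec inserts thousands separators,
        # '>' right-aligns within the given width.
        lines.append(f"{label + ':':<26}{pages:>10,}   [ {nbytes:>14,} bytes ]")
    lines.append(f"{'Page allocation failures:':<26}{failures:>10,}")
    return "\n".join(lines)


# Values copied from the mockup above.
groups = [
    ("Pages allocated+freed", (12593, 51630080)),
    ("Pages allocated-only", (2342, 1235010)),
    ("Pages freed-only", (67, 135311)),
]

print(format_summary(groups, failures=0))
```

Keeping the count and the byte total in fixed-width, right-aligned fields is what makes the rows line up under each other.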
>   Order     UNMOVABLE   RECLAIMABLE       MOVABLE      RESERVED   CMA/ISOLATE
>   -----  ------------  ------------  ------------  ------------  ------------
>       0            32             0         12557             0             0
>       1             0             0             0             0             0
>       2             0             4             0             0             0
>       3             0             0             0             0             0
>       4             0             0             0             0             0
>       5             0             0             0             0             0
>       6             0             0             0             0             0
>       7             0             0             0             0             0
>       8             0             0             0             0             0
>       9             0             0             0             0             0
>      10             0             0             0             0             0
Here I'd suggest the following refinements:
 - Use '.' instead of '0', to make actual nonzero values stand out 
   visually, while still keeping a tabular format
 - Merge the 'Reserved', 'CMA/Isolate' columns into a single 'Special' 
   column: this will be zero in 99.9% of the cases, as those pages 
   mostly deal with driver interfaces, mostly used during init/deinit.
 - Capitalize less.
 - Use comma-separated numbers for better readability.
So something like this:
   Order     Unmovable   Reclaimable       Movable       Special
   -----  ------------  ------------  ------------  ------------
       0            32             .        12,557             .
       1             .             .             .             .
       2             .             4             .             .
       3             .             .             .             .
       4             .             .             .             .
       5             .             .             .             .
       6             .             .             .             .
       7             .             .             .             .
       8             .             .             .             .
       9             .             .             .             .
      10             .             .             .             .
Note, for example, how easy the '4' value is to spot now, while it 
was pretty easy to miss in the original table.
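
As an illustration of these refinements, a small Python sketch that prints '.' for zero cells and comma-groups the rest. The function name format_order_table and the input layout are hypothetical; the sample counts are the first three rows of the table above, with the 'Reserved' and 'CMA/Isolate' columns already merged into 'Special'.

```python
def format_order_table(counts, columns):
    """Render per-order counts, showing '.' for zero so nonzero values stand out.

    `counts[order]` is a list with one count per entry in `columns`.
    """
    header = f"{'Order':>5}" + "".join(f"  {c:>12}" for c in columns)
    rule = "-----" + "  ------------" * len(columns)
    rows = [header, rule]
    for order, row in enumerate(counts):
        cells = "".join(
            f"  {('.' if n == 0 else format(n, ',')):>12}" for n in row
        )
        rows.append(f"{order:>5}{cells}")
    return "\n".join(rows)


columns = ["Unmovable", "Reclaimable", "Movable", "Special"]
counts = [
    [32, 0, 12557, 0],  # order 0
    [0, 0, 0, 0],       # order 1
    [0, 4, 0, 0],       # order 2
]

print(format_order_table(counts, columns))
```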
> I have some ideas on how to improve it.  But I'd also like to hear other 
> ideas, suggestions, feedback and so on.
So there's one thing that would be useful: to track pages allocated on 
one node, but freed on another. Those kinds of allocation/free 
patterns are especially expensive and might make sense to visualize.
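
A minimal sketch of the bookkeeping this would need, in Python. Everything here is an assumption for illustration: the (kind, pfn, node) event tuples are a hypothetical, already-decoded form, and how the node would actually be derived from the trace data is left open. The sketch also ignores allocation order and treats each pfn as a single page.

```python
from collections import Counter


def cross_node_frees(events):
    """Count pages that were allocated on one node and freed on another.

    `events` is an iterable of (kind, pfn, node) tuples in time order,
    where kind is "alloc" or "free" (hypothetical input format).
    Returns a Counter keyed by (alloc_node, free_node).
    """
    alloc_node = {}     # pfn -> node recorded at allocation time
    remote = Counter()  # (alloc node, free node) -> number of pages
    for kind, pfn, node in events:
        if kind == "alloc":
            alloc_node[pfn] = node
        elif kind == "free":
            src = alloc_node.pop(pfn, None)
            # Frees without a traced allocation are skipped here.
            if src is not None and src != node:
                remote[(src, node)] += 1
    return remote


# Made-up sample events, only to exercise the sketch.
events = [
    ("alloc", 0x1000, 0),
    ("alloc", 0x1001, 0),
    ("alloc", 0x2000, 1),
    ("free", 0x1000, 1),  # allocated on node 0, freed on node 1
    ("free", 0x1001, 0),  # same node
    ("free", 0x3000, 1),  # unmatched free
]

for (src, dst), n in cross_node_frees(events).items():
    print(f"node {src} -> node {dst}: {n} page(s)")
```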
Thanks,
	Ingo