Date:	Thu, 12 Mar 2015 11:41:19 +0100
From:	Ingo Molnar <mingo@...nel.org>
To:	Namhyung Kim <namhyung@...nel.org>
Cc:	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Jiri Olsa <jolsa@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>,
	David Ahern <dsahern@...il.com>,
	Minchan Kim <minchan@...nel.org>,
	Joonsoo Kim <js1304@...il.com>
Subject: Re: [RFC/PATCHSET 0/6] perf kmem: Implement page allocation analysis
 (v1)

* Namhyung Kim <namhyung@...nel.org> wrote:

> Hello,
> 
> Currently the perf kmem command only analyzes SLAB memory allocation,
> and I'd like to introduce page allocation analysis as well.  Users can
> use the --slab and/or --page options to select it.  If neither option
> is used, it does slab allocation analysis for backward compatibility.
> 
> Patches 1-3 are bugfixes and cleanups.  Patch 4 implements basic
> support for page allocation analysis, patch 5 deals with the callsite
> and finally patch 6 implements sorting.
> 
> In this patchset, I used two kmem events: kmem:mm_page_alloc and
> kmem:mm_page_free for analysis as they can track every memory
> allocation/free path AFAIK.  However, unlike slab tracepoint events,
> those page allocation events don't provide callsite info directly.  So
> I recorded callchains and extracted callsites like below:

Really cool features!

I have a couple of output typography observations:

> Normal page allocation callchains look like this:
> 
>   360a7e __alloc_pages_nodemask
>   3a711c alloc_pages_current
>   357bc7 __page_cache_alloc   <-- callsite
>   357cf6 pagecache_get_page
>    48b0a prepare_pages
>    494d3 __btrfs_buffered_write
>    49cdf btrfs_file_write_iter
>   3ceb6e new_sync_write
>   3cf447 vfs_write
>   3cff99 sys_write
>   7556e9 system_call
>     f880 __write_nocancel
>    33eb9 cmd_record
>    4b38e cmd_kmem
>    7aa23 run_builtin
>    27a9a main
>    20800 __libc_start_main
> 
> But the first two are internal page allocation functions, so they should
> be skipped.  To identify such allocation functions, I used the following regex:
> 
>   ^_?_?(alloc|get_free|get_zeroed)_pages?
> 
> This gave me the following list of functions (you can see this with -v):
> 
>   alloc func: __get_free_pages
>   alloc func: get_zeroed_page
>   alloc func: alloc_pages_exact
>   alloc func: __alloc_pages_direct_compact
>   alloc func: __alloc_pages_nodemask
>   alloc func: alloc_page_interleave
>   alloc func: alloc_pages_current
>   alloc func: alloc_pages_vma
>   alloc func: alloc_page_buffers
>   alloc func: alloc_pages_exact_nid
> 
> After skipping those functions, the callsite is '__page_cache_alloc'.
> 
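
The callsite selection described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the patchset: it matches the quoted regex against symbol names directly, and the helper name find_callsite is made up for the example.

```python
import re

# Regex quoted in the patchset description for internal page allocators.
ALLOC_FUNC_RE = re.compile(r"^_?_?(alloc|get_free|get_zeroed)_pages?")


def find_callsite(callchain):
    """Return the first symbol in the callchain that is not an allocator.

    `callchain` is ordered innermost frame first, as in the example
    callchain quoted above. (Hypothetical helper, for illustration only.)
    """
    for symbol in callchain:
        if not ALLOC_FUNC_RE.match(symbol):
            return symbol
    return None  # every frame looked like an allocation function


chain = [
    "__alloc_pages_nodemask",
    "alloc_pages_current",
    "__page_cache_alloc",
    "pagecache_get_page",
    "prepare_pages",
]
print(find_callsite(chain))
```
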
> Other information such as allocation order, migration type and gfp
> flags are provided by tracepoint events.
> 
> By default the output is sorted by total allocation bytes, but you
> can change it by using the -s/--sort option.  The following sort keys
> are added to support page analysis: page, order, mtype, gfp.  The existing
> 'callsite', 'bytes' and 'hit' sort keys can also be used.
> 
> An example follows:
> 
>   # perf kmem record --slab --page sleep 1
>   [ perf record: Woken up 0 times to write data ]
>   [ perf record: Captured and wrote 49.277 MB perf.data (191027 samples) ]
> 
>   # perf kmem stat --page --caller -l 10 -s order,hit
> 
>   --------------------------------------------------------------------------------------------
>    Total_alloc/Per | Hit      | Order | Migrate type | GFP flag | Callsite

s/Per/Size
s/Hit/Hits
s/Migrate type/Migration type
s/GFP flag/GFP flags

?

>   --------------------------------------------------------------------------------------------
>        65536/16384 |        4 |     2 |  RECLAIMABLE | 00285250 | new_slab
>     51347456/4096  |    12536 |     0 |      MOVABLE | 0102005a | __page_cache_alloc
>        53248/4096  |       13 |     0 |    UNMOVABLE | 002084d0 | pte_alloc_one
>        40960/4096  |       10 |     0 |      MOVABLE | 000280da | handle_mm_fault
>        28672/4096  |        7 |     0 |    UNMOVABLE | 000000d0 | __pollwait
>        20480/4096  |        5 |     0 |      MOVABLE | 000200da | do_wp_page
>        20480/4096  |        5 |     0 |      MOVABLE | 000200da | do_cow_fault
>        16384/4096  |        4 |     0 |    UNMOVABLE | 00000200 | __tlb_remove_page
>        16384/4096  |        4 |     0 |    UNMOVABLE | 000084d0 | __pmd_alloc
>         8192/4096  |        2 |     0 |    UNMOVABLE | 000084d0 | __pud_alloc
>    ...             | ...      | ...   | ...          | ...      | ...
>   --------------------------------------------------------------------------------------------
> 
>   SUMMARY (page allocator)
>   ========================
>   Total alloc requested: 12593
>   Total alloc failure  : 0
>   Total bytes allocated: 51630080
>   Total free  requested: 115
>   Total free  unmatched: 67
>   Total bytes freed    : 471040

I'd suggest the following changes to the format:

  - Collapse stats into 3 groups: 'allocated+freed', 'allocated only', 
    'freed only', depending on how much of their lifetime we've 
    managed to trace. These groups are really distinct and it makes 
    little sense to mix up their stats.

  - Add commas to the numbers, to make it easier to read and compare 
    larger numbers.

  - Right-align the numbers, to make them easy to compare when they
    are placed under each other.

  - Merge the 'count' and 'bytes' stats into a single line, so that 
    the output is more compact and easier to navigate, and so that 
    only numbers of comparable type are placed under each other.

I.e. something like this (mockup) output:

   SUMMARY (page allocator)
   ========================

   Pages allocated+freed:       12,593   [     51,630,080 bytes ]

   Pages allocated-only:         2,342   [      1,235,010 bytes ]
   Pages freed-only:                67   [        135,311 bytes ]

   Page allocation failures :        0
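
A rough sketch of how the three lifetime groups and the comma-separated, right-aligned formatting could fit together. It is written in Python purely for illustration; the Event record, summarize() and format_line() are hypothetical names, and pages are assumed to be identified by their pfn. A page allocated twice without a traced free simply overwrites the earlier entry here.

```python
from collections import namedtuple

# Hypothetical event record: kind is "alloc" or "free", pfn identifies the page.
Event = namedtuple("Event", "kind pfn nbytes")


def summarize(events):
    """Split traced pages into three lifetime groups of [count, bytes]."""
    live = {}  # pfn -> bytes for pages allocated but not yet freed
    stats = {
        "allocated+freed": [0, 0],
        "allocated-only": [0, 0],
        "freed-only": [0, 0],
    }
    for ev in events:
        if ev.kind == "alloc":
            live[ev.pfn] = ev.nbytes
        elif ev.pfn in live:
            # Whole lifetime was traced: allocation and free both seen.
            nbytes = live.pop(ev.pfn)
            stats["allocated+freed"][0] += 1
            stats["allocated+freed"][1] += nbytes
        else:
            # Free of a page whose allocation was not traced.
            stats["freed-only"][0] += 1
            stats["freed-only"][1] += ev.nbytes
    for nbytes in live.values():
        # Allocation seen, but no matching free within the trace.
        stats["allocated-only"][0] += 1
        stats["allocated-only"][1] += nbytes
    return stats


def format_line(label, count, nbytes):
    """Right-align both numbers and add thousands separators."""
    return f"Pages {label + ':':<18}{count:>12,}   [ {nbytes:>14,} bytes ]"


events = [
    Event("alloc", 1, 4096),
    Event("alloc", 2, 4096),
    Event("free", 1, 4096),
    Event("free", 9, 16384),
]
for label, (count, nbytes) in summarize(events).items():
    print(format_line(label, count, nbytes))
```
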


>   Order     UNMOVABLE   RECLAIMABLE       MOVABLE      RESERVED   CMA/ISOLATE
>   -----  ------------  ------------  ------------  ------------  ------------
>       0            32             0         12557             0             0
>       1             0             0             0             0             0
>       2             0             4             0             0             0
>       3             0             0             0             0             0
>       4             0             0             0             0             0
>       5             0             0             0             0             0
>       6             0             0             0             0             0
>       7             0             0             0             0             0
>       8             0             0             0             0             0
>       9             0             0             0             0             0
>      10             0             0             0             0             0

Here I'd suggest the following refinements:

 - Use '.' instead of '0', to make actual nonzero values stand out 
   visually, while still keeping a tabular format

 - Merge the 'Reserved', 'CMA/Isolate' columns into a single 'Special' 
   column: this will be zero in 99.9% of the cases, as those pages 
   mostly deal with driver interfaces, mostly used during init/deinit.

 - Capitalize less.

 - Use comma-separated numbers for better readability.

So something like this:


   Order     Unmovable   Reclaimable       Movable       Special
   -----  ------------  ------------  ------------  ------------
       0            32             .        12,557             .
       1             .             .             .             .
       2             .             4             .             .
       3             .             .             .             .
       4             .             .             .             .
       5             .             .             .             .
       6             .             .             .             .
       7             .             .             .             .
       8             .             .             .             .
       9             .             .             .             .
      10             .             .             .             .


Note, for example, how easily noticeable the '4' value is now, whereas 
it was pretty easy to miss in the original table.
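
The table refinements above could be sketched like this, again in Python and only as an illustration. The input layout (a dict of per-order counts keyed by the original column names) and the function names are assumptions made for the example.

```python
def fmt_cell(value, width=12):
    """Show nonzero counts with thousands separators and zero as '.'."""
    return f"{value:>{width},}" if value else f"{'.':>{width}}"


def render(table):
    """Render an order x migration type table.

    `table` maps order -> dict of counts keyed by the original column
    names; 'RESERVED' and 'CMA/ISOLATE' are merged into 'Special'.
    """
    columns = ["Unmovable", "Reclaimable", "Movable", "Special"]
    lines = [
        "Order  " + "  ".join(f"{c:>12}" for c in columns),
        "-----  " + "  ".join("-" * 12 for _ in columns),
    ]
    for order in sorted(table):
        row = table[order]
        merged = [
            row.get("UNMOVABLE", 0),
            row.get("RECLAIMABLE", 0),
            row.get("MOVABLE", 0),
            row.get("RESERVED", 0) + row.get("CMA/ISOLATE", 0),
        ]
        lines.append(f"{order:>5}  " + "  ".join(fmt_cell(v) for v in merged))
    return "\n".join(lines)


counts = {
    0: {"UNMOVABLE": 32, "MOVABLE": 12557},
    1: {},
    2: {"RECLAIMABLE": 4},
}
print(render(counts))
```
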

> I have some ideas on how to improve it, but I'd also like to hear other 
> ideas, suggestions, and feedback.

So there's one thing that would be useful: to track pages allocated on 
one node, but freed on another. Such allocation/free patterns are 
especially expensive, so it might make sense to visualize them.
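
A minimal sketch of the bookkeeping this would need, in Python for illustration only. It assumes each event can be annotated with a NUMA node (for a free, presumably the node of the CPU that ran it); the tuple layout and the name cross_node_frees are hypothetical.

```python
from collections import Counter


def cross_node_frees(events):
    """Count (alloc_node, free_node) pairs for pages freed on another node.

    `events` is an iterable of (kind, pfn, node) tuples in time order,
    where kind is "alloc" or "free". The node annotation is an assumption
    of this sketch, not something the events are known to carry directly.
    """
    alloc_node = {}  # pfn -> node the page was allocated on
    pairs = Counter()
    for kind, pfn, node in events:
        if kind == "alloc":
            alloc_node[pfn] = node
        elif kind == "free":
            origin = alloc_node.pop(pfn, None)
            # Frees without a traced allocation cannot be attributed.
            if origin is not None and origin != node:
                pairs[(origin, node)] += 1
    return pairs


trace = [
    ("alloc", 1, 0),
    ("alloc", 2, 0),
    ("free", 1, 1),  # allocated on node 0, freed on node 1
    ("free", 2, 0),  # same node, not counted
    ("free", 3, 1),  # allocation not traced, ignored
]
print(cross_node_frees(trace))
```
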

Thanks,

	Ingo