linux-kernel - Re: [PATCH v2] perf script python: integrate page reclaim analyze script


Message-ID: <20191001144524.GB3321@techsingularity.net>
Date:   Tue, 1 Oct 2019 15:45:24 +0100
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Yafang Shao <laoar.shao@...il.com>
Cc:     tonyj@...e.com, acme@...nel.org, peterz@...radead.org,
        mingo@...hat.com, alexander.shishkin@...ux.intel.com,
        jolsa@...hat.com, namhyung@...nel.org, akpm@...ux-foundation.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Tony Jones <tonyj@...e.de>
Subject: Re: [PATCH v2] perf script python: integrate page reclaim analyze
 script

On Mon, Sep 30, 2019 at 11:19:44PM -0400, Yafang Shao wrote:
> A new perf script page-reclaim is introduced in this patch. This new script
> is used to report the page reclaim details. The possible usage of this
> script is as below:
> - identify latency spike caused by direct reclaim
> - whether the latency spike is related to pageout
> - why is page reclaim requested, i.e. whether it is because of memory
>   fragmentation
> - page reclaim efficiency
> etc
> In the future we may also enhance it to analyze the memcg reclaim.
> 

Hi,

I ended up not reviewing this patch in detail simply because I would
approach the same class of problem in an entirely different way today.
There is value in accumulating the stats in a report like this:

>     $ perf script report page-reclaim
>     Direct reclaims: 4924
>     Direct latency (ms)        total         max         avg         min
>         	          177823.211    6378.977      36.114       0.051
>     Direct file reclaimed 22920
>     Direct file scanned 28306
>     Direct file sync write I/O 0
>     Direct file async write I/O 0
>     Direct anon reclaimed 212567
>     Direct anon scanned 1446854
>     Direct anon sync write I/O 0
>     Direct anon async write I/O 278325
>     Direct order      0     1     3
>         	   4870    23    31
>     Wake kswapd requests 716
>     Wake order      0     1
>         	  715     1
> 
>     Kswapd reclaims: 9

However, the basic option I would prefer is having the raw latency
information for Direct latency in a form that can be externally parsed by
R or any other statistical tool. The reason is that knowing the max latency
is not enough; I'd want to know the spread of latencies and whether they
were clustered at a point in time or spread out over long periods of
time. I would then build the higher-level reports on top if necessary.
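
As an illustration of the kind of external post-processing meant here, the
following is a minimal sketch, not part of the patch. It assumes the raw
data is already available as (timestamp in seconds, latency in ms) pairs;
the function names, the nearest-rank percentile method and the window size
are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_vals:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]


def latency_spread(samples, percentiles=(50, 90, 99)):
    """Summarise the spread of latencies rather than only the max.

    samples: iterable of (timestamp_s, latency_ms) pairs (assumed format).
    """
    latencies = sorted(lat for _, lat in samples)
    return {p: percentile(latencies, p) for p in percentiles}


def slow_events_per_window(samples, threshold_ms, window_s=1.0):
    """Count slow reclaims per time window to show clustering in time."""
    counts = Counter()
    for ts, lat in samples:
        if lat >= threshold_ms:
            counts[int(ts // window_s)] += 1
    return dict(counts)
```

A single window holding most of the slow events suggests a burst at one
point in time; slow events scattered across many windows suggest a
persistent problem.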

Today, I would also have considered getting the latency figures using eBPF
or systemtap instead, although having perf do it may be useful too. That's
not universally popular though, so at minimum I would have:

perf script record page-reclaim -- capture all page-reclaim tracepoints
perf script report page-reclaim -- For reclaim entry/exit, merge the two
	tracepoints into one that reports latency. Dump the rest out
	verbatim
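
The merge step described above could be sketched roughly as follows. This
is an assumption-laden illustration rather than the proposed script: events
are modelled as plain (timestamp in seconds, pid, event name) tuples
instead of perf's Python handler callbacks, entry and exit are matched per
pid, and the output record layout is hypothetical. The two event names are
the existing vmscan direct reclaim begin/end tracepoints.

```python
BEGIN = "mm_vmscan_direct_reclaim_begin"
END = "mm_vmscan_direct_reclaim_end"


def merge_reclaim_events(events):
    """Merge direct reclaim begin/end pairs into one latency record.

    events: iterable of (timestamp_s, pid, name) tuples in time order
    (an assumed, simplified input format).
    Yields (name, timestamp_s, pid, latency_ms); latency_ms is None for
    events that are passed through verbatim.
    """
    pending = {}  # pid -> timestamp of the reclaim entry still in flight
    for ts, pid, name in events:
        if name == BEGIN:
            pending[pid] = ts
        elif name == END:
            start = pending.pop(pid, None)
            if start is None:
                # Exit without a recorded entry, e.g. tracing started
                # mid-reclaim; no latency can be computed, so skip it.
                continue
            yield ("direct_reclaim", start, pid, (ts - start) * 1000.0)
        else:
            # Every other tracepoint is dumped out unchanged.
            yield (name, ts, pid, None)
```

Each merged record carries the entry timestamp together with the latency,
so the output can be fed directly to an external statistics tool.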

For latencies, I would externally post-process them until such time as I
found a common class of bug that needed a high-level report and then
build the perf script support for it.

Please note that I did not spot anything wrong with your script; it's
just that I would not use it myself in its current format for debugging
a reclaim-related problem.

-- 
Mel Gorman
SUSE Labs