Message-ID: <20191001144524.GB3321@techsingularity.net>
Date: Tue, 1 Oct 2019 15:45:24 +0100
From: Mel Gorman <mgorman@...hsingularity.net>
To: Yafang Shao <laoar.shao@...il.com>
Cc: tonyj@...e.com, acme@...nel.org, peterz@...radead.org,
mingo@...hat.com, alexander.shishkin@...ux.intel.com,
jolsa@...hat.com, namhyung@...nel.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Tony Jones <tonyj@...e.de>
Subject: Re: [PATCH v2] perf script python: integrate page reclaim analyze
script
On Mon, Sep 30, 2019 at 11:19:44PM -0400, Yafang Shao wrote:
> A new perf script page-reclaim is introduced in this patch. This new script
> is used to report the page reclaim details. The possible usage of this
> script is as below,
> - identify latency spike caused by direct reclaim
> - whether the latency spike is related to pageout
> - why is page reclaim requested, i.e. whether it is because of memory
> fragmentation
> - page reclaim efficiency
> etc
> In the future we may also enhance it to analyze the memcg reclaim.
>
Hi,
I ended up not reviewing this patch in detail simply because I would
approach the same class of problem in an entirely different way today.
There is value in accumulating the stats in a report like this:
> $ perf script report page-reclaim
> Direct reclaims: 4924
> Direct latency (ms)       total         max         avg         min
>                      177823.211    6378.977      36.114       0.051
> Direct file reclaimed 22920
> Direct file scanned 28306
> Direct file sync write I/O 0
> Direct file async write I/O 0
> Direct anon reclaimed 212567
> Direct anon scanned 1446854
> Direct anon sync write I/O 0
> Direct anon async write I/O 278325
> Direct order     0     1     3
>               4870    23    31
> Wake kswapd requests 716
> Wake order       0     1
>                715     1
>
> Kswapd reclaims: 9
However, the basic option I would prefer is having the raw latency
information for Direct latency in a form that can be externally parsed by
R or any other statistical tool. Knowing the max latency is not enough;
I'd want to know the spread of latencies and whether they were clustered
at a point in time or spread out over long periods of time. I would then
build the higher-level reports on top if necessary.
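As a rough illustration of the kind of external post-processing meant
here, the sketch below takes raw samples and reports the latency spread
plus a per-interval count to show clustering. The input format, a list of
(timestamp in seconds, latency in ms) pairs, and the function name are
assumptions made for the example; they are not something the posted
script emits.

```python
import statistics
from collections import Counter

def latency_spread(samples, bucket_s=1.0):
    """Summarise raw latencies.

    samples: list of (timestamp_s, latency_ms) tuples. This format is
    an assumption for illustration, not output of the posted script.
    """
    latencies = sorted(lat for _, lat in samples)
    # quantiles(n=100) returns 99 cut points: index 49 is the median,
    # index 89 is p90 and index 98 is p99. Needs at least two samples.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    spread = {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "max": latencies[-1],
    }
    # Count stalls per time bucket to see whether they cluster at one
    # point in time or are spread over the whole run.
    per_bucket = Counter(int(ts // bucket_s) for ts, _ in samples)
    return spread, dict(sorted(per_bucket.items()))
```

A single bucket holding most of the stalls would point at a burst, while
an even count per bucket would point at a steady background cost.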
Today, I would also have considered getting the latency figures using eBPF
or systemtap instead, although having perf do it may be useful too. That's
not universally popular though, so at a minimum I would have:
  perf script record page-reclaim -- capture all page-reclaim tracepoints
  perf script report page-reclaim -- For reclaim entry/exit, merge the two
        tracepoints into one that reports latency. Dump the rest out
        verbatim
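The merge step could look roughly like the sketch below. It pairs the
direct reclaim begin and end tracepoints per pid and passes everything
else through untouched. The event tuple layout (timestamp in seconds,
pid, event name) and the function name are assumptions for the example
and not the perf script handler interface; only the tracepoint names are
the real vmscan ones.

```python
BEGIN = "vmscan:mm_vmscan_direct_reclaim_begin"
END = "vmscan:mm_vmscan_direct_reclaim_end"

def merge_reclaim_events(events):
    """Merge begin/end pairs into latency records.

    events: iterable of (timestamp_s, pid, name) tuples in time order.
    This layout is a hypothetical one chosen for the sketch.
    Returns (merged, others): merged holds (begin_ts, pid, latency_ms)
    records, others holds every remaining event verbatim.
    """
    pending = {}  # pid -> timestamp of the reclaim entry not yet closed
    merged = []
    others = []
    for ts, pid, name in events:
        if name == BEGIN:
            pending[pid] = ts
        elif name == END and pid in pending:
            begin = pending.pop(pid)
            merged.append((begin, pid, (ts - begin) * 1000.0))
        else:
            # Unrelated tracepoints, or an end without a recorded
            # begin, are kept as-is rather than dropped.
            others.append((ts, pid, name))
    return merged, others
```

Each merged record keeps the entry timestamp, so the output can be fed
straight into whatever tool does the statistical post-processing.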
For latencies, I would externally post-process them until such time as I
found a common class of bug that needed a high-level report and then
build the perf script support for it.
Please note that I did not spot anything wrong with your script; it's
just that I would not use it myself in its current format for debugging
a reclaim-related problem.
--
Mel Gorman
SUSE Labs