Message-ID: <CACT4Y+Yh98VcNgmJ-gF9+inw=ZDkg1rRzi4_35f6krw8BBRpug@mail.gmail.com>
Date: Wed, 15 Jan 2025 08:11:51 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Ian Rogers <irogers@...gle.com>
Cc: Namhyung Kim <namhyung@...nel.org>, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, eranian@...gle.com,
Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
chu howard <howardchu95@...il.com>
Subject: Re: [PATCH v2] perf report: Add wall-clock and parallelism profiling
On Wed, 15 Jan 2025 at 06:59, Ian Rogers <irogers@...gle.com> wrote:
>
> On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@...gle.com> wrote:
> [snip]
> > FWIW I've also considered and started implementing a different
> > approach where the kernel would count parallelism level for each
> > context and write it out with samples:
> > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > Not sure how hard it is to make all corner cases work there, I dropped
> > it half way b/c the perf record post-processing looked like a better
> > approach.
>
> Nice. Just to focus on this point and go off on something of a
> tangent. I worry a little about perf_event_sample_format where we've
> used 25 out of the 64 bits of sample_type. Perhaps there will be a
> sample_type2 in the future. For the code and data page size it seems
> the same information could come from mmap events. You have a similar
> issue. I was thinking of another similar issue, adding information
> about the number of dirty pages in a VMA. I wonder if there is a
> better way to organize these things, rather than just keep using up
> bits in the perf_event_sample_format. For example, we could have a
> code page size software event that when in a leader sampling group
> with a hardware event with a sample IP provides the code page size
> information of the leader event's sample IP. We have loads of space in
> the types and config values to have an endless number of such events
> and maybe the value could be generated by a BPF program for yet more
> flexibility. What these events would mean without a leader sample
> event I'm not sure.
In the end I did not go with adding parallelism to each sample (this
is purely a perf report change), so at least for this patch this is
very tangential :)
> Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> sampling. Wall clock time being off CPU plus on CPU. We need to do
> something to move forward the default flags/options for perf record,
> for example, we don't enable build ID mmap events by default causing
> the whole perf.data file to be scanned looking to add build ID events
> for the dsos with samples in them. One option that could be a default
> could be off-CPU profiling, and when permissions deny the BPF approach
> we can fallback on using events. If these events are there by default
> then it makes sense to hook them up in perf report.
Interesting. Do you mean "IO" by "off-CPU"?
Yes, if a program was blocked on IO for 10 seconds (no CPU work),
then that obviously contributes to latency, but it won't show up in
this profile. Still, it works well for a large number of important
cases (e.g. builds, ML inference, and server request handling are
frequently not IO bound).
I was thinking about how IO could be accounted for in the wall-clock
profile. Since we have SWITCH OUT events (and they already include
the preemption bit), we do have the info to account for blocked
threads. But it gets somewhat complex and has to make some hypotheses
b/c not all blocked threads contribute to latency (e.g. a blocked
watchdog thread). So I left it out for now.
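For reference, the classification of "blocked" I mean is just the
misc bits on the context switch records. A minimal sketch (the helper
name is made up; the PERF_RECORD_* constants are the real ABI ones):

#include <stdbool.h>
#include <linux/perf_event.h>

/*
 * A thread counts as blocked (a candidate for IO wait accounting) if
 * it was switched out and the switch-out was not a preemption.
 */
static bool switch_out_is_blocked(const struct perf_event_header *header)
{
        if (header->type != PERF_RECORD_SWITCH &&
            header->type != PERF_RECORD_SWITCH_CPU_WIDE)
                return false;

        if (!(header->misc & PERF_RECORD_MISC_SWITCH_OUT))
                return false;

        return !(header->misc & PERF_RECORD_MISC_SWITCH_OUT_PREEMPT);
}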
The idea was as follows.
We know the set of threads that are blocked at any point in time
(switched out, but not preempted).
We hypothesise that each of these could equally improve CPU load to
max if/when unblocked.
We inject synthetic samples into a leaf "IO wait" symbol with a
weight according to that hypothesis. I think these events only need
to be injected before each switch-in event (which is what changes the
set of blocked threads).
Namely:
- If CPU load is already at max (parallelism == num cpus), we don't
  inject any IO wait events.
- If the number of blocked threads is 0, we don't inject any IO wait
  events.
- If there are C idle CPUs and B blocked threads, we inject IO wait
  events with weight C/B for each of them.
For example, if there is a single blocked thread, then we hypothesise
that this blocked thread is the root cause of all currently idle CPUs
being idle.
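To make the weighting concrete, a rough sketch of what that could
look like in perf report post-processing (not code from the patch;
the struct and the inject_io_wait_sample() hook are made-up names,
only the C/B split is the point):

#include <stdint.h>

/* Placeholder for a blocked thread; not perf's actual struct thread. */
struct blocked_thread;

/*
 * Hypothetical hook that would attach a synthetic sample to a leaf
 * "IO wait" symbol for the given thread, with the given weight.
 */
void inject_io_wait_sample(struct blocked_thread *t, uint64_t timestamp,
                           double weight);

/*
 * Called before processing each switch-in event, with the current
 * parallelism and the set of blocked (switched out, not preempted)
 * threads.
 */
void account_io_wait(unsigned int nr_cpus, unsigned int parallelism,
                     struct blocked_thread **blocked,
                     unsigned int nr_blocked, uint64_t timestamp)
{
        /* CPU load is already at max: don't inject IO wait events. */
        if (parallelism >= nr_cpus)
                return;

        /* No blocked threads: nothing to account for. */
        if (nr_blocked == 0)
                return;

        /*
         * C idle CPUs and B blocked threads: hypothesise that each
         * blocked thread could equally fill the idle CPUs, so each
         * gets weight C/B.
         */
        unsigned int idle_cpus = nr_cpus - parallelism;
        double weight = (double)idle_cpus / nr_blocked;

        for (unsigned int i = 0; i < nr_blocked; i++)
                inject_io_wait_sample(blocked[i], timestamp, weight);
}

With a single blocked thread this degenerates to the example above:
that thread gets the full weight of all currently idle CPUs.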
This still has the problem of attributing latency to unrelated
threads, but at least it's a simple/explainable model and it should
show the guilty threads as well. Maybe unrelated threads could be
filtered out by the user by specifying a set of symbols that appear
in the stacks of unrelated threads.
Does it make any sense? Do you have anything better?
> Wrt perf report, I keep trying to push the python support in perf
> forward. These unmerged changes show an event being parsed, and ring
> buffer based sampling in a reasonably small number of lines of code in
> a way not dissimilar to a perf command line:
> https://lore.kernel.org/lkml/20250109075108.7651-12-irogers@google.com/
> Building a better UI on top of this in python means there are some
> reasonable frameworks that can be leveraged, I particularly like the
> look of textual:
> https://github.com/textualize/textual-demo
> which imo would move things a lot further forward than UI stuff in C
> and slang/stdio.
>
> Sorry for all this tangential stuff, I like the work and will try to
> delve into specifics later.
>
> Thanks,
> Ian