Message-ID: <CACT4Y+YMAM1eMEEWhGsOcGqPT2bn+4FRSp5ORgF0Qji8nmBzdQ@mail.gmail.com>
Date: Sun, 19 Jan 2025 12:08:36 +0100
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Ian Rogers <irogers@...gle.com>, linux-perf-users@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>, Stephane Eranian <eranian@...gle.com>,
Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>,
chu howard <howardchu95@...il.com>
Subject: Off-CPU sampling (was perf report: Add wall-clock and parallelism profiling)
On Thu, 16 Jan 2025 at 19:55, Namhyung Kim <namhyung@...nel.org> wrote:
>
> On Wed, Jan 15, 2025 at 08:11:51AM +0100, Dmitry Vyukov wrote:
> > On Wed, 15 Jan 2025 at 06:59, Ian Rogers <irogers@...gle.com> wrote:
> > >
> > > On Tue, Jan 14, 2025 at 12:27 AM Dmitry Vyukov <dvyukov@...gle.com> wrote:
> > > [snip]
> > > > FWIW I've also considered and started implementing a different
> > > > approach where the kernel would count parallelism level for each
> > > > context and write it out with samples:
> > > > https://github.com/dvyukov/linux/commit/56ee1f638ac1597a800a30025f711ab496c1a9f2
> > > > Then sched_in/out would need to do atomic inc/dec on a global counter.
> > > > Not sure how hard it is to make all corner cases work there, I dropped
> > > > it half way b/c the perf record post-processing looked like a better
> > > > approach.
> > >
> > > Nice. Just to focus on this point and go off on something of a
> > > tangent. I worry a little about perf_event_sample_format where we've
> > > used 25 out of the 64 bits of sample_type. Perhaps there will be a
> > > sample_type2 in the future. For the code and data page size it seems
> > > the same information could come from mmap events. You have a similar
> > > issue. I was thinking of another similar issue, adding information
> > > about the number of dirty pages in a VMA. I wonder if there is a
> > > better way to organize these things, rather than just keep using up
> > > bits in the perf_event_sample_format. For example, we could have a
> > > code page size software event that when in a leader sampling group
> > > with a hardware event with a sample IP provides the code page size
> > > information of the leader event's sample IP. We have loads of space in
> > > the types and config values to have an endless number of such events
> > > and maybe the value could be generated by a BPF program for yet more
> > > flexibility. What these events would mean without a leader sample
> > > event I'm not sure.
> >
> > In the end I did not go with adding parallelism to each sample (this
> > is purely a perf report change), so at least for this patch this is
> > very tangential :)
> >
> > > Wrt wall clock time, Howard Chu has done some work advancing off-CPU
> > > sampling. Wall clock time being off-CPU plus on-CPU time. We need to
> > > do something to move forward the default flags/options for perf
> > > record; for example, we don't enable build ID mmap events by default,
> > > causing the whole perf.data file to be scanned looking to add build ID
> > > events for the DSOs with samples in them. One option that could be a
> > > default could be off-CPU profiling, and when permissions deny the BPF
> > > approach we can fall back on using events. If these events are there
> > > by default then it makes sense to hook them up in perf report.
> >
> > Interesting. Do you mean "IO" by "off-CPU"?
> > Yes, if a program was blocked for IO for 10 seconds (no CPU work),
> > then that obviously contributes to latency, but won't be in this
> > profile. Though, it still works well for a large number of important
> > cases (e.g. builds, ML inference, server request handling are
> > frequently not IO bound).
> >
> > I was thinking how IO can be accounted for in the wall-clock profile.
> > Since we have SWITCH OUT events (and they already include preemption
> > bit), we do have info to account for blocked threads. But it gets
> > somewhat complex and has to make some hypotheses b/c not all blocked
> > threads contribute to latency (e.g. blocked watchdog thread). So I
> > left it out for now.
> >
> > The idea was as follows.
> > We know a set of threads blocked at any point in time (switched out,
> > but not preempted).
> > We hypothesise that each of these could equally improve CPU load to
> > max if/when unblocked.
> > We inject synthetic samples into a leaf "IO wait" symbol with the
> > weight according to the hypothesis. I think these events only need to
> > be injected before each switch-in event (which is what changes the set
> > of blocked threads).
> >
> > Namely:
> > If CPU load is already at max (parallelism == num cpus), we don't
> > inject any IO wait events.
> > If the number of blocked threads is 0, we don't inject any IO wait events.
> > If there are C idle CPUs and B blocked threads, we inject IO wait
> > events with weight C/B for each of them.
>
> To track idle CPUs, you need sched-switch events from all CPUs regardless
> of your workload, right? Also I'm not sure when you want to inject the IO
> wait events - when a thread is sched-out without preemption? And what
> would be the weight? I guess you want something like:
>
> blocked time * C / B
>
> Then C and B can change before the thread is woken up.
Yes, these events need to be emitted on every switch-in/out in the
trace so that a long-blocked thread gets multiple events with
different weights.
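
To make the rules quoted above concrete, the per-interval weight
computation in the perf report post-processing could look roughly like
this (just a sketch; the struct and function names are made up, not
actual perf code):

/*
 * Sketch only: computes, at each context-switch event in the trace,
 * the weight of the synthetic "IO wait" sample injected for every
 * currently blocked (switched out, not preempted) thread.
 */
#include <stdint.h>

struct wallclock_state {
	uint64_t nr_cpus;     /* CPUs covered by the trace */
	uint64_t nr_running;  /* current parallelism level */
	uint64_t nr_blocked;  /* switched out but not preempted */
};

/*
 * Weight attributed to each blocked thread for the interval since the
 * previous switch event, i.e. "blocked time * C / B".
 * Returns 0 when nothing should be injected.
 */
static uint64_t io_wait_weight(const struct wallclock_state *st,
			       uint64_t interval_ns)
{
	uint64_t idle_cpus;

	if (st->nr_running >= st->nr_cpus)	/* CPUs already at max load */
		return 0;
	if (st->nr_blocked == 0)		/* nobody is blocked */
		return 0;

	idle_cpus = st->nr_cpus - st->nr_running;		/* C */
	return interval_ns * idle_cpus / st->nr_blocked;	/* t * C / B */
}

The caller would then, before processing the switch-in, inject one
synthetic sample with a leaf "IO wait" symbol and this weight for each
blocked thread.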
> > For example, if there is a single blocked thread, then we hypothesise
> > that this blocked thread is the root cause of all currently idle CPUs
> > being idle.
>
> I think this may make sense when you target a single process group but
> it also needs system-wide idle information.
I assumed this profiling is done on a mostly idle system (which is
generally a good idea for any profiling).
Theoretically, we could look at runnable threads rather than running
ones. If there are already NumCPU runnable threads, then creating more
runnable threads won't help.
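
Purely illustrative (made-up names), the check would then become
something like:

/*
 * Same idea as above, but based on runnable threads (running + ready
 * to run) instead of idle CPUs, so unrelated load on the system does
 * not show up as attributable idleness.
 */
static int can_use_more_parallelism(uint64_t nr_runnable, uint64_t nr_cpus)
{
	/*
	 * With NumCPU runnable threads, waking more blocked threads
	 * would not increase the workload's CPU utilization.
	 */
	return nr_runnable < nr_cpus;
}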
> > This still has a problem of saying that unrelated threads contribute
> > to latency, but at least it's a simple/explainable model and it should
> > show guilty threads as well. Maybe unrelated threads can be filtered
> > by the user by specifying a set of symbols in stacks of unrelated
> > threads.
>
> Or by task name.
>
> >
> > Does it make any sense? Do you have anything better?
>
> I'm not sure if it's right to use idle state which will be affected by
> unrelated processes. Maybe it's good for system-wide profiling.
>
> For a process (group) profiling, I think you need to consider the number
> of total threads, active threads, and CPUs. And if the #active-threads is
> less than min(#total-threads, #CPUs), then it could be considered as
> idle from the workload's perspective.
>
> What do you think?
I don't know, hard to say. I see what you mean, but this makes the
problem even harder, and potentially breaks the hypotheses we are
making.
For example, suppose we have 2 unrelated workloads A and B running on
the machine. Their high- and low-parallelism phases will overlap
randomly, and we would draw conclusions from that, but these overlaps
are really random and may not hold next time. Or next time A may be
co-located with C.
I would solve the simpler problem of profiling a single workload on a
mostly idle system first, and only then move to the harder case. Are
you considering this for GWP-type profiling?