Message-ID: <aCveO4qQGy03ow5p@google.com>
Date: Mon, 19 May 2025 18:43:23 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Dmitry Vyukov <dvyukov@...gle.com>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>,
	Ian Rogers <irogers@...gle.com>,
	Kan Liang <kan.liang@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
	Adrian Hunter <adrian.hunter@...el.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
	linux-perf-users@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>
Subject: Re: [RFC/PATCH] perf report: Support latency profiling in
 system-wide mode

On Mon, May 19, 2025 at 08:00:49AM +0200, Dmitry Vyukov wrote:
> On Fri, 16 May 2025 at 18:33, Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > Hello,
> >
> > Sorry for the delay.
> >
> > On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@...nel.org> wrote:
> > > >
> > > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@...nel.org> wrote:
> > > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > > Is it that PERF_SAMPLE_CPU is present only for system-wide profiles?
> > > > > > > >
> > > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > > >
> > > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > > >
> > > > > > Yep, that's why I said "Basically".  So it's not a 100% guarantee.
> > > > > >
> > > > > > We may disable the latency column by default in this case and
> > > > > > show a warning if it's requested.  Or we may add a new attribute
> > > > > > to emit sched-switch records only for idle tasks and enable the
> > > > > > latency report only if the data has sched-switch records.
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > Depends on what problem we are trying to solve:
> > > > >
> > > > > 1. Enabling latency profiling for system-wide mode.
> > > > >
> > > > > 2. Switch events bloating trace too much.
> > > > >
> > > > > 3. Lost switch events lead to imprecise accounting.
> > > > >
> > > > > The patch mentions all 3 :)
> > > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > > An active single process profile can emit more samples than a
> > > > > system-wide profile on a lightly loaded system.
> > > >
> > > > True.  But we don't need to care about lightly loaded systems as they
> > > > won't cause problems.
> > > >
> > > >
> > > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > > equally subject to the lost events problem.
> > > >
> > > > Right, but I'm afraid that in practice it'll increase the chance of
> > > > lost records in system-wide mode.  The default sample size in
> > > > system-wide mode is 56 bytes and a switch record is 48 bytes.  And
> > > > the default sample frequency is 4000 Hz, but we cannot control the
> > > > rate of switches; I saw around 10000 switches per second per CPU on
> > > > my work env.
> > > >
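
To put rough per-CPU numbers on the figures above:

  samples:   4000/s * 56 bytes = 224 KB/s
  switches: 10000/s * 48 bytes = 480 KB/s

so on a busy system the switch records alone can more than double the
data rate of the stream.
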
> > > > >
> > > > > For problem 1: we can just permit --latency for system-wide mode
> > > > > and fully rely on switch events.
> > > > > It's not any worse than what we do now (wrt both profile size and
> > > > > lost events).
> > > >
> > > > This can be an option and it'd work well on lightly loaded systems.
> > > > Maybe we can just try it first.  But I think it's better to have an
> > > > option to make it work on heavily loaded systems.
> > > >
> > > > >
> > > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > > maybe just a fake CPU sample for an idle task? That's effectively
> > > > > what we want, and then your current accounting code will work w/o
> > > > > any changes.  This should help wrt trace size only for system-wide
> > > > > mode (provided that the user already enables CPU accounting for
> > > > > other reasons, otherwise
> > > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > > switch events).
> > > >
> > > > I'm not sure how we can add the fake samples.  The switch events come
> > > > from the kernel, so we may need to add the condition in the attribute.
> > > >
> > > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > > >
> > > > >
> > > > > For problem 3: switches to idle task won't really help. There can be
> > > > > lots of them, and missing any will lead to wrong accounting.
> > > >
> > > > I don't know how severe the situation will be.  On heavily loaded
> > > > systems, the idle task won't run much and the data size won't
> > > > increase.  On lightly loaded systems, the increased data will likely
> > > > be handled well.
> > > >
> > > >
> > > > > A principled approach would be to attach a per-thread scheduler
> > > > > quantum sequence number to each CPU sample. The sequence number would
> > > > > be incremented on every context switch. Then any subset of CPU
> > > > > samples should be enough to understand when a task was scheduled in
> > > > > and out
> > > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > > switched out on the last sample with sequence number N).
> > > >
> > > > I'm not sure how it can help.  We don't need the switch info itself.
> > > > What's needed is to know when the CPU was idle, right?
> > >
> > > I mean the following.
> > > Each sample has a TID.
> > > We add a SEQ field, which is per-thread and is incremented after every
> > > rescheduling of the thread.
> > >
> > > When we see the last sample for (TID,SEQ), we pretend there is a SCHED
> > > OUT event for this thread at this timestamp. When we see the first
> > > sample for (TID,SEQ+1), we pretend there is a SCHED IN event for this
> > > thread at this timestamp.
> > >
> > > These SCHED IN/OUT events are not injected by the kernel. We just
> > > pretend they happen for accounting purposes. We may actually
> > > materialize them in the perf tool, or we may just update parallelism
> > > as if they happened.
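
Just to make sure I follow, here is a minimal sketch of how I read the
(TID, SEQ) idea.  The types and the account_sched_in/out() hooks below
are made up for illustration; this is not actual perf code:

  #include <stdint.h>
  #include <sys/types.h>

  /* hypothetical hooks into the parallelism accounting */
  void account_sched_in(pid_t tid, uint64_t time);
  void account_sched_out(pid_t tid, uint64_t time);

  /* one scheduling quantum of a thread, keyed by (tid, seq) */
  struct quantum {
          pid_t    tid;
          uint64_t seq;
          uint64_t first_time;   /* first sample seen for (tid, seq) */
          uint64_t last_time;    /* last sample seen for (tid, seq)  */
  };

  /* pass 1: extend the quantum that a sample belongs to */
  static void pass1_update(struct quantum *q, uint64_t sample_time)
  {
          if (!q->first_time || sample_time < q->first_time)
                  q->first_time = sample_time;
          if (sample_time > q->last_time)
                  q->last_time = sample_time;
  }

  /* pass 2: replay each quantum as pretend SCHED IN/OUT events */
  static void pass2_replay(const struct quantum *q)
  {
          account_sched_in(q->tid, q->first_time);
          account_sched_out(q->tid, q->last_time);
  }

Is that roughly what you have in mind?
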
> >
> > Thanks for the explanation.  But I don't think we need the SEQ and the
> > SCHED IN/OUT generated from it to track lost records.  Please see below.
> >
> > >
> > > With this scheme we can lose absolutely any subset of samples, and
> > > still get very precise accounting. When we lose samples, the profile
> > > of course becomes a bit less precise, but the effect is local and
> > > recoverable.
> > >
> > > If we lose the last/first event for (TID,SEQ), then we slightly
> > > shorten/postpone the thread accounting at the process parallelism
> > > level. If we lose a middle sample for (TID,SEQ), then parallelism is
> > > not affected.
> >
> > I'm afraid it cannot check parallelism by just looking at the current
> > thread.  I guess it would need information from other threads even if it
> > has the same SEQ.
> 
> Yes, we still count parallelism like you do in this patch; we just use
> the SEQ info instead of CPU numbers and explicit switch events.

I mean after a record is lost, let's say:

  t1: SAMPLE for TID 1234, seq 10  (parallelism = 4)
  t2: LOST
  t3: SAMPLE for TID 1234, seq 10  (parallelism = ?)

I don't think we can continue to use a parallelism of 4 after the LOST
even if it has the same seq, because we cannot know whether other threads
switched on other CPUs.  Then do we really need the seq?

> 
> > Also, postponing the thread accounting can be complex.  I think it
> > should wait for all other threads to get a sample.  Maybe some threads
> > exited and were lost too.
> 
> Yes, in order to understand which is the last event for (TID,SEQ) we
> need to look ahead and find the event for (TID,SEQ+1). The easiest way
> to do it would be to do 2 passes over the trace. That's the cost of
> saving trace space + being resilient to lost events.
> 
> Do you see any other issues with this scheme besides requiring 2 passes?

Well.. the 2 passes themselves can be a problem due to the slowness they'd
bring.  Some people already complain about the speed of perf report as it
is now.

I think we can simply reset the parallelism in all processes after a LOST
record and set the current process on each CPU to the idle task.  It'll
catch up as soon as it sees samples from all CPUs.
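
Something like the following sketch, just to illustrate the idea (the
per-cpu table, the MAX_CPUS limit and the helper names are made up, not
the actual patch code):

  #include <sys/types.h>

  #define MAX_CPUS 4096

  /* which thread we believe runs on each CPU; 0 == idle (swapper) */
  static pid_t cpu_curr_tid[MAX_CPUS];

  static void handle_lost(void)
  {
          /* after a LOST record, assume every CPU went back to idle */
          for (int cpu = 0; cpu < MAX_CPUS; cpu++)
                  cpu_curr_tid[cpu] = 0;
  }

  static void handle_sample(int cpu, pid_t tid)
  {
          /* parallelism catches up as each CPU reports a sample again */
          cpu_curr_tid[cpu] = tid;
  }

  static int parallelism(void)
  {
          int n = 0;

          for (int cpu = 0; cpu < MAX_CPUS; cpu++)
                  if (cpu_curr_tid[cpu] != 0)
                          n++;
          return n;
  }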

> 
> 
> > In my approach, I can clear the current thread from all CPUs when it
> > sees a LOST record and restart the calculation of parallelism.
> >
> > >
> > > Switches from a thread to itself are not a problem. We will just
> > > inject a SCHED OUT followed by a SCHED IN. But exactly the same happens
> > > now when the kernel injects these events.
> > >
> > > But if we switch to the idle task and get no samples for some period
> > > of time on the CPU, then we properly inject SCHED OUT/IN events that
> > > account for the thread not actually running.
> >
> > Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> > CPU and clear the current thread after some period (2x of given freq?).
> > But it may slow down perf report as it'd check all CPUs for each sample.
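
For reference, the timeout check on top of the same kind of per-cpu table
as in the sketch above could look like this (again illustrative only; the
2x threshold is just the guess from above, and cpu_last_time[] would be
refreshed as each sample is processed):

  #include <stdint.h>
  #include <sys/types.h>

  #define MAX_CPUS 4096

  static uint64_t cpu_last_time[MAX_CPUS];  /* time of last sample per CPU */
  static pid_t    cpu_curr_tid[MAX_CPUS];   /* 0 == idle (swapper)         */

  /* called for every sample; this loop is the part that may slow down
   * perf report */
  static void expire_stale_cpus(uint64_t now, uint64_t sample_period_ns)
  {
          for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
                  if (cpu_curr_tid[cpu] != 0 &&
                      now - cpu_last_time[cpu] > 2 * sample_period_ns)
                          cpu_curr_tid[cpu] = 0;   /* assume it went idle */
          }
  }
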
> >
> > Thanks,
> > Namhyung
> >
