Message-ID: <CACT4Y+YHxXjCU2jySTUO5kH=xC8scdzTTuP2qEBc5zMber44Aw@mail.gmail.com>
Date: Mon, 19 May 2025 08:00:49 +0200
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>, Ian Rogers <irogers@...gle.com>, 
	Kan Liang <kan.liang@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, Peter Zijlstra <peterz@...radead.org>, 
	Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>, 
	linux-perf-users@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>
Subject: Re: [RFC/PATCH] perf report: Support latency profiling in system-wide mode

On Fri, 16 May 2025 at 18:33, Namhyung Kim <namhyung@...nel.org> wrote:
>
> Hello,
>
> Sorry for the delay.
>
> On Thu, May 08, 2025 at 02:24:08PM +0200, Dmitry Vyukov wrote:
> > On Thu, 8 May 2025 at 01:43, Namhyung Kim <namhyung@...nel.org> wrote:
> > >
> > > On Tue, May 06, 2025 at 09:40:52AM +0200, Dmitry Vyukov wrote:
> > > > On Tue, 6 May 2025 at 09:10, Namhyung Kim <namhyung@...nel.org> wrote:
> > > > > > > > Where does the patch check that this mode is used only for system-wide profiles?
> > > > > > > > Is it that PERF_SAMPLE_CPU is present only for system-wide profiles?
> > > > > > >
> > > > > > > Basically yes, but you can use --sample-cpu to add it.
> > > > > >
> > > > > > Are you sure? --sample-cpu seems to work for non-system-wide profiles too.
> > > > >
> > > > > Yep, that's why I said "Basically".  So it's not a 100% guarantee.
> > > > >
> > > > > We may disable the latency column by default in this case and show a
> > > > > warning if it's requested.  Or we may add a new attribute to emit
> > > > > sched-switch records only for idle tasks, and enable the latency report
> > > > > only if the data has sched-switch records.
> > > > >
> > > > > What do you think?
> > > >
> > > > Depends on what problem we are trying to solve:
> > > >
> > > > 1. Enabling latency profiling for system-wide mode.
> > > >
> > > > 2. Switch events bloating the trace too much.
> > > >
> > > > 3. Lost switch events lead to imprecise accounting.
> > > >
> > > > The patch mentions all 3 :)
> > > > But I think 2 and 3 are not really specific to system-wide mode.
> > > > An active single-process profile can emit more samples than a
> > > > system-wide profile on a lightly loaded system.
> > >
> > > True.  But we don't need to care about lightly loaded systems as they
> > > won't cause problems.
> > >
> > >
> > > > Similarly, if we rely on switch events for system-wide mode, then it's
> > > > equally subject to the lost events problem.
> > >
> > > Right, but I'm afraid it will practically increase the chance of lost
> > > records in system-wide mode.  The default sample size in system-wide
> > > mode is 56 bytes and a switch record is 48 bytes.  And the default
> > > sample frequency is 4000 Hz, but the rate of switches cannot be
> > > controlled.  I saw around 10000 Hz of switches per CPU in my work
> > > environment.
> > >
> > > >
> > > > For problem 1: we can just permit --latency for system-wide mode and
> > > > fully rely on switch events.
> > > > It's not any worse than what we do now (wrt both profile size and lost
> > > > events).
> > >
> > > This can be an option and it'd work well on lightly loaded systems.
> > > Maybe we can just try it first.  But I think it's better to have an
> > > option to make it work on heavily loaded systems.
> > >
> > > >
> > > > For problem 2: yes, we could emit only switches to idle tasks. Or
> > > > maybe just a fake CPU sample for an idle task? That's effectively what
> > > > we want, and then your current accounting code will work w/o any changes.
> > > > This should help wrt trace size only for system-wide mode (provided
> > > > that the user already enables CPU accounting for other reasons, otherwise
> > > > it's unclear what's better -- attaching CPU to each sample, or writing
> > > > switch events).
> > >
> > > I'm not sure how we can add the fake samples.  The switch events come
> > > from the kernel, and we could add the condition in the attribute.
> > >
> > > And PERF_SAMPLE_CPU is on by default in system-wide mode.
> > >
> > > >
> > > > For problem 3: switches to the idle task won't really help. There can be
> > > > lots of them, and missing any will lead to wrong accounting.
> > >
> > > I don't know how severe the situation will be.  On heavily loaded
> > > systems, the idle task won't run much and data size won't increase.
> > > On lightly loaded systems, increased data will likely be handled well.
> > >
> > >
> > > > A principled approach would be to attach a per-thread scheduler
> > > > quantum sequence number to each CPU sample. The sequence number would
> > > > be incremented on every context switch. Then any subset of CPU samples
> > > > should be enough to understand when a task was scheduled in and out
> > > > (scheduled in on the first CPU sample with sequence number N, and
> > > > switched out on the last sample with sequence number N).
> > >
> > > I'm not sure how it can help.  We don't need the switch info itself.
> > > What's needed is to know when the CPU was idle, right?
> >
> > I mean the following.
> > Each sample has a TID.
> > We add a SEQ field, which is per-thread and is incremented after every
> > rescheduling of the thread.
> >
> > When we see the last sample for (TID,SEQ), we pretend there is a SCHED
> > OUT event for this thread at this timestamp. When we see the first
> > sample for (TID,SEQ+1), we pretend there is a SCHED IN event for this
> > thread at this timestamp.
> >
> > These SCHED IN/OUT events are not injected by the kernel. We just
> > pretend they happen for accounting purposes. We may actually
> > materialize them in the perf tool, or we may just update parallelism
> > as if they happened.
>
> Thanks for the explanation.  But I don't think we need the SEQ and the
> SCHED IN/OUT events generated from it to track lost records.  Please see below.
>
> >
> > With this scheme we can lose absolutely any subset of samples, and
> > still get very precise accounting. When we lose samples, the profile
> > of course becomes a bit less precise, but the effect is local and
> > recoverable.
> >
> > If we lose the last/first sample for (TID,SEQ), then we slightly
> > shorten/postpone the thread's accounting at the process parallelism
> > level. If we lose a middle sample for (TID,SEQ), then parallelism is
> > not affected.
>
> I'm afraid we cannot check parallelism by just looking at the current thread.
> I guess it would need information from other threads even if they have the
> same SEQ.

Yes, we still count parallelism like you do in this patch; we just use
the SEQ info instead of CPU numbers and explicit switch events.
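
A minimal toy sketch of the accounting I have in mind (illustrative
only, not actual perf code; the types and names below are invented). It
only shows that, once synthetic SCHED IN/OUT events exist for a thread,
per-process parallelism and the latency weight of each sample fall out
of a simple counter:

/* Illustrative only -- not perf code; all names are invented.
 * Parallelism accounting driven by synthetic SCHED IN/OUT events
 * (derived from SEQ) instead of CPU numbers or real switch records. */
#include <stdio.h>

enum ev_type { EV_SCHED_IN, EV_SAMPLE, EV_SCHED_OUT };

struct event {
	enum ev_type type;
	int tid;                    /* thread the event belongs to */
	unsigned long long period;  /* sample weight; 0 for sched events */
};

int main(void)
{
	/* Two threads of one process: T1 runs alone, then overlaps with T2. */
	struct event trace[] = {
		{ EV_SCHED_IN,  1, 0 },
		{ EV_SAMPLE,    1, 1000 },  /* parallelism 1 -> full weight */
		{ EV_SCHED_IN,  2, 0 },
		{ EV_SAMPLE,    1, 1000 },  /* parallelism 2 -> half weight */
		{ EV_SAMPLE,    2, 1000 },
		{ EV_SCHED_OUT, 2, 0 },
		{ EV_SAMPLE,    1, 1000 },  /* back to parallelism 1 */
		{ EV_SCHED_OUT, 1, 0 },
	};
	int parallelism = 0;  /* threads of the process currently running */

	for (unsigned int i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
		struct event *e = &trace[i];

		switch (e->type) {
		case EV_SCHED_IN:
			parallelism++;
			break;
		case EV_SCHED_OUT:
			parallelism--;
			break;
		case EV_SAMPLE:
			/* Latency column: divide the sample period by the
			 * process parallelism at that moment. */
			printf("tid=%d cpu=%llu latency=%llu\n", e->tid,
			       e->period, e->period / parallelism);
			break;
		}
	}
	return 0;
}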

> Also, postponing thread accounting can be complex.  I think it would need
> to wait for all other threads to get a sample.  Maybe some threads exited
> and their samples were lost too.

Yes, in order to understand which is the last event for (TID,SEQ), we
need to look ahead and find an event for (TID,SEQ+1). The easiest way to
do that would be to do 2 passes over the trace. That's the cost of
saving trace space + being resilient to lost events.
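
To make the lookahead concrete, here is a toy 2-pass sketch (again
illustrative only, not perf code; the SEQ field and all names are
assumed). Pass 1 walks the samples backwards to find the last sample of
each (TID,SEQ) run; pass 2 then emits a synthetic SCHED IN at the first
sample of a run and a SCHED OUT at the last one while keeping the
parallelism counter:

/* Illustrative only -- not perf code; the SEQ field and names are assumed.
 * Two passes over a time-sorted sample array: pass 1 records the index of
 * the last sample of each (tid, seq) run, pass 2 turns run boundaries into
 * synthetic SCHED IN/OUT and keeps a parallelism counter. */
#include <stdio.h>

struct sample {
	unsigned long long time;
	int tid;
	int seq;  /* assumed: incremented by the kernel on every reschedule */
};

#define NSAMPLES 6
#define MAX_TID  16

int main(void)
{
	/* Samples sorted by time; all from the same process. */
	struct sample s[NSAMPLES] = {
		{ 100, 1, 7 }, { 110, 2, 3 }, { 120, 1, 7 },
		{ 130, 2, 3 }, { 140, 1, 8 }, { 150, 2, 4 },
	};
	int last_idx[NSAMPLES];  /* index of the last sample of s[i]'s run */
	int cur_seq[MAX_TID];    /* last seen seq per tid */
	int latest[MAX_TID];     /* index of that run's last sample */
	int parallelism = 0;
	int i, t;

	/* Pass 1: walk backwards; the first time we meet a (tid, seq) from
	 * the back is the last sample of that run. */
	for (t = 0; t < MAX_TID; t++) {
		cur_seq[t] = -1;
		latest[t] = -1;
	}
	for (i = NSAMPLES - 1; i >= 0; i--) {
		if (cur_seq[s[i].tid] != s[i].seq) {
			cur_seq[s[i].tid] = s[i].seq;
			latest[s[i].tid] = i;
		}
		last_idx[i] = latest[s[i].tid];
	}

	/* Pass 2: first sample of a run => synthetic SCHED IN,
	 * last sample of a run => synthetic SCHED OUT. */
	for (t = 0; t < MAX_TID; t++)
		cur_seq[t] = -1;
	for (i = 0; i < NSAMPLES; i++) {
		if (cur_seq[s[i].tid] != s[i].seq) {
			cur_seq[s[i].tid] = s[i].seq;
			parallelism++;
			printf("t=%llu tid=%d SCHED IN  (seq %d)\n",
			       s[i].time, s[i].tid, s[i].seq);
		}
		printf("t=%llu tid=%d sample, parallelism=%d\n",
		       s[i].time, s[i].tid, parallelism);
		if (last_idx[i] == i) {
			parallelism--;
			printf("t=%llu tid=%d SCHED OUT (seq %d)\n",
			       s[i].time, s[i].tid, s[i].seq);
		}
	}
	return 0;
}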

Do you see any other issues with this scheme besides requiring 2 passes?


> In my approach, I can clear the current thread from all CPUs when it
> sees a LOST record and restart the calculation of parallelism.
>
> >
> > Switches from a thread to itself are not a problem. We will just
> > inject a SCHED OUT followed by a SCHED IN. But exactly the same happens
> > now when the kernel injects these events.
> >
> > But if we switch to the idle task and get no samples on the CPU for some
> > period of time, then we properly inject SCHED OUT/IN events that account
> > for the thread not actually running.
>
> Hmm.. ok.  Maybe we can save the timestamp of the last sample on each
> CPU and clear the current thread after some period (2x of the given freq?).
> But it may slow down perf report as it would check all CPUs for each sample.
>
> Thanks,
> Namhyung
>
