linux-kernel - Re: perf regression. Was: [PATCH V4 01/16] perf: Fix the throttle logic for a group

Message-ID: <CAADnVQ+h24Sez9iaa9DdwS9sWQ4m1LXeXQM7XMPKfZO7FmUtMg@mail.gmail.com>
Date: Mon, 2 Jun 2025 11:14:59 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Namhyung Kim <namhyung@...nel.org>, Ian Rogers <irogers@...gle.com>, 
	Mark Rutland <mark.rutland@....com>, LKML <linux-kernel@...r.kernel.org>, 
	"linux-perf-use." <linux-perf-users@...r.kernel.org>, Stephane Eranian <eranian@...gle.com>, 
	Chun-Tse Shao <ctshao@...gle.com>, Thomas Richter <tmricht@...ux.ibm.com>, Leo Yan <leo.yan@....com>, 
	bpf <bpf@...r.kernel.org>, Andrii Nakryiko <andrii@...nel.org>, 
	Ihor Solodrai <ihor.solodrai@...ux.dev>, Song Liu <song@...nel.org>, Jiri Olsa <jolsa@...nel.org>
Subject: Re: perf regression. Was: [PATCH V4 01/16] perf: Fix the throttle
 logic for a group

On Mon, Jun 2, 2025 at 10:51 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
>
>
> On 2025-06-02 12:24 p.m., Alexei Starovoitov wrote:
> > On Mon, Jun 2, 2025 at 5:55 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> >>
> >> Hi Alexei,
> >>
> >> On 2025-06-01 8:30 p.m., Alexei Starovoitov wrote:
> >>> On Tue, May 20, 2025 at 11:16:29AM -0700, kan.liang@...ux.intel.com wrote:
> >>>> From: Kan Liang <kan.liang@...ux.intel.com>
> >>>>
> >>>> The current throttle logic doesn't work well with a group, e.g., the
> >>>> following sampling-read case.
> >>>>
> >>>> $ perf record -e "{cycles,cycles}:S" ...
> >>>>
> >>>> $ perf report -D | grep THROTTLE | tail -2
> >>>>             THROTTLE events:        426  ( 9.0%)
> >>>>           UNTHROTTLE events:        425  ( 9.0%)
> >>>>
> >>>> $ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
> >>>> 0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
> >>>> ... sample_read:
> >>>> .... group nr 2
> >>>> ..... id 0000000000000327, value 000000000cbb993a, lost 0
> >>>> ..... id 0000000000000328, value 00000002211c26df, lost 0
> >>>>
> >>>> The second cycles event has a much larger value than the first cycles
> >>>> event in the same group.
> >>>>
> >>>> The current throttle logic in the generic code only logs the THROTTLE
> >>>> event. It relies on the specific driver implementation to disable
> >>>> events. For all ARCHs, the implementation is similar. Only the event is
> >>>> disabled, rather than the group.
> >>>>
> >>>> The logic to disable the group should be generic for all ARCHs. Add the
> >>>> logic in the generic code. The following patch will remove the buggy
> >>>> driver-specific implementation.
> >>>>
> >>>> The throttle only happens when an event is overflowed. Stop the entire
> >>>> group when any event in the group triggers the throttle.
> >>>> The MAX_INTERRUPTS is set to all throttle events.
> >>>>
> >>>> The unthrottle could happen in 3 places.
> >>>> - event/group sched. All events in the group are scheduled one by one.
> >>>>   All of them will be unthrottled eventually. Nothing needs to be
> >>>>   changed.
> >>>> - The perf_adjust_freq_unthr_events for each tick. Needs to restart the
> >>>>   group altogether.
> >>>> - The __perf_event_period(). The whole group needs to be restarted
> >>>>   altogether as well.
> >>>>
> >>>> With the fix,
> >>>> $ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
> >>>> 0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
> >>>> ... sample_read:
> >>>> .... group nr 2
> >>>> ..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
> >>>> ..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0
> >>>>
> >>>> Suggested-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> >>>> Signed-off-by: Kan Liang <kan.liang@...ux.intel.com>
> >>>> ---
> >>>>  kernel/events/core.c | 66 ++++++++++++++++++++++++++++++--------------
> >>>>  1 file changed, 46 insertions(+), 20 deletions(-)
> >>>
> >>> This patch breaks perf hw events somehow.
> >>>
> >>> After merging this into bpf trees we see random "watchdog: BUG: soft lockup"
> >>> with various stack traces followed up:
> >>> [   78.620749] Sending NMI from CPU 8 to CPUs 0:
> >>> [   76.387722] NMI backtrace for cpu 0
> >>> [   76.387722] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G           O L      6.15.0-10818-ge0f0ee1c31de #1163 PREEMPT
> >>> [   76.387722] Tainted: [O]=OOT_MODULE, [L]=SOFTLOCKUP
> >>> [   76.387722] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> >>> [   76.387722] RIP: 0010:_raw_spin_lock_irqsave+0xc/0x40
> >>> [   76.387722] Call Trace:
> >>> [   76.387722]  <IRQ>
> >>> [   76.387722]  hrtimer_try_to_cancel.part.0+0x24/0xe0
> >>> [   76.387722]  hrtimer_cancel+0x21/0x40
> >>> [   76.387722]  cpu_clock_event_stop+0x64/0x70
> >>
> >>
> >> The issues should be fixed by the patch.
> >> https://lore.kernel.org/lkml/20250528175832.2999139-1-kan.liang@linux.intel.com/
> >>
> >> Could you please give it a try?
> >
> > Thanks. It fixes it, but the commit log says that
> > only cpu-clock and task_clock are affected,
> > which are SW events.
>
> Yes, only the two SW events are affected.
>
> >
> > While our tests are locking up while setting up:
> >
> >         struct perf_event_attr attr = {
> >                 .freq = 1,
> >                 .type = PERF_TYPE_HARDWARE,
> >                 .config = PERF_COUNT_HW_CPU_CYCLES,
> >         };
> >
> > Is it because we run in x86 VM and HW_CPU_CYCLES is mapped
> > to cpu-clock sw ?
>
> No, that's from a different PMU. We never map HW_CPU_CYCLES to a SW event.
> It will error out if the PMU is not available.
>
> I'm not familiar with your test case and env. At least, I saw
> PERF_COUNT_SW_CPU_CLOCK is used in the case unpriv_bpf_disabled.

I see. The first test was necessary to create throttle conditions
for the 2nd test that actually used cpu-clock.
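
For illustration only, the two attribute setups involved might look roughly
like the sketch below: a frequency-mode PERF_TYPE_HARDWARE cycles event (as in
the attr quoted above) and a PERF_COUNT_SW_CPU_CLOCK event. The helper names
and the sampling values passed to them are assumptions made for this sketch;
they are not taken from the actual selftests.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: frequency-mode hardware cycles event, mirroring the
 * attr quoted in the thread. freq_hz is an assumed value chosen by the caller.
 */
static struct perf_event_attr make_hw_cycles_attr(unsigned long long freq_hz)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.freq = 1;			/* interpret sample_freq, not sample_period */
	attr.sample_freq = freq_hz;
	return attr;
}

/*
 * Hypothetical helper: cpu-clock software event, one of the two SW events
 * the fix is said to cover. period_ns is an assumed sampling period.
 */
static struct perf_event_attr make_sw_cpu_clock_attr(unsigned long long period_ns)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.sample_period = period_ns;	/* freq stays 0: fixed-period sampling */
	return attr;
}
```

Either struct would then be handed to the perf_event_open() syscall by the
caller; whether the open succeeds depends on the PMU and on the permissions
available in the environment.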

Feel free to add
Tested-by: Alexei Starovoitov <ast@...nel.org>

I've applied your patch to bpf tree for now to stop the bleeding.
Will drop it when the fix gets to Linus through perf trees.