Message-ID: <CAHPNGSTgahRhAg5eM=pnn45rw=TJzTz4AfeckcB2RcsPvxCeGg@mail.gmail.com>
Date: Sun, 12 Oct 2025 23:55:07 -0700
From: Octavia Togami <octavia.togami@...il.com>
To: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
Cc: Octavia Togami <octavia.togami@...il.com>, stable@...r.kernel.org, 
	regressions@...ts.linux.dev, peterz@...radead.org, 
	linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org
Subject: Re: [REGRESSION] bisected: perf: hang when using async-profiler
 caused by perf: Fix the POLL_HUP delivery breakage

That change appears to fix the problem on my end. I ran my reproducer
and some other tests multiple times without issue.

On Sun, Oct 12, 2025 at 7:34 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>
>
> On 10/11/2025 4:31 PM, Octavia Togami wrote:
> > Using async-profiler
> > (https://github.com/async-profiler/async-profiler/) on Linux
> > 6.17.1-arch1-1 causes a complete hang of the CPU. This has been
> > reported by many people at https://github.com/lucko/spark/issues/530.
> > spark is a piece of software that uses async-profiler internally.
> >
> > As seen in https://github.com/lucko/spark/issues/530#issuecomment-3339974827,
> > this was bisected to 18dbcbfabfffc4a5d3ea10290c5ad27f22b0d240 perf:
> > Fix the POLL_HUP delivery breakage. Reverting this commit on 6.17.1
> > fixed the issue for me.
> >
> > Steps to reproduce:
> > 1. Get a copy of async-profiler. I tested both v3 (affects older spark
> > versions) and v4.1 (latest at time of writing). Unarchive it; the
> > resulting directory is <async-profiler-dir>.
> > 2. Set kernel parameters kernel.perf_event_paranoid=1 and
> > kernel.kptr_restrict=0 as instructed by
> > https://github.com/async-profiler/async-profiler/blob/fb673227c7fb311f872ce9566769b006b357ecbe/docs/GettingStarted.md
> > 3. Install a version of Java that comes with jshell, i.e. Java 9 or
> > newer. Note: jshell is used for ease of reproduction. Any Java
> > application that is actively running will work.
> > 4. Run `printf 'int acc; while (true) { acc++; }' | jshell -`. This
> > will start an infinitely running Java process.
> > 5. Run `jps` and take the PID next to the text RemoteExecutionControl
> > -- this is the process that was just started.
> > 6. Attach async-profiler to this process by running
> > `<async-profiler-dir>/bin/asprof -d 1 <PID>`. This will run for one
> > second, then the system should freeze entirely shortly thereafter.
> >
> > I triggered a sysrq crash while the system was frozen, and the output
> > I found in journalctl afterwards is at
> > https://gist.github.com/octylFractal/76611ee76060051e5efc0c898dd0949e
> > I'm not sure if that text is actually from the triggered crash, but it
> > seems relevant. If needed, please tell me how to get the actual crash
> > report; I'm not sure where it is.
> >
> > I'm using an AMD Ryzen 9 5900X 12-Core Processor. Given that I've seen
> > no Intel reports, it may be AMD specific. I don't have an Intel CPU on
> > hand to test with.
> >
> > /proc/version: Linux version 6.17.1-arch1-1 (linux@...hlinux) (gcc
> > (GCC) 15.2.1 20250813, GNU ld (GNU Binutils) 2.45.0) #1 SMP
> > PREEMPT_DYNAMIC Mon, 06 Oct 2025 18:48:29 +0000
> > Operating System: Arch Linux
> > uname -mi: x86_64 unknown
>
> It looks like the issue described in the link
> (https://lore.kernel.org/all/20250606192546.915765-1-kan.liang@linux.intel.com/T/#u)
> has happened again, but in a different way. :(
>
> As the commit message at the link above describes, cpu-clock (and
> task-clock) is a special SW event that relies on an hrtimer. The hrtimer
> handler calls __perf_event_overflow(), which calls event_stop
> (cpu_clock_event_stop()) and eventually hrtimer_cancel(). That call traps
> into a dead loop, because hrtimer_cancel() waits for the very hrtimer
> handler that called it to finish.
>
> As in that change
> (https://lore.kernel.org/all/20250606192546.915765-1-kan.liang@linux.intel.com/T/#u),
> it should be enough to just disable the event; an extra event stop is not
> needed.
>
> @Octavia, could you please check if the change below can fix this issue?
> Thanks.
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7541f6f85fcb..883b0e1fa5d3 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -10343,7 +10343,20 @@ static int __perf_event_overflow(struct perf_event *event,
>                 ret = 1;
>                 event->pending_kill = POLL_HUP;
>                 perf_event_disable_inatomic(event);
> -               event->pmu->stop(event, 0);
> +
> +               /*
> +                * The cpu-clock and task-clock are two special SW events,
> +                * which rely on the hrtimer. The __perf_event_overflow()
> +                * is invoked from the hrtimer handler for these 2 events.
> +                * Avoid to call event_stop()->hrtimer_cancel() for these
> +                * 2 events since hrtimer_cancel() waits for the hrtimer
> +                * handler to finish, which would trigger a deadlock.
> +                * Only disabling the events is enough to stop the hrtimer.
> +                * See perf_swevent_cancel_hrtimer().
> +                */
> +               if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK &&
> +                   event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
> +                       event->pmu->stop(event, 0);
>         }
>
>         if (event->attr.sigtrap) {
>
>
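
For readers following the thread, the self-wait described above can be
modeled in user space. This is only a rough sketch, not kernel code: the
names toy_timer, toy_try_to_cancel(), toy_cancel_bounded() and
toy_handler() are hypothetical, and the retry loop is bounded so the
example terminates instead of hanging. It assumes the behaviour the
patch comment relies on, namely that a cancel issued while the handler
is still running cannot succeed, and that a disabled event simply does
not re-arm its timer when the handler returns.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in for a timer; not the kernel's struct hrtimer. */
struct toy_timer {
    bool armed;    /* timer is queued to fire */
    bool running;  /* its handler is executing right now */
    bool disabled; /* owner asked for no further expirations */
};

/*
 * Loosely modeled on the try-to-cancel idea:
 * -1 while the handler is running, 1 if a queued timer was removed,
 *  0 if the timer was already idle.
 */
static int toy_try_to_cancel(struct toy_timer *t)
{
    if (t->running)
        return -1;
    if (t->armed) {
        t->armed = false;
        return 1;
    }
    return 0;
}

/*
 * A blocking cancel would retry until the handler has finished.
 * Bounded here (max_tries) so the demonstration returns.
 * Returns the number of failed attempts.
 */
static int toy_cancel_bounded(struct toy_timer *t, int max_tries)
{
    int failed = 0;

    while (failed < max_tries && toy_try_to_cancel(t) < 0)
        failed++;
    return failed;
}

/*
 * Timer handler. limit_hit models the overflow path that disables the
 * event; call_stop models the extra stop that ends up cancelling the
 * timer from inside its own handler. Returns the failed cancel attempts.
 */
static int toy_handler(struct toy_timer *t, bool limit_hit, bool call_stop)
{
    int failed = 0;

    t->running = true;
    t->armed = false;               /* the timer has fired */

    if (limit_hit)
        t->disabled = true;         /* "disable only" part */
    if (call_stop)
        failed = toy_cancel_bounded(t, 1000); /* can never succeed here */

    t->running = false;
    if (!t->disabled)
        t->armed = true;            /* periodic timer re-arms itself */
    return failed;
}
```

Calling toy_handler() with call_stop set exhausts every retry, which is
the user-space analogue of the hang; with only limit_hit set, the timer
ends up stopped with no failed attempts.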