Message-ID: <20250619085717.GB1613376@noisy.programming.kicks-ass.net>
Date: Thu, 19 Jun 2025 10:57:17 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
bpf@...r.kernel.org, x86@...nel.org,
Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Josh Poimboeuf <jpoimboe@...nel.org>,
Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Andrii Nakryiko <andrii@...nel.org>,
Indu Bhagat <indu.bhagat@...cle.com>,
"Jose E. Marchesi" <jemarch@....org>,
Beau Belgrave <beaub@...ux.microsoft.com>,
Jens Remus <jremus@...ux.ibm.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v10 07/14] unwind_user/deferred: Make unwind deferral
requests NMI-safe
On Tue, Jun 10, 2025 at 08:54:28PM -0400, Steven Rostedt wrote:
> +static int unwind_deferred_request_nmi(struct unwind_work *work, u64 *timestamp)
> +{
> + struct unwind_task_info *info = &current->unwind_info;
> + bool inited_timestamp = false;
> + int ret;
> +
> + /* Always use the nmi_timestamp first */
> + *timestamp = info->nmi_timestamp ? : info->timestamp;
> +
> + if (!*timestamp) {
> + /*
> + * This is the first unwind request since the most recent entry
> + * from user space. Initialize the task timestamp.
> + *
> + * Don't write to info->timestamp directly, otherwise it may race
> + * with an interruption of get_timestamp().
> + */
> + info->nmi_timestamp = local_clock();
> + *timestamp = info->nmi_timestamp;
> + inited_timestamp = true;
> + }
> +
> + if (info->pending)
> + return 1;
> +
> + ret = task_work_add(current, &info->work, TWA_NMI_CURRENT);
> + if (ret < 0) {
> + /*
> + * If this set nmi_timestamp and is not using it,
> + * there's no guarantee that it will be used.
> + * Set it back to zero.
> + */
> + if (inited_timestamp)
> + info->nmi_timestamp = 0;
> + return ret;
> + }
> +
> + info->pending = 1;
> +
> + return 0;
> +}
> +
> /**
> * unwind_deferred_request - Request a user stacktrace on task exit
> * @work: Unwind descriptor requesting the trace
> @@ -139,31 +207,38 @@ static void unwind_deferred_task_work(struct callback_head *head)
> int unwind_deferred_request(struct unwind_work *work, u64 *timestamp)
> {
> + struct unwind_task_info *info = &current->unwind_info;
> + int pending;
> int ret;
>
> *timestamp = 0;
>
> if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
> !user_mode(task_pt_regs(current)))
> return -EINVAL;
>
> + if (in_nmi())
> + return unwind_deferred_request_nmi(work, timestamp);
So nested NMI is a thing -- AFAICT this is broken in the face of nested
NMI.
Specifically, we mark all exceptions that can happen with IRQs disabled
as NMI-like (so that they don't go about taking locks etc.).
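Roughly, paraphrasing the x86 entry code from memory (simplified, not an
exact quote of arch/x86/kernel/traps.c), the kernel-mode #BP/#DB path is:

	DEFINE_IDTENTRY_RAW(exc_int3)
	{
		if (user_mode(regs)) {
			irqentry_enter_from_user_mode(regs);
			/* ... normal handling ... */
		} else {
			/* NMI-like: bumps the NMI count, so in_nmi() is true */
			irqentry_state_t irq_state = irqentry_nmi_enter(regs);
			/* ... handle the trap ... */
			irqentry_nmi_exit(regs, irq_state);
		}
	}

So in_nmi() is true there too and those paths take
unwind_deferred_request_nmi() as well.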
So imagine you're in #DB, you're asking for an unwind, you do the above
dance and get hit with an NMI.
Then you get the NMI setting nmi_timestamp, and #DB overwriting it with
a later value, and you're up the creek without a paddle.
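The obvious band-aid is to only let the first writer win; a minimal
sketch, assuming nmi_timestamp stays a plain u64 with 0 meaning unset
(the task_work_add() failure path that clears it back to 0 would need the
same treatment, and none of this helps with the clock issue below):

	u64 old = 0, ts = local_clock();

	/* Only the outermost requester gets to initialize nmi_timestamp */
	if (try_cmpxchg64(&info->nmi_timestamp, &old, ts)) {
		*timestamp = ts;
		inited_timestamp = true;
	} else {
		/* Lost the race to whoever nested above/below us; reuse it */
		*timestamp = old;
	}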
Mix that with local_clock(), which is only monotonic on a single CPU: you
ask for an unwind on CPU0, get migrated to CPU1 (which, for the sake of
argument, is behind), and now see a timestamp 'far' in the future.
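FWIW, if the value is really only used as an identifier for 'this entry
from user space', the clock could be taken out of the picture entirely.
Purely illustrative (nothing below is in this series; unwind_cookie() and
unwind_seq are made-up names):

	static DEFINE_PER_CPU(u64, unwind_seq);

	/*
	 * Hypothetical: a unique id built from the CPU number plus a
	 * per-CPU sequence counter, so nothing ever compares clocks
	 * across CPUs. Assumes it is called with preemption off (true
	 * in NMI context).
	 */
	static u64 unwind_cookie(void)
	{
		u64 cpu = raw_smp_processor_id();
		u64 seq = __this_cpu_inc_return(unwind_seq);

		return (cpu << 48) | (seq & GENMASK_ULL(47, 0));
	}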