Message-ID: <20250716142609.47f0e4a5@batman.local.home>
Date: Wed, 16 Jul 2025 14:26:09 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Steven Rostedt <rostedt@...nel.org>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org, x86@...nel.org,
 Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Josh Poimboeuf <jpoimboe@...nel.org>,
 Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...nel.org>, Namhyung Kim
 <namhyung@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, Andrii
 Nakryiko <andrii@...nel.org>, Indu Bhagat <indu.bhagat@...cle.com>, "Jose
 E. Marchesi" <jemarch@....org>, Beau Belgrave <beaub@...ux.microsoft.com>,
 Jens Remus <jremus@...ux.ibm.com>, Linus Torvalds
 <torvalds@...ux-foundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
 Jens Axboe <axboe@...nel.dk>, Florian Weimer <fweimer@...hat.com>, Sam
 James <sam@...too.org>
Subject: Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to
 user space

On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@...radead.org> wrote:

> On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> >  
> > +	/*
> > +	 * This is the first to enable another task_work for this task since
> > +	 * the task entered the kernel, or had already called the callbacks.
> > +	 * Set only the bit for this work and clear all others as they have
> > +	 * already had their callbacks called, and do not need to call them
> > +	 * again because of this work.
> > +	 */
> > +	bits = UNWIND_PENDING | BIT(bit);
> > +
> > +	/*
> > +	 * If the cmpxchg() fails, it means that an NMI came in and set
> > +	 * the pending bit as well as cleared the other bits. Just
> > +	 * jump to setting the bit for this work.
> > +	 */
> >  	if (CAN_USE_IN_NMI) {
> > -		/* Claim the work unless an NMI just now swooped in to do so. */
> > -		if (!local_try_cmpxchg(&info->pending, &pending, 1))
> > +		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> >  			goto out;
> >  	} else {
> > -		local_set(&info->pending, 1);
> > +		info->unwind_mask = bits;
> >  	}
> >  
> >  	/* The work has been claimed, now schedule it. */
> >  	ret = task_work_add(current, &info->work, TWA_RESUME);
> > -	if (WARN_ON_ONCE(ret)) {
> > -		local_set(&info->pending, 0);
> > -		return ret;
> > -	}
> >  
> > +	if (WARN_ON_ONCE(ret))
> > +		WRITE_ONCE(info->unwind_mask, 0);
> > +
> > +	return ret;
> >   out:
> > -	return test_and_set_bit(bit, &info->unwind_mask);
> > +	return test_and_set_bit(bit, &info->unwind_mask) ?
> > +		UNWIND_ALREADY_PENDING : 0;
> >  }  
> 
> This is some of the most horrifyingly confused code I've seen in a
> while.
> 
> Please just slow down and think for a minute.
> 
> The below is the last four patches rolled into one. Not been near a
> compiler.

The above is still needed as is (explained below).


> @@ -170,41 +193,62 @@ static void unwind_deferred_task_work(st
>  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
>  {
>  	struct unwind_task_info *info = &current->unwind_info;
> -	int ret;
> +	unsigned long bits, mask;
> +	int bit, ret;
>  
>  	*cookie = 0;
>  
> -	if (WARN_ON_ONCE(in_nmi()))
> -		return -EINVAL;
> -
>  	if ((current->flags & (PF_KTHREAD | PF_EXITING)) ||
>  	    !user_mode(task_pt_regs(current)))
>  		return -EINVAL;
>  
> +	/* NMI requires having safe cmpxchg operations */
> +	if (WARN_ON_ONCE(!UNWIND_NMI_SAFE && in_nmi()))
> +		return -EINVAL;
> +
> +	/* Do not allow cancelled works to request again */
> +	bit = READ_ONCE(work->bit);
> +	if (WARN_ON_ONCE(bit < 0))
> +		return -EINVAL;
> +
>  	guard(irqsave)();
>  
>  	*cookie = get_cookie(info);
>  
> -	/* callback already pending? */
> -	if (info->pending)
> +	bits = UNWIND_PENDING | BIT(bit);
> +	mask = atomic_long_fetch_or(bits, &info->unwind_mask);
> +	if (mask & bits)
>  		return 1;


So the fetch_or() isn't good enough for what needs to be done, which is
why the code above is the way it is.

We have this scenario:


  perf and ftrace are both tracing the same task. perf with bit 1 and
  ftrace with bit 2. Let's say there's even another perf program
  running that registered bit 3.


  perf requests a deferred callback, and info->unwind_mask gets bit 1
  and the pending bit set.

  The task exits to user space, calls perf's callback, and clears the
  pending bit but keeps perf's bit set, as the callback was already
  called and doesn't need to run again even if perf requests a new
  stacktrace before the task gets back to user space.

  Now before the task gets back to user space, ftrace requests the
  deferred trace. To do so, it must set the pending bit and its bit,
  but it must also clear the perf bit as it should not call perf's
  callback again.

The atomic_long_fetch_or() above will set ftrace's bit but not clear
perf's bit, so the perf callback will get called a second time even
though perf never requested another callback.

This is why the code at the top has:

	bits = UNWIND_PENDING | BIT(bit);

	/*
	 * If the cmpxchg() fails, it means that an NMI came in and set
	 * the pending bit as well as cleared the other bits. Just
	 * jump to setting the bit for this work.
	 */
	if (CAN_USE_IN_NMI) {
		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
			goto out;

That cmpxchg() clears out any of the old bits when pending isn't set.
Now if an NMI came in and the other perf process requested a callback,
it would set its own bit plus the pending bit, and then ftrace only
needs to jump to the end and do the test_and_set_bit() on its own bit.

-- Steve


>  
>  	/* The work has been claimed, now schedule it. */
>  	ret = task_work_add(current, &info->work, TWA_RESUME);
>  	if (WARN_ON_ONCE(ret))
> -		return ret;
> -
> -	info->pending = 1;
> -	return 0;
> +		atomic_long_set(0, &info->unwind_mask);
>  }
>  
