Message-ID: <20250715084932.0563f532@gandalf.local.home>
Date: Tue, 15 Jul 2025 08:49:32 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Steven Rostedt <rostedt@...nel.org>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org, x86@...nel.org,
 Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Josh Poimboeuf <jpoimboe@...nel.org>,
 Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...nel.org>, Namhyung Kim
 <namhyung@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, Andrii
 Nakryiko <andrii@...nel.org>, Indu Bhagat <indu.bhagat@...cle.com>, "Jose
 E. Marchesi" <jemarch@....org>, Beau Belgrave <beaub@...ux.microsoft.com>,
 Jens Remus <jremus@...ux.ibm.com>, Linus Torvalds
 <torvalds@...ux-foundation.org>, Andrew Morton <akpm@...ux-foundation.org>,
 Jens Axboe <axboe@...nel.dk>, Florian Weimer <fweimer@...hat.com>, Sam
 James <sam@...too.org>
Subject: Re: [PATCH v13 10/14] unwind: Clear unwind_mask on exit back to
 user space

On Tue, 15 Jul 2025 12:29:12 +0200
Peter Zijlstra <peterz@...radead.org> wrote:

> On Mon, Jul 07, 2025 at 09:22:49PM -0400, Steven Rostedt wrote:
> > diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
> > index 12bffdb0648e..587e120c0fd6 100644
> > --- a/include/linux/unwind_deferred.h
> > +++ b/include/linux/unwind_deferred.h
> > @@ -18,6 +18,14 @@ struct unwind_work {
> >  
> >  #ifdef CONFIG_UNWIND_USER
> >  
> > +#define UNWIND_PENDING_BIT	(BITS_PER_LONG - 1)
> > +#define UNWIND_PENDING		BIT(UNWIND_PENDING_BIT)  
> 
> Since it really didn't matter what bit you took, why not take bit 0?

I was worried about it affecting the global unwind_mask test, but I guess
bit zero works too. In fact, I think I could just mark the PENDING and USED
(see next patch) bits in the global unwind_mask as already taken, and then
change:

	/* See if there's a bit in the mask available */
	if (unwind_mask == ~(UNWIND_PENDING|UNWIND_USED))
		return -EBUSY;

Back to:

	/* See if there's a bit in the mask available */
	if (unwind_mask == ~0UL)
		return -EBUSY;
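
Something like this (untested sketch; UNWIND_USED is the bit added in the
next patch, and the helper name here is only for illustration):

	/* Mark the control bits as taken so they are never handed out */
	static unsigned long unwind_mask = UNWIND_PENDING | UNWIND_USED;

	static int unwind_alloc_bit(void)
	{
		/* See if there's a bit in the mask available */
		if (unwind_mask == ~0UL)
			return -EBUSY;

		/* ffz() returns the first zero bit, i.e. the first free slot */
		return ffz(unwind_mask);
	}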

> 


> >  /*
> >   * This is a unique percpu identifier for a given task entry context.
> >   * Conceptually, it's incremented every time the CPU enters the kernel from
> > @@ -143,14 +148,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >  	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
> >  	struct unwind_stacktrace trace;
> >  	struct unwind_work *work;
> > +	unsigned long bits;
> >  	u64 cookie;
> >  	int idx;
> >  
> > -	if (WARN_ON_ONCE(!local_read(&info->pending)))
> > +	if (WARN_ON_ONCE(!unwind_pending(info)))
> >  		return;
> >  
> > -	/* Allow work to come in again */
> > -	local_set(&info->pending, 0);
> > +	/* Clear pending bit but make sure to have the current bits */
> > +	bits = READ_ONCE(info->unwind_mask);
> > +	while (!try_cmpxchg(&info->unwind_mask, &bits, bits & ~UNWIND_PENDING))
> > +		;  
> 
> We have:
> 
> 	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
> 
> for that. A fair number of architecture can actually do this better than
> a cmpxchg loop.

Thanks, I didn't know about that one.
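
In other words, the loop above collapses to a single call, something like
this (assuming info->unwind_mask gets converted to an atomic_long_t so the
atomic_long_*() helpers apply):

	/* Clear the pending bit and get a snapshot of the current work bits */
	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);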

> 
> >  
> >  	/*
> >  	 * From here on out, the callback must always be called, even if it's
> > @@ -166,10 +174,8 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >  	idx = srcu_read_lock(&unwind_srcu);
> >  	list_for_each_entry_srcu(work, &callbacks, list,
> >  				 srcu_read_lock_held(&unwind_srcu)) {
> > -		if (test_bit(work->bit, &info->unwind_mask)) {
> > +		if (test_bit(work->bit, &bits))
> >  			work->func(work, &trace, cookie);
> > -			clear_bit(work->bit, &info->unwind_mask);
> > -		}
> >  	}
> >  	srcu_read_unlock(&unwind_srcu, idx);
> >  }
> > @@ -194,15 +200,17 @@ static void unwind_deferred_task_work(struct callback_head *head)
> >   * because it has already been previously called for the same entry context,
> >   * it will be called again with the same stack trace and cookie.
> >   *
> > - * Return: 1 if the the callback was already queued.
> > - *         0 if the callback successfully was queued.
> > + * Return: 0 if the callback successfully was queued.
> > + *         UNWIND_ALREADY_PENDING if the the callback was already queued.
> > + *         UNWIND_ALREADY_EXECUTED if the callback was already called
> > + *                (and will not be called again)
> >   *         Negative if there's an error.
> >   *         @cookie holds the cookie of the first request by any user
> >   */  
> 
> Lots of babbling in the Changelog, but no real elucidation as to why you
> need this second return value.
> 
> AFAICT it serves no real purpose; the users of this function should not
> care. The only difference is that the unwind reference (your cookie)
> becomes a backward reference instead of a forward reference. But why
> would anybody care?

Older versions of the code required it. I think I can remove it now.
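
That is, collapse the check back to something like (untested sketch):

	/*
	 * Already queued, or already executed for this entry context.
	 * Either way the caller gets the same cookie, so it should not
	 * need to distinguish the two cases.
	 */
	if (test_bit(bit, &old))
		return 1;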

> 
> Whatever tool is ultimately in charge of gluing humpty^Wstacktraces back
> together again should have no problem with this.
> 
> >  int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >  {
> >  	struct unwind_task_info *info = &current->unwind_info;
> > -	long pending;
> > +	unsigned long old, bits;
> >  	int bit;
> >  	int ret;
> >  
> > @@ -225,32 +233,52 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
> >  
> >  	*cookie = get_cookie(info);
> >  
> > -	/* This is already queued */
> > -	if (test_bit(bit, &info->unwind_mask))
> > -		return 1;
> > +	old = READ_ONCE(info->unwind_mask);
> > +
> > +	/* Is this already queued */
> > +	if (test_bit(bit, &old)) {
> > +		/*
> > +		 * If pending is not set, it means this work's callback
> > +		 * was already called.
> > +		 */
> > +		return old & UNWIND_PENDING ? UNWIND_ALREADY_PENDING :
> > +			UNWIND_ALREADY_EXECUTED;
> > +	}
> >  
> > -	/* callback already pending? */
> > -	pending = local_read(&info->pending);
> > -	if (pending)
> > +	if (unwind_pending(info))
> >  		goto out;
> >  
> > +	/*
> > +	 * This is the first to enable another task_work for this task since
> > +	 * the task entered the kernel, or had already called the callbacks.
> > +	 * Set only the bit for this work and clear all others as they have
> > +	 * already had their callbacks called, and do not need to call them
> > +	 * again because of this work.
> > +	 */
> > +	bits = UNWIND_PENDING | BIT(bit);
> > +
> > +	/*
> > +	 * If the cmpxchg() fails, it means that an NMI came in and set
> > +	 * the pending bit as well as cleared the other bits. Just
> > +	 * jump to setting the bit for this work.
> > +	 */
> >  	if (CAN_USE_IN_NMI) {
> > -		/* Claim the work unless an NMI just now swooped in to do so. */
> > -		if (!local_try_cmpxchg(&info->pending, &pending, 1))
> > +		if (!try_cmpxchg(&info->unwind_mask, &old, bits))
> >  			goto out;
> >  	} else {
> > -		local_set(&info->pending, 1);
> > +		info->unwind_mask = bits;
> >  	}
> >  
> >  	/* The work has been claimed, now schedule it. */
> >  	ret = task_work_add(current, &info->work, TWA_RESUME);
> > -	if (WARN_ON_ONCE(ret)) {
> > -		local_set(&info->pending, 0);
> > -		return ret;
> > -	}
> >  
> > +	if (WARN_ON_ONCE(ret))
> > +		WRITE_ONCE(info->unwind_mask, 0);
> > +
> > +	return ret;
> >   out:
> > -	return test_and_set_bit(bit, &info->unwind_mask);
> > +	return test_and_set_bit(bit, &info->unwind_mask) ?
> > +		UNWIND_ALREADY_PENDING : 0;
> >  }  
> 
> This is some of the most horrifyingly confused code I've seen in a
> while.
> 
> Please just slow down and think for a minute.
> 
> The below is the last four patches rolled into one. Not been near a
> compiler.

Are you recommending that I fold those patches into one?

I'm fine with that. Note, part of the reason things are broken up this way
is that I took Josh's code and built on top of his work. When doing that, I
try not to modify someone else's code, and instead build each stage as its
own building block.

Also, it follows the way I tend to review code, which is to take the entire
patch set, apply it, and then look at each patch compared to the final
result.

That probably explains why my patch series is confusing for you, as it was
written more for the way I review. Sorry about that.

-- Steve
