linux-kernel - Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding interface

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250702145713.57062487@batman.local.home>
Date: Wed, 2 Jul 2025 14:57:13 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, bpf@...r.kernel.org, x86@...nel.org,
 Masami Hiramatsu <mhiramat@...nel.org>, Mathieu Desnoyers
 <mathieu.desnoyers@...icios.com>, Josh Poimboeuf <jpoimboe@...nel.org>,
 Ingo Molnar <mingo@...nel.org>, Jiri Olsa <jolsa@...nel.org>, Namhyung Kim
 <namhyung@...nel.org>, Thomas Gleixner <tglx@...utronix.de>, Andrii
 Nakryiko <andrii@...nel.org>, Indu Bhagat <indu.bhagat@...cle.com>, "Jose
 E. Marchesi" <jemarch@....org>, Beau Belgrave <beaub@...ux.microsoft.com>,
 Jens Remus <jremus@...ux.ibm.com>, Andrew Morton
 <akpm@...ux-foundation.org>, Jens Axboe <axboe@...nel.dk>, Florian Weimer
 <fweimer@...hat.com>
Subject: Re: [PATCH v12 06/14] unwind_user/deferred: Add deferred unwinding
 interface

On Wed, 2 Jul 2025 11:21:34 -0700
Linus Torvalds <torvalds@...ux-foundation.org> wrote:

> On Wed, 2 Jul 2025 at 10:49, Steven Rostedt <rostedt@...dmis.org> wrote:
> >
> > To still be able to use a 32 bit cmpxchg (for racing with an NMI), we
> > could break this number up into two 32 bit words. One with the CPU that
> > it was created on, and one with the per_cpu counter:  
> 
> Do you actually even need a cpu number at all?
> 
> If this is per-thread, maybe just a per-thread counter would be good?
> And you already *have* that per-thread thing, in
> 'current->unwind_info'.

I just hate to increase the task struct even more. I'm trying to keep
only the data I can't put elsewhere in the task struct.

> 
> And the work is using task_work, so the worker callback is also per-thread.
> 
> Also, is racing with NMI even a thing for the sequence number? I would
> expect that the only thing that NMI would race with is the 'pending'
> field, not the sequence number.

The sequence number is updated on the first time it's called after a
task enters the kernel. Then it should return the same number every
time after that. That's because this unique number is the identifier
for the user space stack trace which doesn't change while the task is
in the kernel, hence the id getting updated every time the task enters
the kernel and not after that.

> 
> IOW, I think the logic could be
> 
>  - check 'pending' non-atomically, just because it's cheap

Note, later patches move the "pending" bit into a mask that is used to
figure out what callbacks to call.

> 
>  - do a try_cmpxchg() on pending to actually deal with nmi races
> 
> Actually, there are no SMP issues, just instruction atomicity - so a
> 'local_try_cmpxchg() would be sufficient, but that's a 'long' not a
> 'u32' ;^(

Yeah, later patches do try to use more local_try_cmpxchg() at different
parts. Even for the timestamp.

> 
>  - now you are exclusive for that thread, you're done, no more need
> for any atomic counter or percpu things

The trick and race with NMIs is, this needs to return that cookie for
both callers, where the cookie is the same number.

> 
> And then the next step is to just say "pending is the low bit of the
> id word" and having a separate 31-bit counter that gets incremented by
> "get_cookie()".
> 
> So then you end up with something like
> 
>   // New name for 'get_timestamp()'
>   get_current_cookie() { return current->unwind_info.cookie; }
>   // New name for 'get_cookie()':
>   // 31-bit counter by just leaving bit #0 alone
>   get_new_cookie() { current->unwind_info.cookie += 2; }
> 
> and then unwind_deferred_request() would do something like
> 
>   unwind_deferred_request()
>   {
>         int old, new;
> 
>         if (current->unwind_info.id)
>                 return 1;
> 
>         guard(irqsave)();
>         // For NMI, if we race with 'get_new_cookie()'
>         // we don't care if we get the old or the new one
>         old = 0; new = get_current_cookie() | 1;
>         if (!try_cmpxchg(&current->unwind_info.id, &old, new))
>                 return 1;
>         .. now schedule the thing with that cookie set.
> 
> Hmm?
> 
> But I didn't actually look at the users, so maybe I'm missing some
> reason why you want to have a separate per-cpu value.

Now we could just have the counter be the 32 bit cookie (old timestamp)
in the task struct, and have bit 0 be if it is valued or not.

static u32 get_cookie(struct unwind_task_info *info)
{
	u32 cnt = READ_ONCE(info->id);
	u32 new_cnt;

	if (cnt & 1)
		return cnt;

	new_cnt = cnt + 3;
	if (try_cmpxchg(&info->id, &cnt, new_cnt))
		return new_cnt;

	return cnt; // return info->id; would work too
}

Then when going back to user space:

	info->id &= ~1;

Now the counter is stored in the task struct and no other info is needed.

> 
> Or maybe I missed something else entirely, and the above is complete
> garbage and the ramblings of a insane mind.
> 
> It happens.
> 
> Off to get more coffee.

Enjoy, I just grabbed my last cup of the day.

-- Steve