linux-kernel - Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzYd5OT9COBS4va435jqMzkjvvAHbe55AR6giv8pitUvAg@mail.gmail.com>
Date: Thu, 31 Oct 2024 14:22:48 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
Cc: x86@...nel.org, Peter Zijlstra <peterz@...radead.org>, 
	Steven Rostedt <rostedt@...dmis.org>, Ingo Molnar <mingo@...nel.org>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, linux-kernel@...r.kernel.org, 
	Indu Bhagat <indu.bhagat@...cle.com>, Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Namhyung Kim <namhyung@...nel.org>, Ian Rogers <irogers@...gle.com>, 
	Adrian Hunter <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org, 
	Mark Brown <broonie@...nel.org>, linux-toolchains@...r.kernel.org, 
	Jordan Rome <jordalgo@...a.com>, Sam James <sam@...too.org>, linux-trace-kernel@...r.kerne.org, 
	Jens Remus <jremus@...ux.ibm.com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, 
	Florian Weimer <fweimer@...hat.com>, Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API

On Tue, Oct 29, 2024 at 11:10 PM Josh Poimboeuf <jpoimboe@...nel.org> wrote:
>
> On Tue, Oct 29, 2024 at 04:32:46PM -0700, Andrii Nakryiko wrote:
> > >  struct audit_context;
> > > @@ -1592,6 +1593,10 @@ struct task_struct {
> > >         struct user_event_mm            *user_event_mm;
> > >  #endif
> > >
> > > +#ifdef CONFIG_UNWIND_USER
> > > +       struct unwind_task_info         unwind_task_info;
> >
> > this is quite a lot of memory to pay on each task, a lot of which a)
> > might not have sframe
>
> Frame pointers are also supported.
>
> > and b) might not need stack unwinding during their lifetime.
>
> Right, I'm not super happy about that.
>
> > It can be a pointer and allocated in copy_process(), no?
> > Though ideally this should be initialized lazily, if possible.
>
> Problem is, the unwinder doesn't know in advance which tasks will be
> unwound.
>
> Its first clue is unwind_user_register(), would it make sense for the
> caller to clarify whether all tasks need to be unwound or only a
> specific subset?
>
> Its second clue is unwind_user_deferred(), which is called for the task
> itself.  But by then it's too late because it needs to access the
> per-task data from (potentially) irq context so it can't do a lazy
> allocation.
>
> I'm definitely open to ideas...

The laziest thing would be to perform GFP_ATOMIC allocation, and if
that fails, oops, too bad, no stack trace for you (but, generally
speaking, no big deal). Advantages are clear, though, right? Single
pointer in task_struct, which most of the time will be NULL, so no
unnecessary overheads.

>
> > > +       if (!info->entries) {
> > > +               info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
> > > +                                       GFP_KERNEL);
> > > +               if (!info->entries)
> > > +                       return;
> >
> > uhm... can we notify callbacks that stack capture failed? otherwise
> > we'd need some extra timeouts and other complications if we are
> > *waiting* for this callback to be called
>
> Hm, do you actually plan to wait for the callback?
>
> I assume this is BPF, can you give some high-level detail about how it
> will use these interfaces?

So I have a presentation about integrating async stack trace capture
into BPF at last LSF/MM/BPF, LWN.net has an article ([0]), which you
might want to skim through, not sure if that will be helpful.

But very briefly, the way I see it, we'll have a workflow similar to
the following:

1. BPF program will call some new API to request stack trace capture:
bpf_get_stack_async(...)
  - it will pass a reference to BPF ringbuf into which captured stack
trace should go
  - this API will return unique ID (similar to your cookie, but I'll
need a unique cookie for each call to bpf_get_stack_async() call, even
if all of them capture the same stack trace; so what Mathie is
proposing with coalescing all requests and triggering one callback
isn't very convenient in this case, but we can probably work around
that in BPF-specific way)
2. bpf_get_stack_async() calls unwind_user_deferred() and it expects
one of two outcomes:
  - either successful capture, at which point we'll submit data into
BPF ringbuf with that unique ID for user to correlate whatever data
was captured synchronously and stack trace that arrived later
  - OR we failed to get a stack trace, and we'll still put a small
piece of information for user to know that stack trace is never
coming.

It's the last point that's important to make usability so much
simpler, avoiding unnecessary custom timeouts and stuff like that.
Regardless whether stack trace capture is success or not, user is
guaranteed to get a "notification" about the outcome.

Hope this helps.

But basically, if I I called unwind_user_deferred(), I expect to get
some callback, guaranteed, with the result or failure. The only thing
that's not guaranteed (and which makes timeouts bad) is *when* this
will happen. Because stack trace capture can be arbitrarily delayed
and stuff. That's fine, but that also shows why timeout is tricky and
necessarily fragile.


  [0] https://lwn.net/Articles/978736/

>
> --
> Josh