Message-ID: <74afc8d4-621a-4876-a8cf-6165a913e4b3@efficios.com>
Date: Wed, 30 Oct 2024 09:44:14 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
Cc: Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
Steven Rostedt <rostedt@...dmis.org>, Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>, linux-kernel@...r.kernel.org,
Indu Bhagat <indu.bhagat@...cle.com>, Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
Ian Rogers <irogers@...gle.com>, Adrian Hunter <adrian.hunter@...el.com>,
linux-perf-users@...r.kernel.org, Mark Brown <broonie@...nel.org>,
linux-toolchains@...r.kernel.org, Jordan Rome <jordalgo@...a.com>,
Sam James <sam@...too.org>, linux-trace-kernel@...r.kernel.org,
Andrii Nakryiko <andrii.nakryiko@...il.com>,
Jens Remus <jremus@...ux.ibm.com>, Florian Weimer <fweimer@...hat.com>,
Andy Lutomirski <luto@...nel.org>
Subject: Re: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
On 2024-10-29 14:34, Josh Poimboeuf wrote:
> On Tue, Oct 29, 2024 at 01:47:59PM -0400, Mathieu Desnoyers wrote:
>>>> If ftrace needs brain damage like this, can't we push this to the user?
>>>>
>>>> That is, do away with the per-cpu sequence crap, and add a per-task
>>>> counter that is incremented for every return-to-userspace.
>>>
>>> That would definitely make things easier for me, though IIRC Steven and
>>> Mathieu had some concerns about TID wrapping over days/months/years.
>>>
>>> With that mindset I suppose the per-CPU counter could also wrap, though
>>> that could be mitigated by making the cookie a struct with more bits.
>>>
>>
>> AFAIR, the scheme we discussed in Prague was different than the
>> implementation here.
>
> It does differ a bit. I'll explain why below.
>
>> We discussed having a free-running counter per-cpu, and combining it
>> with the cpu number as top (or low) bits, to effectively make a 64-bit
>> value that is unique across the entire system, but without requiring a
>> global counter with its associated cache line bouncing.
>>
>> Here is part where the implementation here differs from our discussion:
>> I recall we discussed keeping a snapshot of the counter value within
>> the task struct of the thread. So we only snapshot the per-cpu value
>> on first use after entering the kernel, and after that we use the same
>> per-cpu value snapshot (from task struct) up until return to userspace.
>> We clear the task struct cookie snapshot on return to userspace.
>
> Right, so adding some details to this, what I remember specifically
> deciding on:
>
> - In unwind_user_deferred(), if task cookie is 0, it snapshots the
> per-cpu counter, stores the old value in the task cookie, and
> increments the new value (with CPU # in top 48 bits).
>
> - Future calls to unwind_user_deferred() see the task cookie is set and
> reuse the same cookie.
>
> - Then the task work (which runs right before exiting the kernel)
> clears the task cookie (with interrupts disabled), does the unwind,
> and calls the callbacks. It clears the task cookie so that any
> future calls to unwind_user_deferred() (either before exiting the
> kernel or after next entry) are guaranteed to get an unwind.
This is where I think we should improve the logic.
If unwind_user_deferred() is called while an unwind is already in
progress or has just completed, I don't think we want to issue another
stack walk: it would be redundant with the one already in progress or
completed.
What we'd want is to make sure that the cookie returned by that
unwind_user_deferred() call is the same as the cookie associated with
the in-progress or completed stack unwind. This way, trace analysis
can look at the events surrounding the unwind request (before and
after it) and find the associated call stack.
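Concretely, something along these lines (a rough sketch, not the
actual patch: the unwind_cookie field, the unwind_counter per-cpu
variable and the queue_unwind_task_work() helper are made-up names,
and the 48/16 split between CPU number and counter bits is arbitrary
here):

static DEFINE_PER_CPU(u64, unwind_counter);

u64 unwind_user_deferred(void)
{
	u64 cookie = current->unwind_cookie;

	/* An unwind is already in progress or completed: reuse it. */
	if (cookie)
		return cookie;

	/* First request since kernel entry: lazily snapshot. */
	cookie = ((u64)smp_processor_id() << 48) |
		 this_cpu_inc_return(unwind_counter);
	current->unwind_cookie = cookie;
	queue_unwind_task_work();
	return cookie;
}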
>
> That's what I had implemented originally. It had a major performance
> issue, particularly for long stacks (bash sometimes had 300+ stack
> frames in my testing).
That's probably because long stacks require a lot of work (user-page
accesses) when stack walking, which increases the likelihood of
re-issuing unwind_user_deferred() while the stack walk is being done.
That's basically a lack-of-progress issue: a sufficiently long stack
walk with a sufficiently frequent unwind_user_deferred() trigger could
make the system stall forever trying to service stack walks again and
again. That's bad.
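In other words, the failure mode looks like this (schematic
timeline):

  task work (IRQs on): start walking 300+ user stack frames
  PMU interrupt -> unwind_user_deferred() re-issued, another
                   unwind gets scheduled
  task work: walk the same 300+ frames again
  PMU interrupt -> ... and so on, never reaching userspace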
>
> The task work runs with interrupts enabled. So if another PMU interrupt
> and call to unwind_user_deferred() happens after the task work clears
> the task cookie but before kernel exit, a new cookie is generated and an
> additional task work is scheduled. For long stacks, performance gets
> really bad, dominated by all the extra unnecessary unwinding.
What you want here is to move the point where you clear the task
cookie to _after_ completion of stack unwind. There are a few ways
this can be done:
A) Clear the task cookie in the task_work() right after the deferred
unwind completes. Downside: if some long task work happens to be done
after the stack walk, a new unwind_user_deferred() could be issued
again and we may end up looping forever performing stack unwinds,
never actually making forward progress.
B) Clear the task cookie after the exit_to_user_mode_loop is done,
before returning to user-space, while interrupts are disabled.
C) Clear the task cookie when entering kernel space again.
I think (B) and (C) could be interesting approaches, with (B)
perhaps slightly more interesting because it cleans up after itself
before returning to userspace. (B) also allows us to return to a
purely lazy scheme where there is nothing to do when entering the
kernel. This is therefore more efficient in terms of cache locality,
because we can expect our per-task state to be cache-hot when
returning to userspace, but not necessarily when entering the
kernel.
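For instance, (B) could be hooked into the tail of the generic entry
code, roughly as follows (a heavily trimmed sketch; the unwind_cookie
field is a made-up name for the per-task snapshot):

static void exit_to_user_mode_prepare(struct pt_regs *regs)
{
	unsigned long ti_work = read_thread_flags();

	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
		ti_work = exit_to_user_mode_loop(regs, ti_work);

	/*
	 * Interrupts are disabled here and all task work
	 * (including the deferred unwind) has run, so clearing
	 * the snapshot cannot re-trigger a redundant walk before
	 * we reach userspace.
	 */
	current->unwind_cookie = 0;
}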
>
> So I changed the design a bit:
>
> - Increment a per-cpu counter at kernel entry before interrupts are
> enabled.
>
> - In unwind_user_deferred(), if task cookie is 0, it sets the task
> cookie based on the per-cpu counter, like before. However if this
> cookie was the last one used by this callback+task, it skips the
> callback altogether.
>
> So it eliminates duplicate unwinds except for the CPU migration case.
This sounds complicated and fragile. And why would we care about
duplicated unwinds on CPU migration?
What prevents us from moving the per-task cookie clearing to after
exit_to_user_mode_loop instead? Then there is no need to increment a
per-cpu counter on every kernel entry and we can go back to a lazy
scheme.
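To spell out the resulting lifecycle of the lazy scheme:

  kernel entry            -> nothing to do
  unwind_user_deferred()  -> cookie == 0: snapshot the per-cpu
                             counter into the task struct, queue
                             the unwind task work
  unwind_user_deferred()  -> cookie != 0: reuse it, no new walk
  task work               -> walk the user stack, invoke callbacks
  return to userspace     -> clear the task cookie with IRQs off

A migration between CPUs changes nothing here: the cookie lives in
the task struct, so the same value follows the task until it is
cleared on exit to userspace.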
>
> If I change the entry code to increment a per-task counter instead of a
> per-cpu counter then this problem goes away. I was just concerned about
> the performance impacts of doing that on every entry.
Moving from per-cpu to per-task makes this cookie task-specific
rather than global, and I don't think we want that for a stack
walking infrastructure meant to be used by various tracers. A global
cookie is also more robust: it does not depend on all the trace data
being present to accurately establish the current thread ID, and thus
to match cookies between deferred unwind requests and their
fulfillment.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com