Message-ID: <YmGF/OpIhAF8YeVq@lakrids>
Date: Thu, 21 Apr 2022 17:27:40 +0100
From: Mark Rutland <mark.rutland@....com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Wang ShaoBo <bobo.shaobowang@...wei.com>, cj.chengjian@...wei.com,
huawei.libin@...wei.com, xiexiuqi@...wei.com, liwei391@...wei.com,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
catalin.marinas@....com, will@...nel.org, zengshun.wu@...look.com
Subject: Re: [RFC PATCH -next v2 3/4] arm64/ftrace: support dynamically
allocated trampolines

On Thu, Apr 21, 2022 at 11:42:01AM -0400, Steven Rostedt wrote:
> On Thu, 21 Apr 2022 16:14:13 +0100
> Mark Rutland <mark.rutland@....com> wrote:
>
> > > Let's say you have 10 ftrace_ops registered (with bpf and kprobes this can
> > > be quite common). But each of these ftrace_ops traces a function (or
> > > functions) that are not being traced by the other ftrace_ops. That is, each
> > > ftrace_ops has its own unique function(s) that they are tracing. One could
> > > be tracing schedule, the other could be tracing ksoftirqd_should_run
> > > (whatever).
> >
> > Ok, so that's when messing around with bpf or kprobes, and not generally
> > when using plain old ftrace functionality under /sys/kernel/tracing/
> > (unless that's concurrent with one of the former, as per your other
> > reply) ?
>
> It's any user of the ftrace infrastructure, which includes kprobes, bpf,
> perf, function tracing, function graph tracing, and also affects instances.
>
> >
> > > Without this change, because the arch does not support dynamically
> > > allocated trampolines, it means that all these ftrace_ops will be
> > > registered to the same trampoline. That means, for every function that is
> > > traced, it will loop through all 10 of these ftrace_ops and check their
> > > hashes to see if their callback should be called or not.
> >
> > Sure; I can see how that can be quite expensive.
> >
> > What I'm trying to figure out is who this matters to and when, since the
> > implementation is going to come with a bunch of subtle/fractal
> > complexities, and likely a substantial overhead too when enabling or
> > disabling tracing of a patch-site. I'd like to understand the trade-offs
> > better.
> >
> > > With dynamically allocated trampolines, each ftrace_ops will have their own
> > > trampoline, and that trampoline will be called directly if the function
> > > is only being traced by the one ftrace_ops. This is much more efficient.
> > >
> > > If a function is traced by more than one ftrace_ops, then it falls back to
> > > the loop.
> >
> > I see -- so the dynamic trampoline is just to get the ops? Or is that
> > doing additional things?
>
> It's both to get the ftrace_ops (as that's one of the parameters) and to
> call the callback directly. Not sure if arm is affected by Spectre, but
> the "loop" function is filled with indirect function calls, whereas the
> dynamic trampolines call the callback directly.
>
> Instead of:
>
> 	bl	ftrace_caller
>
> ftrace_caller:
> 	[..]
> 	bl	ftrace_ops_list_func
> 	[..]
>
>
> void ftrace_ops_list_func(...)
> {
> 	__do_for_each_ftrace_ops(op, ftrace_ops_list) {
> 		if (ftrace_ops_test(op, ip))	// test the hash to see if it
> 						// should trace this
> 						// function.
> 			op->func(...);
> 	}
> }
>
> It does:
>
> 	bl	dynamic_tramp
>
> dynamic_tramp:
> 	[..]
> 	bl	func	// call the op->func directly!
>
>
> Much more efficient!
>
>
> >
> > There might be a middle-ground here where we patch the ftrace_ops
> > pointer into a literal pool at the patch-site, which would allow us to
> > handle this atomically, and would avoid the issues with out-of-range
> > trampolines.
>
> Have an example of what you are suggesting?

We can make the compiler place 2 NOPs before the function entry point, and 2
NOPs after it, using `-fpatchable-function-entry=4,2` (the arguments are
<total>,<before>). On arm64 all instructions are 4 bytes, and we'll use the
first two NOPs as an 8-byte literal pool.

Ignoring BTI for now, the compiler generates (with some magic labels added here
for demonstration):

__before_func:
	NOP
	NOP
func:
	NOP
	NOP
__remainder_of_func:
	...

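For reference, that layout can be reproduced standalone with an aarch64
toolchain; the snippet below is purely illustrative (it is not how the kernel
passes the flag):

/*
 * Illustration only: compile with
 *
 *   aarch64-linux-gnu-gcc -O2 -fpatchable-function-entry=4,2 -S demo.c
 *
 * and the generated assembly for demo_func() should show two NOPs placed
 * before the symbol and two NOPs at its start, matching the layout above.
 */
int demo_func(int x)
{
        return x + 1;
}
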
At ftrace_init_nop() time we patch that to:

__before_func:
	// treat the 2 NOPs as an 8-byte literal-pool
	.quad	<default ops pointer>	// see below
func:
	MOV	X9, X30
	NOP
__remainder_of_func:
	...

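In C, that ftrace_init_nop() step might look roughly like the sketch below.
This is illustrative only: 'default_ops' and 'write_literal_u64()' are made-up
names (the latter standing in for whatever text-patching helper is used to
update the literal, since kernel text is mapped read-only); the insn helpers
are the existing arm64 ones:

/* Sketch only; assumes <linux/ftrace.h> and <asm/insn.h>. */

extern struct ftrace_ops default_ops;	/* hypothetical fallback ops, see below */

/*
 * Assumes rec->ip points at the in-function NOP that later becomes the BL,
 * so the surrounding layout is:
 *
 *   rec->ip - 16 : __before_func (the 8-byte literal)
 *   rec->ip -  8 : func
 *   rec->ip -  4 : NOP -> MOV X9, X30
 *   rec->ip      : NOP (-> BL <trampoline> when enabled)
 */
static int sketch_init_nop(struct module *mod, struct dyn_ftrace *rec)
{
        unsigned long pc = rec->ip;
        void *literal = (void *)(pc - 4 * AARCH64_INSN_SIZE);
        u32 mov;

        /* 1) seed the literal with the default (list) ops pointer */
        write_literal_u64(literal, (u64)&default_ops);	/* made-up helper */

        /* 2) turn the first in-function NOP into MOV X9, X30 */
        mov = aarch64_insn_gen_move_reg(AARCH64_INSN_REG_9,
                                        AARCH64_INSN_REG_LR,
                                        AARCH64_INSN_VARIANT_64BIT);
        return aarch64_insn_patch_text_nosync((void *)(pc - AARCH64_INSN_SIZE),
                                              mov);
}
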
When enabling tracing we do:

__before_func:
	// patch this with the relevant ops pointer
	.quad	<ops pointer>
func:
	MOV	X9, X30
	BL	<trampoline>	// common trampoline
__remainder_of_func:
	...

The `BL <trampoline>` clobbers X30 with __remainder_of_func, so within
the trampoline we can find the ops pointer at an offset from X30. On
arm64 we can load that directly with something like:

	LDR	<tmp>, [X30, #-(__remainder_of_func - __before_func)]

... then load the ops->func from that and invoke it (or pass it to a
helper which does):

	// Ignoring the function arguments for this demonstration
	LDR	<tmp2>, [<tmp>, #OPS_FUNC_OFFSET]
	BLR	<tmp2>

That avoids iterating over the list *without* requiring separate
trampolines, and allows us to patch the sequence without stop-the-world
logic (since arm64 has strong requirements for patching most instructions
other than branches and nops).

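Concretely, the enable path could be as small as the sketch below (same
caveats as above: 'write_literal_u64()' is a made-up stand-in for the literal
write, and ordering/synchronization details are glossed over):

/* Sketch only: rec->ip is the NOP/BL patch-site as above. */
static int sketch_make_call(struct dyn_ftrace *rec, struct ftrace_ops *ops,
                            unsigned long trampoline)
{
        unsigned long pc = rec->ip;
        void *literal = (void *)(pc - 4 * AARCH64_INSN_SIZE);
        u32 bl;

        /* 1) publish the ops pointer the trampoline will load via X30 */
        write_literal_u64(literal, (u64)ops);	/* made-up helper */

        /* 2) only then turn the NOP into a BL to the common trampoline */
        bl = aarch64_insn_gen_branch_imm(pc, trampoline,
                                         AARCH64_INSN_BRANCH_LINK);
        return aarch64_insn_patch_text_nosync((void *)pc, bl);
}
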
We can initialize the ops pointer to a default ops that does the whole
__do_for_each_ftrace_ops() dance.
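
For the default, something like the sketch below would do (assuming
ftrace_ops_list_func(), or an equivalent wrapper, is visible to the arch
code; 'default_ops' is just an illustrative name):

/* Sketch: fallback ops whose callback does the existing list walk. */
static struct ftrace_ops default_ops = {
        .func = ftrace_ops_list_func,
};
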
To handle BTI we can have two trampolines, or we can always reserve 3 NOPs
before the function so that we can have a consistent offset regardless.

Thanks,
Mark.