Message-ID: <5477E82A.3020208@hitachi.com>
Date: Fri, 28 Nov 2014 12:12:42 +0900
From: Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>
To: "Jon Medhurst (Tixy)" <tixy@...aro.org>
Cc: Wang Nan <wangnan0@...wei.com>, linux@....linux.org.uk,
will.deacon@....com, taras.kondratiuk@...aro.org,
ben.dooks@...ethink.co.uk, cl@...ux.com, rabin@....in,
davem@...emloft.net, lizefan@...wei.com,
linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH v10 2/2] ARM: kprobes: enable OPTPROBES for ARM 32
(2014/11/27 23:36), Jon Medhurst (Tixy) wrote:
> On Fri, 2014-11-21 at 14:35 +0800, Wang Nan wrote:
>> This patch introduces kprobeopt for ARM 32.
>
> If I've understood things correctly, this is a feature which inserts
> probes by using a branch instruction to some trampoline code rather than
> using an undefined instruction as a breakpoint. That way we avoid the
> overhead of processing the exception and it is this performance
> improvement which is the main/only reason for implementing it?
>
> If so, I thought it good to see what kind of improvement we get by
> running the micro benchmarks in the kprobes test code. On an A7/A15
> big.LITTLE vexpress board the approximate figures I get are 0.3us for
> optimised probe, 1us for un-optimised, so a three times performance
> improvement. This is with an empty probe pre-handler and no post
> handler, so with a more realistic use case the relative improvement we
> get from optimisation would be less.
Indeed. I think we'd better use ftrace to measure performance, since
it is the most realistic use case. On x86 we see similar numbers, and
ftrace itself takes 0.3-0.4us to record an event, so I guess the
overall speedup would be about 2 times. (Of course it depends on the
SoC, because memory bandwidth is the key to event-recording
performance.)
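For reference, the kind of micro-benchmark I mean can be as simple as
the sketch below (my own illustration, not the actual kprobes test
code; kprobe_target(), NR_CALLS and the reporting format are all made
up). Run it once bare, once with a normal kprobe on kprobe_target(),
and once with optimization allowed, and compare the per-call cost:

#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/math64.h>

#define NR_CALLS	1000000

/* Made-up probe target; noinline so the calls are not folded away. */
static noinline int kprobe_target(int v)
{
	return v + 1;
}

static void measure_probe_cost(void)
{
	ktime_t start, end;
	u64 delta;
	int i, sum = 0;

	start = ktime_get();
	for (i = 0; i < NR_CALLS; i++)
		sum += kprobe_target(i);	/* keep the loop alive */
	end = ktime_get();

	delta = ktime_to_ns(ktime_sub(end, start));
	do_div(delta, NR_CALLS);	/* 64-bit divide, safe on ARM32 */
	pr_info("%llu ns per call (sum=%d)\n", delta, sum);
}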
> I thought it good to see what sort of benefits this code achieves,
> especially as it could grow quite complex over time, and the cost of
> that versus the benefit should be considered.
I don't think it's so complex; it's actually cleanly separated.
However, the ARM tree should have an arch/arm/kernel/kprobe/ directory,
since there are too many kprobe-related files under arch/arm/kernel/ ...
>>
>> Limitations:
>> - Currently only a kernel compiled with the ARM ISA is supported.
>
> Supporting Thumb will be very difficult because I don't believe that
> putting a branch into an IT block could be made to work, and you can't
> feasibly know if an instruction is in an IT block other than by first
> using something like the breakpoint probe method and then, when that is
> hit, examining the IT flags to see if they're set. If they aren't, you
> could then change the probe to an optimised probe. Is transforming the
> probe type like that currently supported by the generic kprobes code?
The optprobe framework optimizes probes transparently. If a probe
cannot be optimized, it simply does nothing to it and the probe stays
a normal breakpoint-based kprobe.
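To make that concrete, here is a rough sketch of the arch side (my
illustration, not the actual patch code; can_optimize() is a
hypothetical helper and the hook's exact signature can differ between
kernel versions):

#include <linux/kprobes.h>

/* Hypothetical check: can the insn at addr be replaced by a branch? */
static bool can_optimize(kprobe_opcode_t *addr)
{
	/* e.g. decode the insn, check branch range, IT-block rules... */
	return true;	/* placeholder */
}

int arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
{
	if (!can_optimize(op->kp.addr))
		return -EILSEQ;	/* generic code keeps the breakpoint probe */

	/* ...allocate the optinsn slot and assemble the trampoline... */
	return 0;
}

If the prepare hook fails, the generic code never marks the probe as
opt-ready, so it keeps working as a normal breakpoint-based kprobe.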
> Also, the Thumb branch instruction can only jump half as far as the ARM
> mode one. And being 32 bits when a lot of the instructions people will
> want to probe are 16 bits will be an additional problem, similar to the
> one identified below for ARM instructions...
>
>
>>
>> - Offset between probe point and optinsn slot must not be larger than
>>   32MiB.
>
>
> I see that elsewhere [1] people are working on supporting loading kernel
> modules at locations that are out of the range of a branch instruction,
> I guess because with multi-platform kernels and general code bloat
> kernels are getting too big. The same reasons would impact the usability
> of optimized kprobes as well if they're restricted to the range of a
> single branch instruction.
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-November/305539.html
>
>
>> Masami Hiramatsu suggests replacing 2 words, but that will make
>> things complex. A further patch can make such an optimization.
>
> I'm wondering how we can replace 2 words if we can't determine whether
> the second word is the target of a branch instruction?
On x86, we already have an instruction decoder for finding branch
targets :). But yes, it can be impossible on other architectures if
the code uses indirect branches intensively.
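For illustration only (this is not in the patch): a conservative scan
for direct B/BL branches targeting a given word could look like the
sketch below. Indirect branches (bx, ldr pc, mov pc, ...) are exactly
what such a scan cannot see, and literal-pool data can alias a branch
encoding, so a real implementation would need the full decoder:

#include <linux/types.h>

/* Sketch: does any B/BL in [start, end) branch to 'target'? */
static bool is_direct_branch_target(u32 *start, u32 *end, u32 *target)
{
	u32 *p;

	for (p = start; p < end; p++) {
		u32 insn = *p;
		s32 offset;

		/* B/BL encoding: cond 101x imm24 */
		if ((insn & 0x0e000000) != 0x0a000000)
			continue;

		/* sign-extend imm24, then scale to a byte offset */
		offset = ((s32)(insn << 8)) >> 6;

		/* an ARM branch target is pc + 8 + offset */
		if ((u32 *)((unsigned long)p + 8 + offset) == target)
			return true;
	}
	return false;
}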
> E.g. if we had
>
> b after_probe
> ...
> probe_me: mov r2, #0
> after_probe: ldr r0, [r1]
>
> and we inserted a two word probe at probe_me, then the branch to
> after_probe would be to the second half of that 2 word probe. Guess that
> could be worked around by ensuring the 2nd word is an invalid
> instruction and trapping that case, then emulating after_probe like we
> do for unoptimised probes. This assumes that we can come up with a
> suitable encoding for a 2 word 'long branch'. (For Thumb, I suspect
> that we would need at least 3 16-bit instructions to achieve that.)
>
> As the commit message says, this "will make things complex", and I
> begin to wonder whether the extra complexity would be worth the
> benefits. (Considering that the resulting optimised probe would only be
> around twice as fast.)
>
>
>>
>> Kprobe opt on ARM is relatively simpler than kprobe opt on x86 because
>> an ARM instruction is always 4 bytes aligned and 4 bytes long. This
>> patch replaces the probed instruction with a 'b' branch to trampoline
>> code, which then calls optimized_callback(). optimized_callback() calls
>> opt_pre_handler() to execute the kprobe handler, and it also
>> emulates/simulates the replaced instruction.
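BTW, for readers unfamiliar with optprobes, that flow looks roughly
like this (modeled on the x86 optimized_callback(); the ARM patch's
exact details may differ):

static void optimized_callback(struct optimized_kprobe *op,
			       struct pt_regs *regs)
{
	unsigned long flags;

	local_irq_save(flags);
	if (kprobe_running()) {
		/* Reentered from another probe: only count the miss. */
		kprobes_inc_nmissed_count(&op->kp);
	} else {
		/* Make regs look as if we stopped at the probe point. */
		regs->ARM_pc = (unsigned long)op->kp.addr;
		__this_cpu_write(current_kprobe, &op->kp);
		get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
		opt_pre_handler(&op->kp, regs);
		__this_cpu_write(current_kprobe, NULL);
	}
	local_irq_restore(flags);
}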
>>
>> When unregistering a kprobe, the deferred manner of the unoptimizer
>> may leave the branch instruction in place before the optimizer is
>> called. Different from x86_64, which only copies the probed insn to
>> after optprobe_template_end and re-executes it, this patch calls
>> singlestep to emulate/simulate the insn directly. A further patch can
>> optimize this behavior.
>>
>> Signed-off-by: Wang Nan <wangnan0@...wei.com>
>> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>
>> Cc: Jon Medhurst (Tixy) <tixy@...aro.org>
>> Cc: Russell King - ARM Linux <linux@....linux.org.uk>
>> Cc: Will Deacon <will.deacon@....com>
>>
>> ---
>
> I initially had some trouble testing this. I tried running the kprobes
> test code with some printf's added to the code and it seems that only
> very rarely are optimised probes actually executed. This turned out to
> be due to the optimization being run as a background task after a delay.
> So I ended up hacking kernel/kprobes.c to force some calls to
> wait_for_kprobe_optimizer(). It would be nice to have the test code
> robustly cover both optimised and unoptimised cases, but that would
> need some new exported functions from the generic kprobes code; not
> sure what people think of that idea?
Hm, did you use ftrace's kprobe events?
You can actually add kprobes via /sys/kernel/debug/tracing/kprobe_events and
see what kprobes are optimized via /sys/kernel/debug/kprobes/list.
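For example (assuming debugfs is mounted at /sys/kernel/debug; the
event name 'myprobe' and the vfs_read target are arbitrary choices):

  echo 'p:myprobe vfs_read' > /sys/kernel/debug/tracing/kprobe_events
  grep vfs_read /sys/kernel/debug/kprobes/list

Optimized probes are tagged [OPTIMIZED] in that list.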
For more information, please refer to
Documentation/trace/kprobetrace.txt
Documentation/kprobes.txt
Thank you,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@...achi.com