linux-kernel - Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL

Message-ID: <87jzmdowp4.fsf@all.your.base.are.belong.to.us>
Date: Fri, 08 Mar 2024 11:16:23 +0100
From: Björn Töpel <bjorn@...nel.org>
To: Puranjay Mohan <puranjay12@...il.com>
Cc: Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt
 <palmer@...belt.com>, Albert Ou <aou@...s.berkeley.edu>, Steven Rostedt
 <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, Mark
 Rutland <mark.rutland@....com>, Sami Tolvanen <samitolvanen@...gle.com>,
 Guo Ren <guoren@...nel.org>, Ley Foon Tan <leyfoon.tan@...rfivetech.com>,
 Deepak Gupta <debug@...osinc.com>,
 Sia Jee Heng <jeeheng.sia@...rfivetech.com>,
 Björn Töpel <bjorn@...osinc.com>, Song Shuai <suagrfillet@...il.com>,
 Clément Léger <cleger@...osinc.com>, Al Viro <viro@...iv.linux.org.uk>,
 Jisheng Zhang <jszhang@...nel.org>,
 linux-riscv@...ts.infradead.org, linux-kernel@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

Puranjay Mohan <puranjay12@...il.com> writes:

> Hi Björn,
>
> On Thu, Mar 7, 2024 at 8:27 PM Björn Töpel <bjorn@...nel.org> wrote:
>>
>> Puranjay!
>>
>> Puranjay Mohan <puranjay12@...il.com> writes:
>>
>> > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
>> > This allows each ftrace callsite to provide an ftrace_ops to the common
>> > ftrace trampoline, allowing each callsite to invoke distinct tracer
>> > functions without the need to fall back to list processing or to
>> > allocate custom trampolines for each callsite. This significantly speeds
>> > up cases where multiple distinct trace functions are used and callsites
>> > are mostly traced by a single tracer.
>> >
>> > The idea and most of the implementation are taken from the ARM64
>> > implementation of the same feature. The idea is to place a pointer to
>> > the ftrace_ops as a literal at a fixed offset from the function entry
>> > point, which can be recovered by the common ftrace trampoline.
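A minimal user-space sketch of the "literal at a fixed offset from the
function entry point" idea, for readers following along. This is a model,
not kernel code: struct demo_ops, OPS_LITERAL_OFFSET and both helpers are
hypothetical names, and the offset used here is an assumption, not the
one used by the patch or by arm64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct ftrace_ops. */
struct demo_ops {
    int id;
};

/*
 * Assumed layout for this sketch: the ops pointer is stored in the
 * bytes immediately preceding the function entry point.
 */
#define OPS_LITERAL_OFFSET sizeof(struct demo_ops *)

/* Store the ops pointer in the patch-site area in front of `entry`. */
static void set_ops_literal(unsigned char *entry, struct demo_ops *ops)
{
    memcpy(entry - OPS_LITERAL_OFFSET, &ops, sizeof(ops));
}

/*
 * Model of what a common trampoline would do: given only the entry
 * address of the traced function, recover the per-callsite ops.
 */
static struct demo_ops *get_ops_literal(const unsigned char *entry)
{
    struct demo_ops *ops;

    memcpy(&ops, entry - OPS_LITERAL_OFFSET, sizeof(ops));
    return ops;
}
```

With a byte buffer standing in for kernel text, calling set_ops_literal()
on an entry pointer inside the buffer and then get_ops_literal() on the
same pointer returns the stored ops, with no list walk involved.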
>>
>> Not really a review, but some more background: another rationale (on top
>> of the improved per-call performance!) for CALL_OPS was to use it to
>> build ftrace direct call support (which BPF uses a lot!). Mark, please
>> correct me if I'm lying here!
>>
>> On Arm64, CALL_OPS makes it possible to implement direct calls, while
>> only patching one BL instruction -- nice!
>>
>> On RISC-V we cannot use the same ideas as Arm64 straight off,
>> because the range of jal (compared to BL) is simply too short (+/-1M).
>> So, on RISC-V we need to use a full auipc/jalr pair (the text patching
>> story is another chapter, but let's leave that aside for now). Since we
>> have to patch multiple instructions, the cmodx situation doesn't really
>> improve with CALL_OPS.
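To make the reach argument concrete, here is a small sketch of the
arithmetic, assuming RV64 and the standard immediate encodings (jal: 21-bit
signed, even offsets; auipc: upper 20 bits; jalr: 12-bit signed). The
function and struct names are made up for illustration and do not come
from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* jal reaches [-1 MiB, +1 MiB) and only even byte offsets. */
static bool fits_jal(int64_t offset)
{
    return offset >= -(INT64_C(1) << 20) &&
           offset < (INT64_C(1) << 20) &&
           offset % 2 == 0;
}

/* auipc+jalr reaches roughly +/-2 GiB (RV64, sign-extended auipc). */
static bool fits_auipc_jalr(int64_t offset)
{
    return offset >= -(INT64_C(1) << 31) - 2048 &&
           offset < (INT64_C(1) << 31) - 2048;
}

/* The two immediates that have to be patched for a far call. */
struct far_call {
    int32_t hi20; /* auipc immediate, as a signed value */
    int32_t lo12; /* jalr immediate, in [-2048, 2047] */
};

/* Split a PC-relative offset so that hi20 * 4096 + lo12 == offset. */
static struct far_call split_offset(int64_t offset)
{
    struct far_call fc;
    int64_t lo = offset % 4096; /* truncates toward zero */

    assert(fits_auipc_jalr(offset));

    /* Fold lo into the signed 12-bit range; hi20 absorbs the carry. */
    if (lo >= 2048)
        lo -= 4096;
    else if (lo < -2048)
        lo += 4096;

    fc.lo12 = (int32_t)lo;
    fc.hi20 = (int32_t)((offset - lo) / 4096);
    return fc;
}
```

Any target for which fits_jal() is false needs both immediates from
split_offset(), which is why two instructions end up being patched.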
>>
>> Let's say that we continue building on your patch and implement direct
>> calls on CALL_OPS for RISC-V as well.
>>
>> From Florent's commit message for direct calls:
>>
>>   |    There are a few cases to distinguish:
>>   |    - If a direct call ops is the only one tracing a function:
>>   |      - If the direct called trampoline is within the reach of a BL
>>   |        instruction
>>   |         -> the ftrace patchsite jumps to the trampoline
>>   |      - Else
>>   |         -> the ftrace patchsite jumps to the ftrace_caller trampoline which
>>   |            reads the ops pointer in the patchsite and jumps to the direct
>>   |            call address stored in the ops
>>   |    - Else
>>   |      -> the ftrace patchsite jumps to the ftrace_caller trampoline and its
>>   |         ops literal points to ftrace_list_ops so it iterates over all
>>   |         registered ftrace ops, including the direct call ops and calls its
>>   |         call_direct_funcs handler which stores the direct called
>>   |         trampoline's address in the ftrace_regs and the ftrace_caller
>>   |         trampoline will return to that address instead of returning to the
>>   |         traced function
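The cases in the commit message above can be sketched as a small decision
function. This only restates the quoted text; the enum and function names
are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Where the ftrace patchsite ends up pointing, per the quoted cases. */
enum patch_target {
    /* Patchsite jumps straight to the direct called trampoline. */
    TARGET_DIRECT_TRAMPOLINE,
    /* ftrace_caller reads the ops literal and jumps to the direct
     * call address stored in the ops. */
    TARGET_CALLER_OPS_DIRECT,
    /* ftrace_caller with the ops literal pointing at ftrace_list_ops,
     * which iterates over all registered ops. */
    TARGET_CALLER_LIST_OPS,
};

static enum patch_target choose_target(bool only_direct_ops,
                                       bool trampoline_in_reach)
{
    if (only_direct_ops) {
        if (trampoline_in_reach)
            return TARGET_DIRECT_TRAMPOLINE;
        return TARGET_CALLER_OPS_DIRECT;
    }
    return TARGET_CALLER_LIST_OPS;
}
```

If trampoline_in_reach is always true, TARGET_CALLER_OPS_DIRECT is never
returned.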
>>
>> On RISC-V, where auipc/jalr is used, the direct called trampoline would
>> always be reachable, and the first Else-clause would never be entered.
>> This means the performance for direct calls would be the same as the
>> one we have today (i.e. no regression!).
>>
>> RISC-V does (roughly) what x86 does -- patch multiple instructions, long
>> reach.
>>
>> Arm64 uses CALL_OPS and patches a single BL instruction.
>>
>> Now, with this background in mind, compared to what we have today,
>> CALL_OPS would give us (again assuming we're using it for direct calls):
>>
>> * Better per-call tracer performance (faster ops lookup) GOOD
>
> ^ this was the only motivation for me to implement this patch.
>
> I don't think implementing direct calls over call ops is fruitful for
> RISC-V because once the auipc/jalr can be patched atomically, the direct
> call trampoline is always reachable. Solving the atomic text patching
> problem would be fun!! I am eager to see how it will be solved.

Given the upcoming Zjid spec, we'll soon be in a much better place where
we can reason about cmodx.

>> * Larger text size (function alignment + extra nops) BAD
>> * Same direct call performance NEUTRAL
>> * Same complicated text patching required NEUTRAL
>>
>> It would be interesting to see how the per-call performance would
>> improve on x86 with CALL_OPS! ;-)
>
> If I remember correctly from Steven's talk, x86 uses dynamically allocated
> trampolines for per-callsite tracers. Would CALL_OPS provide better
> performance than that?

Probably not, and it was really a tongue-in-cheek comment -- nothing I
encourage you to do!

Now, I think a better approach for RISC-V would be to implement what x86
has (arch_ftrace_update_trampoline()) rather than CALL_OPS.

Thoughts?


Björn