Message-ID: <CAFTtA3NjHibAJcdfLRD2ybf=Lb_D7OtR0V_S+3TVPVH_543cdQ@mail.gmail.com>
Date: Mon, 6 Jan 2025 23:22:34 +0800
From: Andy Chiu <andybnac@...il.com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Björn Töpel <bjorn@...nel.org>,
Paul Walmsley <paul.walmsley@...ive.com>, Palmer Dabbelt <palmer@...belt.com>,
Albert Ou <aou@...s.berkeley.edu>, Nathan Chancellor <nathan@...nel.org>,
Nick Desaulniers <ndesaulniers@...gle.com>, Bill Wendling <morbo@...gle.com>,
Justin Stitt <justinstitt@...gle.com>, linux-kernel@...r.kernel.org,
linux-riscv@...ts.infradead.org, llvm@...ts.linux.dev, bjorn@...osinc.com,
puranjay12@...il.com, alexghiti@...osinc.com, yongxuan.wang@...ive.com,
greentime.hu@...ive.com, nick.hu@...ive.com, nylon.chen@...ive.com,
tommy.wu@...ive.com, eric.lin@...ive.com, viccent.chen@...ive.com,
zong.li@...ive.com, samuel.holland@...ive.com
Subject: Re: [PATCH v3 0/7] riscv: ftrace: atomic patching and preempt improvements
On Mon, Dec 30, 2024 at 3:08 AM Andy Chiu <andybnac@...il.com> wrote:
>
> On Tue, Dec 24, 2024 at 11:15 AM Steven Rostedt <rostedt@...dmis.org> wrote:
> >
> > On Wed, 27 Nov 2024 22:25:57 +0100
> > Björn Töpel <bjorn@...nel.org> wrote:
> >
> > > Adding Steven.
> >
> > And this has been in my draft folder for almost a month :-p
> >
> > I kept coming back to this email, but then got distracted by something else.
> >
> > >
> > > Andy Chiu <andybnac@...il.com> writes:
> > >
> > > > This series makes atomic code patching possible in riscv ftrace. A
> > > > direct benefit is that we can get rid of stop_machine() when patching
> > > > function entries. It also makes it possible to run ftrace with full
> > > > kernel preemption. Before this series, the kernel initializes
> > > > patchable function entries to NOP4 + NOP4. To start tracing, it
> > > > updates entries to AUIPC + JALR while holding other cores in
> > > > stop_machine(). stop_machine() is required because two instructions
> > > > cannot be updated and observed atomically, and preemption must be
> > > > prevented because kernel preemption allows a process to be scheduled
> > > > out while executing on one of these instruction pairs.
> > > >
> > > > This series addresses the problem by initializing the first NOP4 to
> > > > AUIPC, so atomic patching becomes possible: the kernel only has to
> > > > update one instruction, and as long as that instruction is naturally
> > > > aligned it is expected to be updated atomically.
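> > > >
> > > > A rough sketch of the entry layout (the register choice, relocation
> > > > syntax, and offsets below are illustrative, not the exact generated
> > > > code):
> > > >
> > > >   /* Before: both slots are live-patched, hence stop_machine() */
> > > >   func:
> > > >           nop                 /* <-> auipc t0, %hi(ftrace_caller)     */
> > > >           nop                 /* <-> jalr  t0, %lo(ftrace_caller)(t0) */
> > > >
> > > >   /* After: AUIPC is written once at init; only one slot toggles */
> > > >   func:
> > > >           auipc   t0, %hi(ftrace_caller)  /* fixed after init */
> > > >           nop     /* <-> jalr t0, %lo(ftrace_caller)(t0): a single
> > > >                      naturally aligned 4-byte update, hence atomic */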
> > > >
> > > > However, after applying this series, the address range of the ftrace
> > > > trampoline is limited to +-2K from ftrace_caller. This issue is
> > > > expected to be solved by Puranjay's CALL_OPS approach, which adds 8B
> > > > of naturally aligned data in front of each patchable function and can
> > > > use it to direct execution out to any custom trampoline.
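> > > >
> > > > (For reference, a rough sketch of what a CALL_OPS layout could look
> > > > like, following the arm64 design; the literal placement and its use
> > > > below are my assumptions:)
> > > >
> > > >           .align  3
> > > >           .quad   <ftrace_ops ptr>   /* 8B literal patched per site; */
> > > >   func:                              /* the common trampoline loads  */
> > > >           auipc   t0, ...            /* it from (func - 8) and can   */
> > > >           jalr    t0, ...            /* branch to any custom target  */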
> > > >
> > > > The series is composed of three parts. The first cleans up the
> > > > existing issues when the kernel is compiled with clang. The second
> > > > modifies the ftrace code patching mechanism (2-4) as mentioned above.
> > > > The third prepares ftrace to run with full kernel preemption (5, 6).
> > > >
> > > > An issue with an ongoing fix:
> > > >
> > > > Since there is no room for marking *kernel_text_address as notrace[1]
> > > > at the source code level, there is a significant performance
> > > > regression when using function_graph with TRACE_IRQFLAGS enabled: as
> > > > many as 8 graph handlers can be called on each function entry. The
> > > > current workaround requires us to echo "*kernel_text_address" into
> > > > set_ftrace_notrace before starting the trace, as shown below. However,
> > > > we observed that in some cases the kernel still enables the patch
> > > > sites even with *kernel_text_address properly added to that file.
> > > > While the root cause is still under investigation, we consider that it
> > > > should not hold back the code patching part, in order to unblock the
> > > > call_ops work.
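> > > >
> > > > Concretely, the workaround looks like this (assuming tracefs is
> > > > mounted at /sys/kernel/tracing):
> > > >
> > > >   echo '*kernel_text_address' > /sys/kernel/tracing/set_ftrace_notrace
> > > >   echo function_graph > /sys/kernel/tracing/current_tracer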
> > >
> > > Maybe Steven knows this off the top of his head!
> > >
> > > As Andy points out, "*kernel_text_address" is used in the stack
> > > unwinding on RISC-V. So, if you do tracing without filtering *and*
> > > with TRACE_IRQFLAGS, one will drown in traces.
> >
> > I tested set_ftrace_notrace on x86 and the function graph tracer does honor
> > it. I wonder if there's a kernel
>
> After checking the log buffer, I can confirm that riscv does also
> honor set_ftrace_notrace: it does not print "*kernel_text_address"
> entries in the trace buffer. Sorry for not making this clear.
>
> However, the problem is that the patch sites of "*kernel_text_address"
> were enabled, allowing the code to flow into ftrace and parts of
> fgraph. In particular, with TRACE_IRQFLAGS enabled, these functions are
> called extensively, causing the significant slowdown.
>
> IIUC, it is reasonable to enable a patch site if there is at least one
> tracing instance that requests the function. However, in another
> experiment, the patch sites of "*kernel_text_address" are enabled
> even when all instances have them set as notrace. Here is a way to
> reproduce it:
>
> cd /sys/kernel/
> mount -t tracefs t tracing/
> cd tracing/
> mkdir instances/1
> mkdir instances/2
> cd instances/1/
> echo *kernel_text_address > set_ftrace_notrace
> echo *sched* > set_ftrace_filter # this is just to make the kernel more responsive
> cat set_ftrace_notrace
> echo function_graph > current_tracer
> cd ../2/
> echo *kernel_text_address > set_ftrace_notrace
> cat set_ftrace_notrace
> echo function_graph > current_tracer
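>
> To check whether those call sites ended up patched, one can grep the
> records listed in enabled_functions (back in the top-level tracing
> directory), e.g.:
>
> grep kernel_text_address enabled_functions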
>
> To figure this out, it is probably worth trying to understand why the
> patch sites are enabled after the above sequence:
>
> It seems that the ftrace_ops being passed into
> __ftrace_hash_rec_update() at the start of the second tracer has an
> empty notrace hash, which leads to the eventual activation of the
> patch sites. However, graph_ops does not have an empty notrace hash
> when the user starts the first graph tracer. I am trying to understand
> this part of the code, and the relationship between graph_ops and the
> ops in the fgraph_array.
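
For context, my rough understanding of how the two hashes gate a
function (a paraphrase, not the exact kernel code):

/*
 * filter_hash  == EMPTY_HASH  => trace every function
 * notrace_hash == EMPTY_HASH  => exclude nothing
 *
 * A call site is enabled for an ops roughly when:
 *   (ftrace_hash_empty(filter_hash) ||
 *    ftrace_lookup_ip(filter_hash, ip)) &&
 *   !ftrace_lookup_ip(notrace_hash, ip)
 */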
Here is a possible reason: the updated notrace hash of the manager's
ops does not reflect the intersection of all users' notrace hashes.
According to the comment in ftrace_startup_subops():

 * o If either notrace_hash is empty then the final stays empty
 * o Otherwise, the final is an intersection between the hashes

It seems the wrong arguments were passed to intersect_hash(): the
filter hashes were passed into the function instead of the users'
notrace hashes. As a result, even if a function is marked as notrace
in all ftrace instances, the notrace hash of the manager's ops would
not contain that function, leaving the call site of the notrace'd
function enabled. If I understand correctly, we should pass the
notrace hashes instead of the filter hashes, as in the following
modification:
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4c28dd177ca6..3a194559d220 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3515,18 +3515,17 @@ int ftrace_startup_subops(struct ftrace_ops *ops, struct ftrace_ops *subops, int
            ftrace_hash_empty(subops->func_hash->notrace_hash)) {
                notrace_hash = EMPTY_HASH;
        } else {
-               size_bits = max(ops->func_hash->filter_hash->size_bits,
-                               subops->func_hash->filter_hash->size_bits);
+               size_bits = max(ops->func_hash->notrace_hash->size_bits,
+                               subops->func_hash->notrace_hash->size_bits);
                notrace_hash = alloc_ftrace_hash(size_bits);
                if (!notrace_hash) {
-                       free_ftrace_hash(filter_hash);
+                       free_ftrace_hash(notrace_hash);
                        return -ENOMEM;
                }
-               ret = intersect_hash(&notrace_hash, ops->func_hash->filter_hash,
-                                    subops->func_hash->filter_hash);
+               ret = intersect_hash(&notrace_hash, ops->func_hash->notrace_hash,
+                                    subops->func_hash->notrace_hash);
                if (ret < 0) {
-                       free_ftrace_hash(filter_hash);
                        free_ftrace_hash(notrace_hash);
                        return ret;
                }
I can confirm that the call sites of *kernel_text_address are no longer
patched after applying the above change, as long as they are set as
notrace in each ftrace instance. I can send out a fix patch if this
sounds like a proper fix.
>
> >
> > >
> > > E.g. the ftrace selftest:
> > > | $ ./ftracetest -vvv test.d/ftrace/fgraph-multi.tc
> > >
> > > will generate a lot of traces.
> > >
> > > Now, if we add:
> > > --8<--
> > > diff --git a/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-multi.tc b/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-multi.tc
> > > index ff88f97e41fb..4f30a4d81d99 100644
> > > --- a/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-multi.tc
> > > +++ b/tools/testing/selftests/ftrace/test.d/ftrace/fgraph-multi.tc
> > > @@ -84,6 +84,7 @@ cd $INSTANCE2
> > > do_test '*rcu*' 'rcu'
> > > cd $WD
> > > cd $INSTANCE3
> > > +echo '*kernel_text_address' > set_ftrace_notrace
> > > echo function_graph > current_tracer
> > >
> > > sleep 1
> > > -->8--
> > >
> > > The graph tracer will not honor the "set_ftrace_notrace" in $INSTANCE3,
> > > but still enable the *kernel_text_address traces. (Note that there are
> > > no filters in the test, so *all* ftrace recs will be enabled.)
> > >
> > > Are we holding the graph tracer wrong?
> >
> > What do you get when you do:
> >
> > # grep kernel_text_address available_filter_functions
> >
> > ?
>
> We get:
> kernel_text_address
> __kernel_text_address
>
> >
> > >
> > >
> > > Happy thanksgiving!
> > > Björn
> >
> > And Merry Christmas!
> >
> > -- Steve
> >
>
> And Happy New Year!
>
>
> Andy
Thanks,
Andy