lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <8186a9c5-8a78-4c0b-a0a3-256231ef8f86@linux.dev>
Date: Fri, 24 Jan 2025 09:35:08 -0800
From: Yonghong Song <yonghong.song@...ux.dev>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: Breno Leitao <leitao@...ian.org>, Jason Xing <kerneljasonxing@...il.com>,
 Eric Dumazet <edumazet@...gle.com>, Masami Hiramatsu <mhiramat@...nel.org>,
 Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
 "David S. Miller" <davem@...emloft.net>, David Ahern <dsahern@...nel.org>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Simon Horman <horms@...nel.org>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
 kernel-team@...a.com, Song Liu <song@...nel.org>,
 Martin KaFai Lau <martin.lau@...nel.org>
Subject: Re: [PATCH RFC net-next] trace: tcp: Add tracepoint for
 tcp_cwnd_reduction()




On 1/24/25 7:50 AM, Steven Rostedt wrote:
> On Thu, 23 Jan 2025 20:40:20 -0800
> Yonghong Song <yonghong.song@...ux.dev> wrote:
>
>> Hi, Steve,
>>
>> I did some prototype to support bpf dynamic_events tracepoint. I fixed a couple of issues
> Before I go further, I want to make sure that we have our terminology's
> inline. A dynamic event is one that is created at run time. For example, a
> kprobe event is a dynamic event. Other dynamic events are synthetic events,
> event probes, and uprobes.
>
>> but still not working. See the hacked patch:
>>
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 9ea4c404bd4e..729ea1c21c94 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -824,6 +824,15 @@ DECLARE_TRACE(sched_compute_energy_tp,
>>                    unsigned long max_util, unsigned long busy_time),
>>           TP_ARGS(p, dst_cpu, energy, max_util, busy_time));
>>                                                                                                                                             
>> +/* At /sys/kernel/debug/tracing directory, do
>> + *   echo 't sched_switch_dynamic' >> dynamic_events
>> + * before actually use this tracepoint. The tracepoint will be at
>> + *   /sys/kernel/debug/tracing/events/tracepoints/sched_switch_dynamic/
>> + */
>> +DECLARE_TRACE(sched_switch_dynamic,
>> +       TP_PROTO(bool preempt),
>> +       TP_ARGS(preempt));
> The above is just creating a tracepoint and no "event" is attached to it.
>
> tracepoints (defined in include/linux/tracepoints.h) are what creates the
> hooks within the kernel. The trace_##call() functions, like trace_sched_switch().
> A trace event is something that can be attached to tracepoints, and show up
> in the tracefs file system. The TRACE_EVENT() macro will create both a
> tracepoint and a trace event that attaches to the tracepoint it created.
>
>> +
>>    #endif /* _TRACE_SCHED_H */
>>                                                                                                                                             
>>    /* This part must be outside protection */
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 065f9188b44a..37391eb5089f 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -10749,6 +10749,7 @@ static inline bool perf_event_is_tracing(struct perf_event *event)
>>    int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
>>                               u64 bpf_cookie)
>>    {
>> +       u32 dyn_tp_flags = TRACE_EVENT_FL_DYNAMIC | TRACE_EVENT_FL_FPROBE;
>>           bool is_kprobe, is_uprobe, is_tracepoint, is_syscall_tp;
>>     
>>           if (!perf_event_is_tracing(event))
>> @@ -10756,7 +10757,9 @@ int perf_event_set_bpf_prog(struct perf_event *event, struct bpf_prog *prog,
>>     
>>           is_kprobe = event->tp_event->flags & TRACE_EVENT_FL_KPROBE;
>>           is_uprobe = event->tp_event->flags & TRACE_EVENT_FL_UPROBE;
>> -       is_tracepoint = event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT;
>> +       is_tracepoint = (event->tp_event->flags & TRACE_EVENT_FL_TRACEPOINT) ||
>> +                       ((event->tp_event->flags & dyn_tp_flags) == dyn_tp_flags);
>> +
>>           is_syscall_tp = is_syscall_trace_event(event->tp_event);
>>           if (!is_kprobe && !is_uprobe && !is_tracepoint && !is_syscall_tp)
>>                   /* bpf programs can only be attached to u/kprobe or tracepoint */
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3e5a6bf587f9..53b3d9e20d00 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6750,6 +6750,7 @@ static void __sched notrace __schedule(int sched_mode)
>>                   psi_account_irqtime(rq, prev, next);
>>                   psi_sched_switch(prev, next, block);
>>     
>> +               trace_sched_switch_dynamic(preempt);
>>                   trace_sched_switch(preempt, prev, next, prev_state);
>>     
>>                   /* Also unlocks the rq: */
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index 71c1c02ca7a3..8f9fd2f347ef 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -2448,7 +2448,8 @@ int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
>>                               u64 *probe_offset, u64 *probe_addr,
>>                               unsigned long *missed)
>>    {
>> -       bool is_tracepoint, is_syscall_tp;
>> +       u32 dyn_tp_flags = TRACE_EVENT_FL_DYNAMIC | TRACE_EVENT_FL_FPROBE;
>> +       bool is_tracepoint, is_dyn_tracepoint, is_syscall_tp;
>>           struct bpf_prog *prog;
>>           int flags, err = 0;
>>     
>> @@ -2463,9 +2464,10 @@ int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
>>           *prog_id = prog->aux->id;
>>           flags = event->tp_event->flags;
>>           is_tracepoint = flags & TRACE_EVENT_FL_TRACEPOINT;
>> +       is_dyn_tracepoint = (event->tp_event->flags & dyn_tp_flags) == dyn_tp_flags;
>>           is_syscall_tp = is_syscall_trace_event(event->tp_event);
>>     
>> -       if (is_tracepoint || is_syscall_tp) {
>> +       if (is_tracepoint || is_dyn_tracepoint || is_syscall_tp) {
>>                   *buf = is_tracepoint ? event->tp_event->tp->name
>>                                        : event->tp_event->name;
>>                   /* We allow NULL pointer for tracepoint */
>> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
>> index c62d1629cffe..bacc4a1f5f20 100644
>> --- a/kernel/trace/trace_fprobe.c
>> +++ b/kernel/trace/trace_fprobe.c
>> @@ -436,8 +436,10 @@ static struct trace_fprobe *find_trace_fprobe(const char *event,
>>     
>>    static inline int __enable_trace_fprobe(struct trace_fprobe *tf)
>>    {
>> -       if (trace_fprobe_is_registered(tf))
>> +       if (trace_fprobe_is_registered(tf)) {
>> +               pr_warn("fprobe is enabled\n");
>>                   enable_fprobe(&tf->fp);
>> +       }
>>     
>>           return 0;
>>    }
>> diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
>> index 1702aa592c2c..423770aa581e 100644
>> --- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
>> +++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
>> @@ -196,6 +196,8 @@ static void test_send_signal_common(struct perf_event_attr *attr,
>>           /* notify child safe to exit */
>>           ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
>>     
>> +       ASSERT_EQ(skel->bss->dyn_tp_visited, 1, "dyn_tp_visited");
>> +
>>    disable_pmu:
>>           close(pmu_fd);
>>    destroy_skel:
>> @@ -260,6 +262,8 @@ void test_send_signal(void)
>>    {
>>           if (test__start_subtest("send_signal_tracepoint"))
>>                   test_send_signal_tracepoint(false, false);
>> +/* Disable all other subtests except above send_signal_tracepoint. */
>> +if (0) {
>>           if (test__start_subtest("send_signal_perf"))
>>                   test_send_signal_perf(false, false);
>>           if (test__start_subtest("send_signal_nmi"))
>> @@ -285,3 +289,4 @@ void test_send_signal(void)
>>           if (test__start_subtest("send_signal_nmi_thread_remote"))
>>                   test_send_signal_nmi(true, true);
>>    }
> The above looks like you are trying to create your own dynamic event on top
> of the tracepoint you made.
>
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>> index 176a355e3062..9b580d437046 100644
>> --- a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>> +++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
>> @@ -60,6 +60,20 @@ int send_signal_tp_sched(void *ctx)
>>           return bpf_send_signal_test(ctx);
>>    }
>>     
>> +int dyn_tp_visited = 0;
>> +#if 1
>> +/* This will fail */
>> +SEC("tracepoint/tracepoints/sched_switch_dynamic")
>> +#else
>> +/* This will succeed */
>> +SEC("tracepoint/sched/sched_switch")
>> +#endif
> I don't know bpf code at all, so I don't know what this is trying to do. Is
> it looking at the tracefs file system?

Thanks for the above clarification for terminologies.
For the above code, yet, I try to look at tracefs file system since
at runtime once we add 'sched_switch_dynamic' to dynamic_events,
the above tracepoint tracepoints/sched_switch_dynamic will show up
in /sys/kernel/debug/tracing/events/ directory.

>
> What I was suggesting, was not to use trace events at all (nothing should
> be added to the tracefs files system). You would need some code inside the
> kernel to search for tracepoints that are not exported to tracefs and then
> BPF could provide its own hooks to them. Isn't this what BPF raw tracepoints do?

Thanks for suggestion! I tried and raw tracepoint indeed work. The following
is a hack:

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9ea4c404bd4e..db28c5c37a10 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -824,6 +824,10 @@ DECLARE_TRACE(sched_compute_energy_tp,
                  unsigned long max_util, unsigned long busy_time),
         TP_ARGS(p, dst_cpu, energy, max_util, busy_time));
                                                                                                                                           
+DECLARE_TRACE(sched_switch_hack,
+       TP_PROTO(bool preempt),
+       TP_ARGS(preempt));
+
  #endif /* _TRACE_SCHED_H */
                                                                                                                                           
  /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 88a9a515b2ba..df3e52d89c94 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6754,6 +6754,7 @@ static void __sched notrace __schedule(int sched_mode)
                 psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
                                              prev->se.sched_delayed);
                                                                                                                                           
+               trace_sched_switch_hack(preempt);
                 trace_sched_switch(preempt, prev, next, prev_state);
   
                 /* Also unlocks the rq: */
diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
index 1702aa592c2c..937d5e05fe08 100644
--- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
+++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
@@ -195,6 +195,8 @@ static void test_send_signal_common(struct perf_event_attr *attr,
   
         /* notify child safe to exit */
         ASSERT_EQ(write(pipe_p2c[1], buf, 1), 1, "pipe_write");
+       if (!attr)
+               ASSERT_EQ(skel->bss->dyn_tp_visited, 1, "dyn_tp_visited");
   
  disable_pmu:
         close(pmu_fd);
diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
index 176a355e3062..abed1a55ae3b 100644
--- a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c
@@ -60,6 +60,15 @@ int send_signal_tp_sched(void *ctx)
         return bpf_send_signal_test(ctx);
  }
   
+int dyn_tp_visited = 0;
+
+SEC("raw_tp/sched_switch_hack")
+int send_signal_dyn_tp_sched(void *ctx)
+{
+       dyn_tp_visited = 1;
+       return 0;
+}
+
  SEC("perf_event")
  int send_signal_perf(void *ctx)
  {

The raw_tp/sched_switch_hack works great!

>
>> +int send_signal_dyn_tp_sched(void *ctx)
>> +{
>> +       dyn_tp_visited = 1;
>> +       return 0;
>> +}
>> +
>>    SEC("perf_event")
>>    int send_signal_perf(void *ctx)
>>    {
>>
>> To test the above, build the latest bpf-next and then apply the above change.
>> Boot the updated kernel and boot into a qemu VM and run bpf selftest
>>     ./test_progs -t send_signal
>>
>> With tracepoint tracepoint/sched/sched_switch, the test result is correct.
>> With tracepoint tracepoint/tracepoints/sched_switch_dynamic, the test failed.
>> The test failure means the sched_switch_dynamic tracepoint is not
>> triggered with the bpf program.
>>
>> As expected, do
>>     echo 1 > /sys/kernel/debug/tracing/events/tracepoints/sched_switch_dynamic/enable
>> and the sched_switch_dynamic tracepoint works as expected (trace_pipe can dump
>> the expected result).
>>
>> For both 'echo 1 > .../sched_switch_dynamic/enable' approach and
>> bpf tracepoint tracepoint/tracepoints/sched_switch_dynamic, the message
>>     fprobe is enabled
>> is printed out in dmesg. The following is enable_fprobe() code.
>>
>> /**
>>    * enable_fprobe() - Enable fprobe
>>    * @fp: The fprobe to be enabled.
>>    *
>>    * This will soft-enable @fp.
>>    */
>> static inline void enable_fprobe(struct fprobe *fp)
>> {
>>           if (fp)
>>                   fp->flags &= ~FPROBE_FL_DISABLED;
>> }
>>
>> Note that in the above the fprobe/dynamic_events is soft-enable.
>> Maybe the bpf tracepoint/tracepoints/sched_switch_dynamic only has
>> soft-enable and 'echo 1 > ...' approach has both soft-enable and
>> actual hard-enable (at pmu level)? If this is the case, what is
>> missing for hard-enable for bpf dynamic_events case? Do you have
>> any suggestions?
>>
>>
>> The above prototype tries to reuse the existing infra/API.
>> If you have better way to support dynamic_events, please let me know.
> Either create a full trace event (one that is exposed in tracefs), or add a
> direct hook to the tracepoint you want. What the above looks like is some
> kind of mixture of the two. No "dynamic events" should be involed.

Indeed, as you suggested in the above, raw_tracepoint works fine for bpf
for those DECLARE_TRACE tracepoints. We can just use it.

>
> -- Steve
>


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ