Message-ID: <f62b63b1-e7fa-42e2-b5f6-8cf1bd81426d@arm.com>
Date: Tue, 25 Mar 2025 20:13:12 +0000
From: Douglas Raillard <douglas.raillard@....com>
To: Steven Rostedt <rostedt@...dmis.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
Alice Ryhl <aliceryhl@...gle.com>
Subject: Re: [PATCH 3/3] tracing: Rename trace_synth() to synth_event_trace2()
On 25-03-2025 18:16, Douglas Raillard wrote:
> On 25-03-2025 16:25, Steven Rostedt wrote:
>> On Tue, 25 Mar 2025 16:05:00 +0000
>> Douglas Raillard <douglas.raillard@....com> wrote:
>>
>>> Yes, the dynamic API was exactly what I needed, unfortunately. I'm porting the kernel module we have to Rust (using our own bindings as
>>> we cannot mandate CONFIG_RUST=y). While I have some support for C code generation based on the Rust source, it is significantly more
>>> convenient to simply use a dynamic API. In the current state of my code, defining an event is as simple as:
>>>
>>> let f = new_event! {
>>>     lisa__myevent2,
>>>     fields: {
>>>         field1: u8,
>>>         field3: &CStr,
>>>         field2: u64,
>>>     }
>>> }?;
>>> // Emit the event
>>> f(1, c"hello", 2);
>>> f(3, c"world", 4);
>>>
>>> So it's as non-confusing as can be: the event name is stated plainly and the field names and types are mentioned once, with no repetition.
>>> The result can simply be called to emit the event, and when dropped, the event is unregistered.
>>
>> Interesting.
>>
>>>
>>>
>>> On top of that in ways unrelated to Rust:
>>> 1. We have some really high-traffic events (PELT-related) that include some data that is not always needed (e.g. taskgroup name).
>>> We currently regularly hit memory size limitations on our test device (Pixel 6), and using trace-cmd record instead of
>>> trace-cmd start is not always a good idea as it will have an effect on scheduling, disturbing the very thing we are trying
>>> to observe. Having the dynamic API means we could simply omit the fields of the event that we don't care about for a specific
>>> experiment via dynamic configuration.
>>>
>>> 2. Some events may or may not be supported on the system based on some runtime condition. Currently, there is no great way for
>>> the module to feed back that info. No matter what, the event exists, even if the machinery that is supposed to emit it is
>>> disabled for whatever reason. If some user starts tracing the event, "everything will work" but there will be no event in the
>>> trace. That may make the user think a condition did not trigger, whereas in fact the whole machinery was simply not operating.
>>> Being able to register the event dynamically (or not) solves that cleanly, as enabling the event will simply fail since it won't
>>> have been registered in the first place.
>>>
>>> 3. We will likely at some point relay events coming from some non-CPU part of the system into the ftrace buffer. If and when that
>>> happens, it would be ideal if that producer could simply declare the events it supports, letting our module dynamically create the
>>> matching synthetic events. That would avoid any problem coming from mismatched sources that would otherwise plague such a setup.
>>>
>>> So considering things seem to work overall with not much tuning needed, it would be sad to have to revert to the TRACE_EVENT() macro
>>> and end up having to do more work for a worse result. If that's the route you choose though, it may be nice to amend the
>>> documentation to clearly state this API is testing-only and not supported in normal use cases, as it currently reads:
>>>
>>> The trace event subsystem provides an in-kernel API allowing modules
>>> or other kernel code to generate user-defined 'synthetic' events at
>>> will, which can be used to either augment the existing trace stream
>>> and/or signal that a particular important state has occurred.
>>
>> Note, there is also a TRACE_CUSTOM_EVENT() macro that can attach to any
>> tracepoint (including those that have a TRACE_EVENT() already attached to
>> them) and create a new trace event that shows up in tracefs.
>
> I think I came across that at some point indeed. We probably could make use
> of it for some of the events we emit, but not all. Also this was introduced
> in 2022, but I need to support v5.15 kernels (2021), so maybe one day but not today :(
>
> The event that does not fit this use case is emitted from a workqueue worker that
> fires at regular interval to poll for some data. So there is no particular tracepoint
> to attach to, we just emit an event from our own code. In the future, we will
> probably end up having more of that sort.
>
>> I still do not think "synthetic events" should be used for this purpose.
>> They are complex enough with the user interface that we shouldn't have them
>> created in the kernel. That documentation should be rewritten to not make
>> it sound like the kernel or modules can create them.
>>
>> That said, it doesn't mean you can't use dynamic events for this purpose.
>> Now what exactly is that "new_event!" doing?
>
> That new_event!() macro takes the name and the list of fields and creates two related
> things:
> 1. An EventDesc value that contains a Vec of the field names and their types.
> Creating it registers the synthetic event, and dropping it unregisters it.
>
> 2. An associated closure that takes one parameter per field, builds an array out of them,
> and, using the captured EventDesc, emits the event.
>
> Since the closure captures the EventDesc, the EventDesc cannot be dropped while some code
> is able to call the closure, and since the EventDesc is responsible for the lifecycle of
> the kernel synthetic event, there is no risk of "use after unregister".
>
> The value returned by the macro is the closure, so it can just be called like a normal
> function, but dropping it will also drop the captured EventDesc and unregister the event.
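
To make that a bit more concrete, here is a rough, simplified sketch of what new_event!()
conceptually expands to. EventDesc::register(), desc.trace_array() and unregister_synth_event()
are placeholder names for my wrappers around the synth_event_create() /
synth_event_trace_array() / synth_event_delete() C API, and error handling is elided:

struct EventDesc {
    name: &'static CStr,
    // (field name, field type name) pairs handed over to the synthetic event API
    fields: Vec<(&'static str, &'static str)>,
}

impl Drop for EventDesc {
    fn drop(&mut self) {
        // Unregister the synthetic event (wrapper around synth_event_delete())
        unregister_synth_event(self.name);
    }
}

// Register the synthetic event (wrapper around synth_event_create())
let desc = EventDesc::register(
    c"lisa__myevent2",
    vec![
        ("field1", u8::NAME),
        ("field3", <&CStr as FieldTy>::NAME),
        ("field2", u64::NAME),
    ],
)?;

// The value returned by the macro: a closure capturing desc by value, so the
// event cannot be unregistered while the closure is still callable.
let f = move |field1: u8, field3: &CStr, field2: u64| {
    // to_u64() comes from the FieldTy trait described below.
    let vals = [field1.to_u64(), field3.to_u64(), field2.to_u64()];
    // Wrapper around synth_event_trace_array()
    desc.trace_array(&vals);
};
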
>
> The field types have to implement a FieldTy trait, which exposes both the type name
> as expected by the synthetic events API and conversion of a value to u64 for preparing
> the array passed to synth_event_trace_array():
>
> pub trait FieldTy {
>     const NAME: &'static str;
>     fn to_u64(self) -> u64;
> }
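
For the three field types used in the examples, the impls are essentially trivial. The type-name
strings below (in particular "char[]" for the dynamic string case) are illustrative of what ends
up being handed to the synthetic event API:

impl FieldTy for u8 {
    const NAME: &'static str = "u8";
    fn to_u64(self) -> u64 {
        self as u64
    }
}

impl FieldTy for u64 {
    const NAME: &'static str = "u64";
    fn to_u64(self) -> u64 {
        self
    }
}

// For a dynamic string field, the value placed in the u64 array is the
// address of the NUL-terminated string.
impl<'a> FieldTy for &'a CStr {
    const NAME: &'static str = "char[]";
    fn to_u64(self) -> u64 {
        self.as_ptr() as u64
    }
}
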
>
> Once the whole thing is shipping, I plan on adding a way to pass to new_event!() a runtime
> list of fields to actually enable, so that the event size can be dynamically reduced to
> what is needed, saving precious MB of RAM in the buffer, e.g.:
>
> // Coming from some runtime user config, via module parameter or sysfs
> let enabled_fields = vec!["field1", "field2"];
>
> let f = crate::runtime::traceevent::new_event! {
>     lisa__myevent2,
>     fields: {
>         field1: u8,
>         field3: &CStr,
>         field2: u64,
>     },
>     enabled_fields,
> }?;
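
Under the hood this should mostly amount to filtering the declared fields against enabled_fields,
both when registering the event and when building the u64 array passed to
synth_event_trace_array(), roughly:

// Field list as declared at the new_event!() call site.
let declared = [("field1", "u8"), ("field3", "char[]"), ("field2", "u64")];

// Only the fields the user asked for get registered; the closure then skips
// the to_u64() conversion for the others, so they never take up space in the
// ring buffer.
let kept: Vec<_> = declared
    .iter()
    .filter(|(name, _)| enabled_fields.iter().any(|f| f == name))
    .collect();
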
>
>> I guess, how is that different from a kprobe event? A kprobe event is a
>> dynamic trace event in the kernel. Could you do the same with that? Or is
>> this adding a nop in the code like a real event would?
>
> Considering we not only need to emit events from existing tracepoints but also just from
> our own code (a workqueue worker gathering data not obtained via tracepoints), I don't think
> a kprobe-based solution is best, unless I missed some usage of it.
>
>>
>> I'm not against rust dynamic events, but I do not think synthetic events
>> are the answer. It will just cause more confusion.
>
> Is there some way of creating an event that would not hardcode the details at compile time
> like the TRACE_EVENT() macro does? In ways that are not designed to specifically feed a
> well-known upstream data source into ftrace like tracepoints or kprobes. Our kernel module
> is essentially sitting at the same layer as TRACE_CUSTOM_EVENT() or kprobe events, where
> it takes some data source (some of it being from our own module) and feeds it into ftrace.
>
> Maybe what I'm looking for is the dynevent_cmd API? From the kernel doc:
>
> Both the synthetic event and k/ret/probe event APIs are built on top of a
> lower-level “dynevent_cmd” event command API, which is also available for more
> specialized applications, or as the basis of other higher-level trace event
> APIs.
Answering myself: this is a no-go, as it is indeed the basis of other higher-level trace event
APIs, but not ones created from modules, since almost none of that API is exported.
>
> I'd rather avoid shipping an out-of-tree reimplementation of synthetic events if possible,
> but if that API is public and usable from a module, I could live with that, as long as it allows
> creating "well-behaved" events (available in tracefs, enable/disable working,
> instances working, etc.) and is stable enough that the same code can reasonably be expected to work
> from v5.15 to now with minor tweaks only.
>
>> I also added Alice to the Cc, as she has created trace events for Rust.
>
> Interesting, I could not find anything in the upstream bindings. All there seems to be is
> a crude tracepoint binding that only allows emitting a tracepoint already defined in C.
> If that's of any interest, I also have code to attach probes to tracepoints dynamically from Rust [1].
>
> My plan B for emitting events is to use the infrastructure I have to include C-based functions in the
> Rust codebase to sneakily use the TRACE_EVENT() macro like any non-confusing module [2].
Actually, plan B can't fly: there is too much similar-but-different repetition in the TRACE_EVENT() macro
(TP_ARGS() and TP_PROTO()) and the C syntax is not helping either (trailing comma forbidden).
This will have to be plan C, with a proc macro defining a custom syntax and generating the string literal
at compile time, or defining it in C sources and dealing with copy/pasted Rust bindings.
> [1] The tracepoint probe bindings I have look like this:
>
> // Create a probe that increments a captured atomic counter.
> //
> // The new_probe!() macro creates:
> // 1. The closure
> // 2. A probe function that uses the first void* arg to pass the closure pointer
> // and then calls it
> // 3. A Probe value that ties together 1. and 2. above. This is the value returned
> // by the macro.
> let x = AtomicUsize::new(0);
> let probe = new_probe!(
>     // That dropper is responsible for dropping all the probes attached to it,
>     // so that we pay only once for tracepoint_synchronize_unregister() rather
>     // than for each probe.
>     &dropper,
>     // Essentially a closure, with slightly odd syntax for boring reasons.
>     (preempt: bool, prev: *const c_void, next: *const c_void, prev_state: c_ulong) {
>         let x = x.fetch_add(1, Ordering::SeqCst);
>         crate::runtime::printk::pr_info!("SCHED_SWITCH {x}");
>     }
> );
>
> // Lookup the tracepoint dynamically. It's unsafe as the lifetime of a
> // struct tracepoint is unknown if defined in a module, 'static otherwise.
> // Also unsafe since the tp signature is not checked, as the lookup is 100% dynamic.
> // I could probably make use of typeof(trace_foobar) to typecheck that against the
> // C API if the lookup was not dynamic, at the cost of failing to load the module rather than
> // being able to dynamically deal with the failure if the tp does not exist.
> let tp = unsafe {
>     Tracepoint::<(bool, *const c_void, *const c_void, c_ulong)>::lookup("sched_switch")
>         .expect("tp not found")
> };
>
> // Attach the probe to the tracepoint. If the signature does not match, it won't typecheck.
> let registered = tp.register_probe(&probe);
>
> // When dropped, the RegisteredProbe value detaches the probe from the
> // tracepoint. The probe will only be actually deallocated when its
> // associated dropper is dropped, after which the dropper calls
> // tracepoint_synchronize_unregister().
> drop(registered);
>
>
> [2] FFI infrastructure to include C code inside the Rust code for writing bindings.
> This is a function defined and usable inside the Rust code that has a body written in C:
>
> #[cfunc]
> fn myfunction(param: u64) -> bool {
>     r#"
>     // Any C code allowed, we are just before the function
>     // definition
>     #include <linux/foo.h>
>     "#;
>     r#"
>     // Any C code allowed, we are inside a function here.
>     return call_foo_function(param);
>     "#;
> }
>
> The r#" "# syntax is just Rust multiline string literals. The optional first literal gets dumped before the function,
> the mandatory second one inside, and an optional third one after. The #[cfunc] proc macro takes care of generating the C
> prototype matching the Rust one and of wiring everything up in a way that CFI is happy with. Since the C code is generated
> at compile time as a string, any const operation is allowed on it, so I could work my way to a snippet of C that
> uses TRACE_EVENT() in the first string literal and calls the trace_* function in the function body.
>
> The C code ends up in a section of the Rust object file that is extracted with objdump in the Makefile and fed back to
> Kbuild. That's very convenient to use and avoids splitting away one-liners like that, it does not break randomly
> like cbindgen does on some Rust code I don't control coming from 3rd-party crates, and since the module is built at runtime
> by the user, it also has the big advantage of not requiring cbindgen to be built in the "tepid" path.
>
>> -- Steve
>
>
> Douglas