linux-kernel - Re: [PATCH v8 00/12] user_events: Enable user processes to create and write to trace events

Message-ID: <20220419002549.GA2055@kbox>
Date:   Mon, 18 Apr 2022 17:25:49 -0700
From:   Beau Belgrave <beaub@...ux.microsoft.com>
To:     Hagen Paul Pfeifer <hagen@...u.net>
Cc:     rostedt@...dmis.org, mhiramat@...nel.org,
        linux-trace-devel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Subject: Re: [PATCH v8 00/12] user_events: Enable user processes to create
 and write to trace events

On Mon, Apr 18, 2022 at 10:43:29PM +0200, Hagen Paul Pfeifer wrote:
> * Beau Belgrave | 2021-12-16 09:34:59 [-0800]:
> 
> >The typical scenario is on process start to mmap user_events_status. Processes
> >then register the events they plan to use via the REG ioctl. The ioctl reads
> >and updates the passed in user_reg struct. The status_index of the struct is
> >used to know the byte in the status page to check for that event. The
> >write_index of the struct is used to describe that event when writing out to
> >the fd that was used for the ioctl call. The data must always include this
> >index first when writing out data for an event. Data can be written either by
> >write() or by writev().
> 
> Hey Beau, a little bit late to the party. A few questions from my side: What
> are the exact weak points of USDT compared to User Events that stand in the
> way of extending USDT further (in a non-compatible way, sure, just as a
> different approach!)? The nice thing about USDT is that I can search for all
> possible probes of the system via "find / | readelf | ". Since they are listed
> in a dedicated ELF section (.note.stapsdt) - they are visible & transparent. I
> can also map a hierarchy/structure in Executable/DSO via clever choice of
> names. The big disadvantage of USDT is the lack of type information, but from
> a registration, explicit point of view, they are nice.
> 
> Or in other words: why not extend the USDT approach? Why not
> 
> u32 val = 23;
> const char *garbage = "tracestring";
> 
> DYNAMIC_TRACE_PROBE2("foo:bar", val, u32, garbage, cstring);
> 

We actually tried some USDT extension methods early on, by extending the
.note.stapsdt sections and seeing how far we could get our definitions
into that form.

There are a few problems when running in a heavily containerized/cgroup
environment, even if you can get our formats into stapsdt.

It costs a lot to traverse every ELF file on the machine to find all
the notes. When profiling or tracing many containers, each cgroup's
mount space must be entered and then tracked. Because these files are
in different locations, each needs a separate probe definition: the
definitions/patches are tied to the location of the binary to patch.

As new cgroups come online, we would have to keep track of each new
binary location and find probes that match their location. This becomes
really hard to manage if, for example, we just want to always enable a
specific event regardless of where it is on the filesystem. Events are
limited to a max of 2^16, so having many duplicate events in the system
might start to approach that limit for high-core machines with many
small cgroup isolations.
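
As a rough illustration of the limit above, here is a minimal sketch of
the arithmetic. It assumes a simplified model in which every container
carries its own copy of a binary and so needs its own copy of each
probe's event. The function names and the example counts are
hypothetical; only the 2^16 ceiling comes from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limit mentioned above: at most 2^16 events on the system. */
#define MAX_EVENTS (1u << 16)

/* Hypothetical model: one duplicated event per probe per container. */
static uint64_t events_needed(uint64_t containers, uint64_t probes_per_binary)
{
    return containers * probes_per_binary;
}

/* True if the duplicated events still fit under the limit. */
static bool fits_event_limit(uint64_t containers, uint64_t probes_per_binary)
{
    return events_needed(containers, probes_per_binary) <= MAX_EVENTS;
}
```

Under this model, 500 containers with 200 probes each would already need
100000 events, which is more than 65536.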

We run programs that are built on interpreted or JIT'd code (C#,
JavaScript, etc.). These don't have great places to put a stap
definition, since they aren't ELF files. I've seen approaches where
temporary ELF files are generated; however, this costs a lot. Now we
have even more temporary files to patch, meaning more events and
more probe definitions (many of them in our case would be duplicates of
the others).

Our production environments are locked down heavily, with both SELinux
and IPE enabled. This prevents us from patching user mode code on the
fly, so the typical perf probe calls fail here.

We typically want to know what events are available to us with very
little overhead. Since programs already register to a well-known
location (trace_events, tracefs), I can easily see all the user events
on the system by just doing ls on /sys/kernel/tracing/events/user_events.
I can also see all their data formats and easily enable hist and
filtering, since these formats are known to the kernel.
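
For illustration, the ls above can be done programmatically. The sketch
below is a plain directory walk, not part of the patch series; the
helper name list_events is hypothetical. The directory may also contain
control files such as "enable" next to the per-event subdirectories,
and reading tracefs usually needs elevated privileges.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Print each entry of an events directory and return how many were
 * found, or -1 if the directory cannot be opened (for example tracefs
 * is not mounted, permissions are missing, or user_events is absent). */
static int list_events(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    if (!dir)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip the "." and ".." entries. */
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;
        printf("%s\n", entry->d_name);
        count++;
    }
    closedir(dir);
    return count;
}
```

A caller would pass the path from the text, e.g.
list_events("/sys/kernel/tracing/events/user_events"), and treat -1 as
"not available on this system".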

In our testing, uprobes are much more costly to the running program than
the write syscall.
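
To make the write path concrete, here is a minimal sketch of what the
quoted cover letter describes: check the event's byte in the status
page, then write the write_index first, followed by the payload, with
writev(). It assumes a 32-bit index and a byte-per-event status page;
the helper names are hypothetical, and the mmap and REG ioctl steps are
left out, so any fd and any buffer can stand in for the real ones.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Non-zero status byte means something is listening to the event.
 * status_page would normally be the mmap'd user_events_status page. */
static int event_enabled(const volatile unsigned char *status_page,
                         uint32_t status_index)
{
    return status_page[status_index] != 0;
}

/* Write one event: the index goes first, then the payload bytes.
 * Returns the number of bytes written, or -1 on error (see errno). */
static ssize_t write_event(int fd, uint32_t write_index,
                           const void *payload, size_t len)
{
    struct iovec iov[2];

    iov[0].iov_base = &write_index;
    iov[0].iov_len = sizeof(write_index);
    iov[1].iov_base = (void *)payload; /* writev does not modify it */
    iov[1].iov_len = len;

    return writev(fd, iov, 2);
}
```

The status check keeps the disabled case to a single memory read, so
the syscall is only paid when the event is actually enabled.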

For managed code, such as Java, code moves around and is not always
in a static location, so the probe locations can change. Calling from
a managed location into a native one has performance implications as
well when using a dynamic/temp ELF stub approach.

We are actively using user_events to solve these problems in our
environments, which previously saw high overheads to achieve the
same results. Often we cannot afford to miss any events, so live
scanning for new ELF files doesn't work for us, as the programs and
cgroups are short-lived.

> 
> Sure, the argument names, here "val" and "garbage" should also be saved. I
> also like the "just one additional header to the project to get things
> running" (#include "sdt.h"). Sure, a DYNAMIC_TRACE_IS_ACTIVE("foo:bar") would
> be great. But in fact we have never needed that in the past.
> 
> 
> hgn

Thanks,
-Beau