netdev - Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20150210080526.1d8a119e@grimm.local.home>
Date:	Tue, 10 Feb 2015 08:05:26 -0500
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Alexei Starovoitov <ast@...mgrid.com>
Cc:	Ingo Molnar <mingo@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
	Arnaldo Carvalho de Melo <acme@...radead.org>,
	Jiri Olsa <jolsa@...hat.com>,
	Masami Hiramatsu <masami.hiramatsu.pt@...achi.com>,
	Linux API <linux-api@...r.kernel.org>,
	Network Development <netdev@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to
 tracepoints and syscalls

On Mon, 9 Feb 2015 22:10:45 -0800
Alexei Starovoitov <ast@...mgrid.com> wrote:

> One can argue that current TP_printk format is already an ABI,
> because somebody might be parsing the text output.

If somebody does, then it is an ABI. Luckily, it's not that useful to
parse, thus it hasn't been an issue. As Linus has stated in the past,
it's not that we can't change ABI interfaces, its just that we can not
change them if there's a user space application that depends on it.

The harder we make it for interface changes to break user space, the
better. The field layouts is a user interface. In fact, some of those
fields must always be there. This is because tools do parse the layout
and expect some events to have specific fields. Now we can add new
fields, or even remove fields that no user space tool is using. This is
because today, tools use libtraceevent to parse the event data.

This is why I'm nervous about exporting the parameters of the trace
event call. Right now, those parameters can always change, because the
only way to know they exist is by looking at the code. And currently,
there's no way to interact with those parameters. Once we have eBPF in
mainline, we now have a way to interact with the parameters and if
those parameters change, then the eBPF program will break, and if eBPF
can be part of a user space tool, that will break that tool and
whatever change in the trace point that caused this breakage would have
to be reverted. IOW, this can limit development in the kernel.

Al Viro currently does not let any tracepoint in VFS because he doesn't
want the internals of that code locked to an ABI. He's right to be
worried.

> so in some cases we cannot change tracepoints without
> somebody complaining that his tool broke.
> In other cases tracepoints are used for debugging only
> and no one will notice when they change...
> It was and still a grey area.

Not really. If a tool uses the tracepoint, it can lock that tracepoint
down. This is exactly what latencytop did. It happened, it's not a
hypothetical situation.

> bpf doesn't change any of that.
> It actually makes addition of new tracepoints easier.

I totally disagree. It adds more ways to see inside the kernel, and if
user space depends on this, it adds more ways the kernel can not change.

It comes down to how robust eBPF is with the internals of the kernel
changing. If we limit eBPF to system call tracepoints only, that's
fine because those have the same ABI as the system call itself. I'm
worried about the internal tracepoints for scheduling, irqs, file
systems, etc.


> In the future we might add a tracepoint and pass a single
> pointer to interesting data struct to it. bpf programs will walk
> data structures 'as safe modules' via bpf_fetch*() methods
> without exposing it as ABI.

Will this work if that structure changes? When the field we are looking
for no longer exists?

> whereas today we pass a lot of fields to tracepoints and
> make all of these fields immutable.

The parameters passed to the tracepoint are not shown to userspace and
can change at will. Now, we present the final parsing of the parameters
that convert to fields. As all currently known tools uses
libtraceevent.a, and parse the format files, those fields can move
around and even change in size. The structures are not immutable. The
fields are locked down if user space relies on them. But they can move
about within the tracepoint, because the parsing allows for it.

Remember, these are processed fields. The result of TP_fast_assign()
and what gets put into the ring buffer. Now what is passed to the
actual tracepoint is not visible by userspace, and in lots of cases, it
is just a pointer to some structure. What eBPF brings to the table is a
way to access this structure from user space. What keeps a structured
passed to a tracepoint from becoming immutable if there's a eBPF
program that expects it to have a specific field?


> 
> To me tracepoints are like gdb breakpoints.

Unfortunately, it doesn't matter what you think they are. It doesn't
matter what I think they are. What matters is what Linus thinks they
are and if user space depends on it and Linus decides to revert what
ever change broke that user space program, no matter what we think, we
just screwed ourselves.

I'm being stubborn on this because this is exactly what happened in the
past, which caused every trace point to hold 4 bytes of padding. 4
bytes may not sound like a lot, but when you have 1 million
tracepoints, that's 4 megs of wasted space.

> and bpf programs like live debugger that examine things.

If bpf programs only dealt with kprobes, I may agree. But tracepoints
have already been proven to be a type of ABI. If we open another window
into the kernel, this can screw us later. It's better to solve this now
than when we are fighting with Linus over user space breakage.

> 
> the next step is to be able to write bpf scripts on the fly
> without leaving debugger. Something like perf probe +
> editor + live execution. Truly like gdb for kernel.
> while kernel is running.

What we need is to know if eBPF programs are modules or a user space
interface. If they are a user interface then we need to be extremely
careful here. If they are treated the same as modules, then it would
not add any API. But that hasn't been settled yet, even if we have a
comment in the kernel.

Maybe what we should do is to make eBPF pass the kernel version it was
made for (with all the mod version checks). If it doesn't match, fail
to load it. Perhaps the more eBPF is limited like modules are, the
better chance we have that no eBPF program creates a new ABI.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html