linux-kernel - Re: Edited draft of bpf(2) man page for review/enhancement

Message-ID: <55658DB4.6000106@iogearbox.net>
Date:	Wed, 27 May 2015 11:26:12 +0200
From:	Daniel Borkmann <daniel@...earbox.net>
To:	"Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>,
	Alexei Starovoitov <ast@...mgrid.com>
CC:	Silvan Jegen <s.jegen@...il.com>, linux-man@...r.kernel.org,
	linux-kernel@...r.kernel.org, Walter Harms <wharms@....de>
Subject: Re: Edited draft of bpf(2) man page for review/enhancement

On 05/27/2015 10:43 AM, Michael Kerrisk (man-pages) wrote:
> Hello Alexei,
>
> I took the draft 3 of the bpf(2) man page that you sent back in March
> and did some substantial editing to clarify the language and add a
> few technical details. Could you please check the revised  version
> below, to ensure I did not inject any errors.
>
> I also added a number of FIXMEs for pieces of the page that need
> further work. Could you take a look at these and let me know your
> thoughts, please.

That's great, thanks! Minor comments:

...
> .TH BPF 2 2015-03-10 "Linux" "Linux Programmer's Manual"
> .SH NAME
> bpf - perform a command on an extended BPF map or program
> .SH SYNOPSIS
> .nf
> .B #include <linux/bpf.h>
> .sp
> .BI "int bpf(int cmd, union bpf_attr *attr, unsigned int size);
>
> .SH DESCRIPTION
> The
> .BR bpf ()
> system call performs a range of operations related to extended
> Berkeley Packet Filters.
> Extended BPF (or eBPF) is similar to
> the original BPF (or classic BPF) used to filter network packets.
> For both BPF and eBPF programs,
> the kernel statically analyzes the programs before loading them,
> in order to ensure that they cannot harm the running system.
> .P
> eBPF extends classic BPF in multiple ways including the ability to call
> in-kernel helper functions (via the
> .B BPF_CALL
> opcode extension provided by eBPF)
> and access shared data structures such as BPF maps.

I would perhaps emphasize that maps can be shared among in-kernel
eBPF programs, but also between kernel and user space.

> The programs can be written in a restricted C that is compiled into
> .\" FIXME In the next line, what is "a restricted C"? Where does
> .\"       one get further information about it?

So far, only from the kernel samples directory. For tc classifiers and
actions, there is also the tc man page and/or examples/bpf/ in the tc
git tree.

> eBPF bytecode and executed on the in-kernel virtual machine or
> just-in-time compiled into native code.
> .SS Extended BPF Design/Architecture
> .P
> .\" FIXME In the following line, what does "different data types" mean?
> .\"       Are the values in a map not just blobs?

Sort of. Currently, these blobs can have different key and value
sizes (you can even use structs as keys). The map itself treats them
as blobs internally. However, bpf tail call was recently added, where
you can look up another program from an array map and call into it.
Entries in that particular map type can only be eBPF program fds.
I think a paragraph on tail calls could be added, if needed, as a
follow-up once we have an initial man page included in the tree.
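
To make the two cases concrete, here is a rough sketch of how the
map-creation attributes could be filled in. The struct flow_key layout
and both function names are invented for illustration. The 4-byte key
and value sizes for the program array reflect how I understand the
kernel checks it, so treat them as an assumption.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct key. The kernel compares keys as raw bytes, so
 * a layout without padding (or with zeroed padding) is assumed. */
struct flow_key {
    uint32_t saddr;
    uint32_t daddr;
    uint16_t sport;
    uint16_t dport;
};

/* Hash map: struct key, 64-bit counter value, both opaque to the map. */
static void fill_flow_map_attr(union bpf_attr *attr, uint32_t max_entries)
{
    memset(attr, 0, sizeof(*attr));
    attr->map_type = BPF_MAP_TYPE_HASH;
    attr->key_size = sizeof(struct flow_key);
    attr->value_size = sizeof(uint64_t);
    attr->max_entries = max_entries;
}

/* Program array for tail calls: index key, values are program fds. */
static void fill_prog_array_attr(union bpf_attr *attr, uint32_t slots)
{
    memset(attr, 0, sizeof(*attr));
    attr->map_type = BPF_MAP_TYPE_PROG_ARRAY;
    attr->key_size = sizeof(uint32_t);    /* array index */
    attr->value_size = sizeof(uint32_t);  /* eBPF program fd */
    attr->max_entries = slots;
}
```

Either attr would then be passed to bpf(BPF_MAP_CREATE, ...).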

> BPF maps are a generic data structure for storage of different data types.
> A user process can create multiple maps (with key/value-pairs being
> opaque bytes of data) and access them via file descriptors.
> BPF programs can access maps from inside the kernel in parallel.
> It's up to the user process and BPF program to decide what they store
> inside maps.
> .P
> BPF programs are similar to kernel modules.
> They are loaded by the user
> process and automatically unloaded when the process exits.

Generally that's true. Btw, in the 4.1 kernel, tc(8) also got support
for eBPF classifiers and actions, and there it's slightly different:
in tc, we load the programs, maps, etc., and push the eBPF program fd
down so that the kernel holds a reference on the program itself.

Thus, there, the program fd that the application owns is gone when the
application terminates, but the eBPF program itself still lives on
inside the kernel. But perhaps it's already too much detail to mention
here ...

> Each BPF program is a set of instructions that is safe to run until
> its completion.
> The in-kernel BPF verifier statically determines that the program
> terminates and is safe to execute.
> .\" FIXME In the following sentence, what does "takes hold" mean?

It takes a reference, meaning that maps cannot disappear under us while
the eBPF program that uses them is still alive in the kernel.

> During verification, the program takes hold of maps that it intends to use,
> so selected maps cannot be removed until the program is unloaded.
>
> BPF programs can be attached to different events.
> .\" FIXME: In the next sentence , "packets" are not "events". What
> .\" do you really mean to say here? ("the arrival of a network packet"?)

Socket filters are meant here: setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, ...);

> These events can be packets, tracing
> events, and other types that may be added in the future.

There's already one more type worth mentioning: tc classifier and actions
(tc/tc_bpf.{c,h}).

> A new event triggers execution of the BPF program, which
> may store information about the event in the maps.
> Beyond storing data, BPF programs may call into in-kernel helper functions.

I would mention that these in-kernel helpers are a fixed set of functions
provided by the kernel (linux/bpf.h: BPF_FUNC_*), so you cannot call into
arbitrary ones.
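
For illustration, a call to one of these helpers is a single
instruction whose immediate is the BPF_FUNC_* id. A small sketch (the
function name call_helper is made up; struct bpf_insn and enum
bpf_func_id come from linux/bpf.h):

```c
#include <linux/bpf.h>

/* Encode "call helper <id>": opcode BPF_JMP | BPF_CALL, id in imm.
 * The verifier rejects ids outside the fixed set the kernel offers
 * for the given program type. */
static struct bpf_insn call_helper(enum bpf_func_id id)
{
    struct bpf_insn insn = {
        .code = BPF_JMP | BPF_CALL,
        .dst_reg = 0,
        .src_reg = 0,
        .off = 0,
        .imm = (int)id,
    };
    return insn;
}
```

For example, call_helper(BPF_FUNC_map_lookup_elem) gives the call used
to read a map element from inside a program, with the arguments
expected in registers r1 and r2.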

> The same program can be attached to multiple events and different programs can
> access the same map:
> .\" FIXME Can maps be shared between processes? (E.g., what happens
> .\"       when fork() is called?)

Yes, they can. Map file descriptors can also be transferred via socket
passing (SCM_RIGHTS). tc does that: since tc terminates after
configuring the kernel, it can pass the map fds via a Unix domain
socket to a BPF agent that becomes the new owner. That agent can, for
example, be a shell, so this is very well possible.
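
A minimal sketch of such fd passing over a connected Unix domain
socket. This is not tc's actual code: send_fd and recv_fd are
hypothetical names, and the same pattern works for any file descriptor,
not only map fds.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send fd over the Unix domain socket sock; 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F';  /* at least one byte of real data is sent along */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* forces correct alignment of buf */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from sock; returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets its own descriptor referring to the same map, so the
map stays alive after the sender exits.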