lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-Id: <1bb51d4d3bc272e787bd963671047712bed97ada.1432246443.git.daniel@iogearbox.net>
Date:	Fri, 22 May 2015 00:17:01 +0200
From:	Daniel Borkmann <daniel@...earbox.net>
To:	stephen@...workplumber.org
Cc:	ast@...mgrid.com, netdev@...r.kernel.org,
	Daniel Borkmann <daniel@...earbox.net>
Subject: [PATCH iproute2] tc: bpf: add initial man page

Add a start of a man-page to the misc section as a reference and
guide on (e)BPF classifier and actions. Given that tc is only tersely
documented, this is provided in the hope that users will have an
easier getting started with tc and (e)BPF. And, that there's now more
incentive for others to also start documenting their classifier and
actions as well. ;)

Signed-off-by: Daniel Borkmann <daniel@...earbox.net>
---
 man/man8/Makefile |   2 +-
 man/man8/tc-bpf.8 | 924 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 man/man8/tc.8     |   1 +
 3 files changed, 926 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/tc-bpf.8

diff --git a/man/man8/Makefile b/man/man8/Makefile
index 152747a..3df9961 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -1,7 +1,7 @@
 TARGETS = ip-address.8 ip-link.8 ip-route.8
 
 MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 rtmon.8 ss.8 \
-	tc.8 tc-bfifo.8 tc-cbq.8 tc-cbq-details.8 tc-choke.8 tc-codel.8 \
+	tc.8 tc-bfifo.8 tc-bpf.8 tc-cbq.8 tc-cbq-details.8 tc-choke.8 tc-codel.8 \
 	tc-drr.8 tc-ematch.8 tc-fq_codel.8 tc-hfsc.8 tc-htb.8 tc-pie.8 \
 	tc-mqprio.8 tc-netem.8 tc-pfifo.8 tc-pfifo_fast.8 tc-prio.8 tc-red.8 \
 	tc-sfb.8 tc-sfq.8 tc-stab.8 tc-tbf.8 \
diff --git a/man/man8/tc-bpf.8 b/man/man8/tc-bpf.8
new file mode 100644
index 0000000..2c02ab2
--- /dev/null
+++ b/man/man8/tc-bpf.8
@@ -0,0 +1,924 @@
+.TH "BPF classifier and actions in tc" 8 "18 May 2015" "iproute2" "Linux"
+.SH NAME
+BPF \- BPF programmable classifier and actions for ingress/egress
+queueing disciplines
+.SH SYNOPSIS
+.SS eBPF classifier (filter) or action:
+.B tc filter ... bpf
+[
+.B object-file
+OBJ_FILE ] [
+.B section
+CLS_NAME ] [
+.B export
+UDS_FILE ] [
+.B verbose
+] [
+.B police
+POLICE_SPEC ] [
+.B action
+ACTION_SPEC ] [
+.B classid
+CLASSID ]
+.br
+.B tc action ... bpf
+[
+.B object-file
+OBJ_FILE ] [
+.B section
+CLS_NAME ] [
+.B export
+UDS_FILE ] [
+.B verbose
+]
+
+.SS cBPF classifier (filter) or action:
+.B tc filter ... bpf
+[
+.B bytecode-file
+BPF_FILE |
+.B bytecode
+BPF_BYTECODE ] [
+.B police
+POLICE_SPEC ] [
+.B action
+ACTION_SPEC ] [
+.B classid
+CLASSID ]
+.br
+.B tc action ... bpf
+[
+.B bytecode-file
+BPF_FILE |
+.B bytecode
+BPF_BYTECODE ]
+
+.SH DESCRIPTION
+
+Extended Berkeley Packet Filter (
+.B eBPF
+) and classic Berkeley Packet Filter
+(originally known as BPF, for better distinction referred to as
+.B cBPF
+here) are both available as a fully programmable and highly efficient
+classifier and actions. They both offer a minimal instruction set for
+implementing small programs which can safely be loaded into the kernel
+and thus executed in a tiny virtual machine from kernel space. An in-kernel
+verifier guarantees that a specified program always terminates and neither
+crashes nor leaks data from the kernel.
+
+In Linux, it's generally considered that eBPF is the successor of cBPF.
+The kernel internally transforms cBPF expressions into eBPF expressions and
+executes the latter. Execution of them can be performed in an interpreter
+or at setup time, they can be just-in-time compiled (JIT'ed) to run as
+native machine code. Currently, x86_64, ARM64 and s390 architectures have
+eBPF JIT support, whereas PPC, SPARC, ARM and MIPS have cBPF, but did not
+(yet) switch to eBPF JIT support.
+
+eBPF's instruction set has similar underlying principles as the cBPF
+instruction set, it however is modelled closer to the underlying
+architecture to better mimic native instruction sets with the aim to
+achieve a better run-time performance. It is designed to be JIT'ed with
+a one to one mapping, which can also open up the possibility for compilers
+to generate optimized eBPF code through an eBPF backend that performs
+almost as fast as natively compiled code. Given that LLVM provides such
+an eBPF backend, eBPF programs can therefore easily be programmed in a
+subset of the C language. Other than that, eBPF infrastructure also comes
+with a construct called "maps". eBPF maps are key/value stores that are
+shared between multiple eBPF programs, but also between eBPF programs and
+user space applications.
+
+For the traffic control subsystem, classifier and actions that can be
+attached to ingress and egress qdiscs can be written in eBPF or cBPF. The
+advantage over other classifier and actions is that eBPF/cBPF provides the
+generic framework, while users can implement their highly specialized use
+cases efficiently. This means that the classifier or action written that
+way will not suffer from feature bloat, and can therefore execute its task
+highly efficient. It allows for non-linear classification and even merging
+the action part into the classification. Combined with efficient eBPF map
+data structures, user space can push new policies like classids into the
+kernel without reloading a classifier, or it can gather statistics that
+are pushed into one map and use another one for dynamically load balancing
+traffic based on the determined load, just to provide a few examples.
+
+.SH PARAMETERS
+.SS object-file
+points to an object file that has an executable and linkable format (ELF)
+and contains eBPF opcodes and eBPF map definitions. The LLVM compiler
+infrastructure with
+.B clang(1)
+as a C language front end is one project that supports emitting eBPF object
+files that can be passed to the eBPF classifier (more details in the
+.B EXAMPLES
+section). This option is mandatory when an eBPF classifier or action is
+to be loaded.
+
+.SS section
+is the name of the ELF section from the object file, where the eBPF
+classifier or action resides. By default the section name for the
+classifier is called "classifier", and for the action "action". Given
+that a single object file can contain multiple classifier and actions,
+the corresponding section name needs to be specified, if it differs
+from the defaults.
+
+.SS export
+points to a Unix domain socket file. In case the eBPF object file also
+contains a section named "maps" with eBPF map specifications, then the
+map file descriptors can be handed off via the Unix domain socket to
+an eBPF "agent" herding all descriptors after tc lifetime. This can be
+some third party application implementing the IPC counterpart for the
+import, that uses them for calling into
+.B bpf(2)
+system call to read out or update eBPF map data from user space, for
+example, for monitoring purposes or to push down new policies.
+
+.SS verbose
+if set, it will dump the eBPF verifier output, even if loading the eBPF
+program was successful. By default, only on error, the verifier log is
+being emitted to the user.
+
+.SS police
+is an optional parameter for an eBPF/cBPF classifier that specifies a
+police in
+.B tc(1)
+which is attached to the classifier, for example, on an ingress qdisc.
+
+.SS action
+is an optional parameter for an eBPF/cBPF classifier that specifies a
+subsequent action in
+.B tc(1)
+which is attached to a classifier.
+
+.SS classid
+.SS flowid
+provides the default traffic control class identifier for this eBPF/cBPF
+classifier. The default class identifier can also be overwritten by the
+return code of the eBPF/cBPF program. A default return code of
+.B -1
+specifies the here provided default class identifier to be used. A return
+code of the eBPF/cBPF program of 0 implies that no match took place, and
+a return code other than these two will override the default classid. This
+allows for efficient, non-linear classification with only a single eBPF/cBPF
+program as opposed to having multiple individual programs for various class
+identifiers which would need to reparse packet contents.
+
+.SS bytecode
+is being used for loading cBPF classifier and actions only. The cBPF bytecode
+is directly passed as a text string in the form of
+.B \'s,c t f k,c t f k,c t f k,...\'
+, where
+.B s
+denotes the number of subsequent 4-tuples. One such 4-tuple consists of
+.B c t f k
+decimals, where
+.B c
+represents the cBPF opcode,
+.B t
+the jump true offset target,
+.B f
+the jump false offset target and
+.B k
+the immediate constant/literal. There are various tools that generate code
+in this loadable format, for example,
+.B bpf_asm
+that ships with the Linux kernel source tree under
+.B tools/net/
+, so it is certainly not expected to hack this by hand. The
+.B bytecode
+or
+.B bytecode-file
+option is mandatory when a cBPF classifier or action is to be loaded.
+
+.SS bytecode-file
+also being used to load a cBPF classifier or action. It's effectively the
+same as
+.B bytecode
+only that the cBPF bytecode is not passed directly via command line, but
+rather resides in a text file.
+
+.SH EXAMPLES
+.SS eBPF TOOLING
+A full blown example including eBPF agent code can be found inside the
+iproute2 source package under:
+.B examples/bpf/
+
+As prerequisites, the kernel needs to have the eBPF system call namely
+.B bpf(2)
+enabled and ships with
+.B cls_bpf
+and
+.B act_bpf
+kernel modules for the traffic control subsystem. To enable eBPF/eBPF JIT
+support, depending which of the two the given architecture supports:
+
+.in +4n
+.B echo 1 > /proc/sys/net/core/bpf_jit_enable
+.in
+
+A given restricted C file can be compiled via LLVM as:
+
+.in +4n
+.B clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o
+.in
+
+The compiler invocation might still simplify in future, so for now,
+it's quite handy to alias this construct in one way or another, for
+example:
+.in +4n
+.nf
+.sp
+__bcc() {
+        clang -O2 -emit-llvm -c $1 -o - | \\
+        llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
+}
+
+alias bcc=__bcc
+.fi
+.in
+
+A minimal, stand-alone unit, which matches on all traffic with the
+default classid (return code of -1) looks like:
+
+.in +4n
+.nf
+.sp
+#include <linux/bpf.h>
+
+#ifndef __section
+# define __section(x)  __attribute__((section(x), used))
+#endif
+
+__section("classifier") int cls_main(struct __sk_buff *skb)
+{
+        return -1;
+}
+
+char __license[] __section("license") = "GPL";
+.fi
+.in
+
+More examples can be found further below in subsection
+.B eBPF PROGRAMMING
+as focus here will be on tooling.
+
+There can be various other sections, for example, also for actions.
+Thus, an object file in eBPF can contain multiple entrance points.
+Always a specific entrance point, however, must be specified when
+configuring with tc. A license must be part of the restricted C code
+and the license string syntax is the same as with Linux kernel modules.
+The kernel reserves its right that some eBPF helper functions can be
+restricted to GPL compatible licenses only, and thus may reject a program
+from loading into the kernel when such a license mismatch occurs.
+
+The resulting object file from the compilation can be inspected with
+the usual set of tools that also operate on normal object files, for
+example
+.B objdump(1)
+for inspecting ELF section headers:
+
+.in +4n
+.nf
+.sp
+objdump -h bpf.o
+[...]
+3 classifier    000007f8  0000000000000000  0000000000000000  00000040  2**3
+                CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
+4 action-mark   00000088  0000000000000000  0000000000000000  00000838  2**3
+                CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
+5 action-rand   00000098  0000000000000000  0000000000000000  000008c0  2**3
+                CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
+6 maps          00000030  0000000000000000  0000000000000000  00000958  2**2
+                CONTENTS, ALLOC, LOAD, DATA
+7 license       00000004  0000000000000000  0000000000000000  00000988  2**0
+                CONTENTS, ALLOC, LOAD, DATA
+[...]
+.fi
+.in
+
+Adding an eBPF classifier from an object file that contains a classifier
+in the default ELF section is trivial (note that instead of "object-file"
+also shortcuts such as "obj" can be used):
+
+.in +4n
+.B bcc bpf.c
+.br
+.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1
+.in
+
+In case the classifier resides in ELF section "mycls", then that same
+command needs to be invoked as:
+
+.in +4n
+.B tc filter add dev em1 parent 1: bpf obj bpf.o sec mycls flowid 1:1
+.in
+
+Dumping the classifier configuration will tell the location of the
+classifier, in other words that it's from object file "bpf.o" under
+section "mycls":
+
+.in +4n
+.B tc filter show dev em1
+.br
+.B filter parent 1: protocol all pref 49152 bpf
+.br
+.B filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid 1:1 bpf.o:[mycls]
+.in
+
+The same program can also be installed on ingress qdisc side as opposed
+to egress ...
+
+.in +4n
+.B tc qdisc add dev em1 handle ffff: ingress
+.br
+.B tc filter add dev em1 parent ffff: bpf obj bpf.o sec mycls flowid ffff:1
+.in
+
+\&... and again dumped from there:
+
+.in +4n
+.B tc filter show dev em1 parent ffff:
+.br
+.B filter protocol all pref 49152 bpf
+.br
+.B filter protocol all pref 49152 bpf handle 0x1 flowid ffff:1 bpf.o:[mycls]
+.in
+
+Attaching a classifier and action on ingress has the restriction that
+it doesn't have an actual underlying queueing discipline. What ingress
+can do is to classify, mangle, redirect or drop packets. When queueing
+is required on ingress side, then ingress must redirect packets to the
+.B ifb
+device, otherwise policing can be used. Moreover, ingress can be used to
+have an early drop point of unwanted packets before they hit upper layers
+of the networking stack, perform network accounting with eBPF maps that
+could be shared with egress, or have an early mangle and/or redirection
+point to different networking devices.
+
+Multiple eBPF actions and classifier can be placed into a single
+object file within various sections. In that case, non-default section
+names must be provided, which is the case for both actions in this
+example:
+
+.in +4n
+.B tc filter add dev em1 parent 1: bpf obj bpf.o flowid 1:1 \e
+.br
+.in +25n
+.B                          action bpf obj bpf.o sec action-mark \e
+.br
+.B                          action bpf obj bpf.o sec action-rand ok
+.in -25n
+.in -4n
+
+The advantage of this is that the classifier and the two actions can
+then share eBPF maps with each other, if implemented in the programs.
+
+In order to access eBPF maps from user space beyond
+.B tc(8)
+setup lifetime, the ownership can be transferred to an eBPF agent via
+Unix domain sockets. There are two possibilities for implementing this:
+
+.B 1)
+implementation of an own eBPF agent that takes care of setting up
+the Unix domain socket and implementing the protocol that
+.B tc(8)
+dictates. A code example of this can be found inside the iproute2
+source package under:
+.B examples/bpf/
+
+.B 2)
+use
+.B tc exec
+for transferring the eBPF map file descriptors through a Unix domain
+socket, and spawning an application such as
+.B sh(1)
+\&. This approach's advantage is that tc will place the file descriptors
+into the environment and thus make them available just like stdin, stdout,
+stderr file descriptors, meaning, in case user applications run from within
+this fd-owner shell, they can terminate and restart without loosing eBPF
+maps file descriptors. Example invocation with the previous classifier and
+action mixture:
+
+.in +4n
+.B tc exec bpf imp /tmp/bpf
+.br
+.B tc filter add dev em1 parent 1: bpf obj bpf.o exp /tmp/bpf flowid 1:1 \e
+.br
+.in +25n
+.B                          action bpf obj bpf.o sec action-mark \e
+.br
+.B                          action bpf obj bpf.o sec action-rand ok
+.in -25n
+.in -4n
+
+Assuming that eBPF maps are shared with classifier and actions, it's
+enough to export them once, for example, from within the classifier
+or action command. tc will setup all eBPF map file descriptors at the
+time when the object file is first parsed.
+
+When a shell has been spawned, the environment will have a couple of
+eBPF related variables. BPF_NUM_MAPS provides the total number of maps
+that have been transferred over the Unix domain socket. BPF_MAP<X>'s
+value is the file descriptor number that can be accessed in eBPF agent
+applications, in other words, it can directly be used as the file
+descriptor value for the
+.B bpf(2)
+system call to retrieve or alter eBPF map values. <X> denotes the
+identifier of the eBPF map. It corresponds to the
+.B id
+member of
+.B struct bpf_elf_map
+\& from the tc eBPF map specification.
+
+The environment in this example looks as follows:
+
+.in +4n
+.nf
+.sp
+sh# env | grep BPF
+    BPF_NUM_MAPS=3
+    BPF_MAP1=6
+    BPF_MAP0=5
+    BPF_MAP2=7
+sh# ls -la /proc/self/fd
+    [...]
+    lrwx------. 1 root root 64 Apr 14 16:46 5 -> anon_inode:bpf-map
+    lrwx------. 1 root root 64 Apr 14 16:46 6 -> anon_inode:bpf-map
+    lrwx------. 1 root root 64 Apr 14 16:46 7 -> anon_inode:bpf-map
+sh# my_bpf_agent
+.fi
+.in
+
+eBPF agents are very useful in that they can prepopulate eBPF maps from
+user space, monitor statistics via maps and based on that feedback, for
+example, rewrite classids in eBPF map values during runtime. Given that eBPF
+agents are implemented as normal applications, they can also dynamically
+receive traffic control policies from external controllers and thus push
+them down into eBPF maps to dynamically adapt to network conditions. Moreover,
+eBPF maps can also be shared with other eBPF program types (e.g. tracing),
+thus very powerful combination can therefore be implemented.
+
+.SS eBPF PROGRAMMING
+
+eBPF classifier and actions are being implemented in restricted C syntax
+(in future, there could additionally be new language frontends supported).
+
+The header file
+.B linux/bpf.h
+provides eBPF helper functions that can be called from an eBPF program.
+This man page will only provide two minimal, stand-alone examples, have a
+look at
+.B examples/bpf
+from the iproute2 source package for a fully fledged flow dissector
+example to better demonstrate some of the possibilities with eBPF.
+
+Supported 32 bit classifier return codes from the C program and their meanings:
+.in +4n
+.B 0
+, denotes a mismatch
+.br
+.B -1
+, denotes the default classid configured from the command line
+.br
+.B else
+, everything else will override the default classid to provide a facility for
+non-linear matching
+.in
+
+Supported 32 bit action return codes from the C program and their meanings (
+.B linux/pkt_cls.h
+):
+.in +4n
+.B TC_ACT_OK (0)
+, will terminate the packet processing pipeline and allows the packet to
+proceed
+.br
+.B TC_ACT_SHOT (2)
+, will terminate the packet processing pipeline and drops the packet
+.br
+.B TC_ACT_UNSPEC (-1)
+, will use the default action configured from tc (similarly as returning
+.B -1
+from a classifier)
+.br
+.B TC_ACT_PIPE (3)
+, will iterate to the next action, if available
+.br
+.B TC_ACT_RECLASSIFY (1)
+, will terminate the packet processing pipeline and start classification
+from the beginning
+.br
+.B else
+, everything else is an unspecified return code
+.in
+
+Both classifier and action return codes are supported in eBPF and cBPF
+programs.
+
+To demonstrate restricted C syntax, a minimal toy classifier example is
+provided, which assumes that egress packets, for instance originating
+from a container, have previously been marked in interval [0, 255]. The
+program keeps statistics on different marks for user space and maps the
+classid to the root qdisc with the marking itself as the minor handle:
+
+.in +4n
+.nf
+.sp
+#include <stdint.h>
+#include <asm/types.h>
+
+#include <linux/bpf.h>
+#include <linux/pkt_sched.h>
+
+#include "helpers.h"
+
+struct tuple {
+        long packets;
+        long bytes;
+};
+
+#define BPF_MAP_ID_STATS        1 /* agent's map identifier */
+#define BPF_MAX_MARK            256
+
+struct bpf_elf_map __section("maps") map_stats = {
+        .type           =       BPF_MAP_TYPE_ARRAY,
+        .id             =       BPF_MAP_ID_STATS,
+        .size_key       =       sizeof(uint32_t),
+        .size_value     =       sizeof(struct tuple),
+        .max_elem       =       BPF_MAX_MARK,
+};
+
+static inline void cls_update_stats(const struct __sk_buff *skb,
+                                    uint32_t mark)
+{
+        struct tuple *tu;
+
+        tu = bpf_map_lookup_elem(&map_stats, &mark);
+        if (likely(tu)) {
+                __sync_fetch_and_add(&tu->packets, 1);
+                __sync_fetch_and_add(&tu->bytes, skb->len);
+        }
+}
+
+__section("cls") int cls_main(struct __sk_buff *skb)
+{
+        uint32_t mark = skb->mark;
+
+        if (unlikely(mark >= BPF_MAX_MARK))
+                return 0;
+
+        cls_update_stats(skb, mark);
+
+        return TC_H_MAKE(TC_H_ROOT, mark);
+}
+
+char __license[] __section("license") = "GPL";
+.fi
+.in
+
+Another small example is a port redirector which demuxes destination port
+80 into the interval [8080, 8087] steered by RSS, that can then be attached
+to ingress qdisc. The exercise of adding the egress counterpart and IPv6
+support is left to the reader:
+
+.in +4n
+.nf
+.sp
+#include <asm/types.h>
+#include <asm/byteorder.h>
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+
+#include "helpers.h"
+
+static inline void set_tcp_dport(struct __sk_buff *skb, int nh_off,
+                                 __u16 old_port, __u16 new_port)
+{
+        bpf_l4_csum_replace(skb, nh_off + offsetof(struct tcphdr, check),
+                            old_port, new_port, sizeof(new_port));
+        bpf_skb_store_bytes(skb, nh_off + offsetof(struct tcphdr, dest),
+                            &new_port, sizeof(new_port), 0);
+}
+
+static inline int lb_do_ipv4(struct __sk_buff *skb, int nh_off)
+{
+        __u16 dport, dport_new = 8080, off;
+        __u8 ip_proto, ip_vl;
+
+        ip_proto = load_byte(skb, nh_off +
+                             offsetof(struct iphdr, protocol));
+        if (ip_proto != IPPROTO_TCP)
+                return 0;
+
+        ip_vl = load_byte(skb, nh_off);
+        if (likely(ip_vl == 0x45))
+                nh_off += sizeof(struct iphdr);
+        else
+                nh_off += (ip_vl & 0xF) << 2;
+
+        dport = load_half(skb, nh_off + offsetof(struct tcphdr, dest));
+        if (dport != 80)
+                return 0;
+
+        off = skb->queue_mapping & 7;
+        set_tcp_dport(skb, nh_off - BPF_LL_OFF, __constant_htons(80),
+                      __cpu_to_be16(dport_new + off));
+        return -1;
+}
+
+__section("lb") int lb_main(struct __sk_buff *skb)
+{
+        int ret = 0, nh_off = BPF_LL_OFF + ETH_HLEN;
+
+        if (likely(skb->protocol == __constant_htons(ETH_P_IP)))
+                ret = lb_do_ipv4(skb, nh_off);
+
+        return ret;
+}
+
+char __license[] __section("license") = "GPL";
+.fi
+.in
+
+The related helper header file
+.B helpers.h
+in both examples was:
+
+.in +4n
+.nf
+.sp
+/* Misc helper macros. */
+#define __section(x) __attribute__((section(x), used))
+#define offsetof(x, y) __builtin_offsetof(x, y)
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+/* Used map structure */
+struct bpf_elf_map {
+    __u32 type;
+    __u32 size_key;
+    __u32 size_value;
+    __u32 max_elem;
+    __u32 id;
+};
+
+/* Some used BPF function calls. */
+static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
+                                  int len, int flags) =
+      (void *) BPF_FUNC_skb_store_bytes;
+static int (*bpf_l4_csum_replace)(void *ctx, int off, int from,
+                                  int to, int flags) =
+      (void *) BPF_FUNC_l4_csum_replace;
+static void *(*bpf_map_lookup_elem)(void *map, void *key) =
+      (void *) BPF_FUNC_map_lookup_elem;
+
+/* Some used BPF intrinsics. */
+unsigned long long load_byte(void *skb, unsigned long long off)
+    asm ("llvm.bpf.load.byte");
+unsigned long long load_half(void *skb, unsigned long long off)
+    asm ("llvm.bpf.load.half");
+.fi
+.in
+
+Best practice, we recommend to only have a single eBPF classifier loaded
+in tc and perform
+.B all
+necessary matching and mangling from there instead of a list of individual
+classifier and separate actions. Just a single classifier tailored for a
+given use-case will be most efficient to run.
+
+.SS eBPF DEBUGGING
+
+Both tc
+.B filter
+and
+.B action
+commands for
+.B bpf
+support an optional
+.B verbose
+parameter that can be used to inspect the eBPF verifier log. It is dumped
+by default in case of an error.
+
+In case the eBPF/cBPF JIT compiler has been enabled, it can also be
+instructed to emit a debug output of the resulting opcode image into
+the kernel log, which can be read via
+.B dmesg(1)
+:
+
+.in +4n
+.B echo 2 > /proc/sys/net/core/bpf_jit_enable
+.in
+
+The Linux kernel source tree ships additionally under
+.B tools/net/
+a small helper called
+.B bpf_jit_disasm
+that reads out the opcode image dump from the kernel log and dumps the
+resulting disassembly:
+
+.in +4n
+.B bpf_jit_disasm -o
+.in
+
+Other than that, the Linux kernel also contains an extensive eBPF/cBPF
+test suite module called
+.B test_bpf
+\&. Upon ...
+
+.in +4n
+.B modprobe test_bpf
+.in
+
+\&... it performs a diversity of test cases and dumps the results into
+the kernel log that can be inspected with
+.B dmesg(1)
+\&. The results can differ depending on whether the JIT compiler is enabled
+or not. In case of failed test cases, the module will fail to load. In
+such cases, we urge you to file a bug report to the related JIT authors,
+Linux kernel and networking mailing lists.
+
+.SS cBPF
+
+Although we generally recommend switching to implementing
+.B eBPF
+classifier and actions, for the sake of completeness, a few words on how to
+program in cBPF will be lost here.
+
+Likewise, the
+.B bpf_jit_enable
+switch can be enabled as mentioned already. Tooling such as
+.B bpf_jit_disasm
+is also independent whether eBPF or cBPF code is being loaded.
+
+Unlike in eBPF, classifier and action are not implemented in restricted C,
+but rather in a minimal assembler-like language or with the help of other
+tooling.
+
+The raw interface with tc takes opcodes directly. For example, the most
+minimal classifier matching on every packet resulting in the default
+classid of 1:1 looks like:
+
+.in +4n
+.B tc filter add dev em1 parent 1: bpf bytecode '1,6 0 0 4294967295,' flowid 1:1
+.in
+
+The first decimal of the bytecode sequence denotes the number of subsequent
+4-tuples of cBPF opcodes. As mentioned, such a 4-tuple consists of
+.B c t f k
+decimals, where
+.B c
+represents the cBPF opcode,
+.B t
+the jump true offset target,
+.B f
+the jump false offset target and
+.B k
+the immediate constant/literal. Here, this denotes an unconditional return
+from the program with immediate value of -1.
+
+Thus, for egress classification, Willem de Bruijn implemented a minimal stand-alone
+helper tool under the GNU General Public License version 2 for
+.B iptables(8)
+BPF extension, which abuses the
+.B libpcap
+internal classic BPF compiler, his code derived here for usage with
+.B tc(8)
+:
+
+.in +4n
+.nf
+.sp
+#include <pcap.h>
+#include <stdio.h>
+
+int main(int argc, char **argv)
+{
+        struct bpf_program prog;
+        struct bpf_insn *ins;
+        int i, ret, dlt = DLT_RAW;
+
+        if (argc < 2 || argc > 3)
+                return 1;
+        if (argc == 3) {
+                dlt = pcap_datalink_name_to_val(argv[1]);
+                if (dlt == -1)
+                        return 1;
+        }
+
+        ret = pcap_compile_nopcap(-1, dlt, &prog, argv[argc - 1],
+                                  1, PCAP_NETMASK_UNKNOWN);
+        if (ret)
+                return 1;
+
+        printf("%d,", prog.bf_len);
+        ins = prog.bf_insns;
+
+        for (i = 0; i < prog.bf_len - 1; ++ins, ++i)
+                printf("%u %u %u %u,", ins->code,
+                       ins->jt, ins->jf, ins->k);
+        printf("%u %u %u %u",
+               ins->code, ins->jt, ins->jf, ins->k);
+
+        pcap_freecode(&prog);
+        return 0;
+}
+.fi
+.in
+
+Given this small helper, any
+.B tcpdump(8)
+filter expression can be abused as a classifier where a match will
+result in the default classid:
+
+.in +4n
+.B bpftool EN10MB 'tcp[tcpflags] & tcp-syn != 0' > /var/bpf/tcp-syn
+.br
+.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
+.in
+
+Basically, such a minimal generator is equivalent to:
+
+.in +4n
+.B tcpdump -iem1 -ddd 'tcp[tcpflags] & tcp-syn != 0' | tr '\\n' ',' > /var/bpf/tcp-syn
+.in
+
+Since
+.B libpcap
+does not support all Linux' specific cBPF extensions in its compiler, the
+Linux kernel also ships under
+.B tools/net/
+a minimal BPF assembler called
+.B bpf_asm
+for providing full control. For detailed syntax and semantics on implementing
+such programs by hand, see references under
+.B FURTHER READING
+\&.
+
+Trivial toy example in
+.B bpf_asm
+for classifying IPv4/TCP packets, saved in a text file called
+.B foobar
+:
+
+.in +4n
+.nf
+.sp
+ldh [12]
+jne #0x800, drop
+ldb [23]
+jneq #6, drop
+ret #-1
+drop: ret #0
+.fi
+.in
+
+Similarly, such a classifier can be loaded as:
+
+.in +4n
+.B bpf_asm foobar > /var/bpf/tcp-syn
+.br
+.B tc filter add dev em1 parent 1: bpf bytecode-file /var/bpf/tcp-syn flowid 1:1
+.in
+
+For BPF classifiers, the Linux kernel provides additionally under
+.B tools/net/
+a small BPF debugger called
+.B bpf_dbg
+, which can be used to test a classifier against pcap files, single-step
+or add various breakpoints into the classifier program and dump register
+contents during runtime.
+
+Implementing an action in classic BPF is rather limited in the sense that
+packet mangling is not supported. Therefore, it's generally recommended to
+make the switch to eBPF, whenever possible.
+
+.SH FURTHER READING
+Further and more technical details about the BPF architecture can be found
+in the Linux kernel source tree under
+.B Documentation/networking/filter.txt
+\&.
+
+Further details on eBPF
+.B tc(8)
+examples can be found in the iproute2 source
+tree under
+.B examples/bpf/
+\&.
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-ematch (8)
+.BR bpf (2)
+.BR bpf (4)
+
+.SH AUTHORS
+Manpage written by Daniel Borkmann.
+
+Please report corrections or improvements to the Linux kernel networking
+mailing list:
+.B <netdev@...r.kernel.org>
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 434fe6c..feafa05 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -558,6 +558,7 @@ Shows classes as ASCII graph with stats info under each class.
 was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .SH SEE ALSO
 .BR tc-bfifo (8),
+.BR tc-bpf (8),
 .BR tc-cbq (8),
 .BR tc-choke (8),
 .BR tc-codel (8),
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ