netdev - Re: [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers (01-11)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 10 Apr 2018 10:56:07 -0700
From:   Alexei Starovoitov <alexei.starovoitov@...il.com>
To:     Quentin Monnet <quentin.monnet@...ronome.com>
Cc:     daniel@...earbox.net, ast@...nel.org, netdev@...r.kernel.org,
        oss-drivers@...ronome.com, linux-doc@...r.kernel.org,
        linux-man@...r.kernel.org
Subject: Re: [RFC bpf-next v2 2/8] bpf: add documentation for eBPF helpers
 (01-11)

On Tue, Apr 10, 2018 at 03:41:51PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Alexei:
> 
> - bpf_map_lookup_elem()
> - bpf_map_update_elem()
> - bpf_map_delete_elem()
> - bpf_probe_read()
> - bpf_ktime_get_ns()
> - bpf_trace_printk()
> - bpf_skb_store_bytes()
> - bpf_l3_csum_replace()
> - bpf_l4_csum_replace()
> - bpf_tail_call()
> - bpf_clone_redirect()
> 
> Cc: Alexei Starovoitov <ast@...nel.org>
> Signed-off-by: Quentin Monnet <quentin.monnet@...ronome.com>
> ---
>  include/uapi/linux/bpf.h | 199 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 199 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 45f77f01e672..2bc653a3a20f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -381,6 +381,205 @@ union bpf_attr {
>   * intentional, removing them would break paragraphs for rst2man.
>   *
>   * Start of BPF helper function descriptions:
> + *
> + * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
> + * 	Description
> + * 		Perform a lookup in *map* for an entry associated to *key*.
> + * 	Return
> + * 		Map value associated to *key*, or **NULL** if no entry was
> + * 		found.
> + *
> + * int bpf_map_update_elem(struct bpf_map *map, void *key, void *value, u64 flags)
> + * 	Description
> + * 		Add or update the value of the entry associated to *key* in
> + * 		*map* with *value*. *flags* is one of:
> + *
> + * 		**BPF_NOEXIST**
> + * 			The entry for *key* must not exist in the map.
> + * 		**BPF_EXIST**
> + * 			The entry for *key* must already exist in the map.
> + * 		**BPF_ANY**
> + * 			No condition on the existence of the entry for *key*.
> + *
> + * 		These flags are only useful for maps of type
> + * 		**BPF_MAP_TYPE_HASH**. For all other map types, **BPF_ANY**
> + * 		should be used.

I think that's not entirely accurate.
The flags work as expected for all other map types as well
and for lru map, sockmap, map in map the flags have practical use cases.

> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_map_delete_elem(struct bpf_map *map, void *key)
> + * 	Description
> + * 		Delete entry with *key* from *map*.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_probe_read(void *dst, u32 size, const void *src)
> + * 	Description
> + * 		For tracing programs, safely attempt to read *size* bytes from
> + * 		address *src* and store the data in *dst*.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * u64 bpf_ktime_get_ns(void)
> + * 	Description
> + * 		Return the time elapsed since system boot, in nanoseconds.
> + * 	Return
> + * 		Current *ktime*.
> + *
> + * int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
> + * 	Description
> + * 		This helper is a "printk()-like" facility for debugging. It
> + * 		prints a message defined by format *fmt* (of size *fmt_size*)
> + * 		to file *\/sys/kernel/debug/tracing/trace* from DebugFS, if
> + * 		available. It can take up to three additional **u64**
> + * 		arguments (as an eBPF helpers, the total number of arguments is
> + * 		limited to five). Each time the helper is called, it appends a
> + * 		line that looks like the following:
> + *
> + * 		::
> + *
> + * 			telnet-470   [001] .N.. 419421.045894: 0x00000001: BPF command: 2
> + *
> + * 		In the above:
> + *
> + * 			* ``telnet`` is the name of the current task.
> + * 			* ``470`` is the PID of the current task.
> + * 			* ``001`` is the CPU number on which the task is
> + * 			  running.
> + * 			* In ``.N..``, each character refers to a set of
> + * 			  options (whether irqs are enabled, scheduling
> + * 			  options, whether hard/softirqs are running, level of
> + * 			  preempt_disabled respectively). **N** means that
> + * 			  **TIF_NEED_RESCHED** and **PREEMPT_NEED_RESCHED**
> + * 			  are set.
> + * 			* ``419421.045894`` is a timestamp.
> + * 			* ``0x00000001`` is a fake value used by BPF for the
> + * 			  instruction pointer register.
> + * 			* ``BPF command: 2`` is the message formatted with
> + * 			  *fmt*.

the above depends on how trace_pipe was configured. It's a default
configuration for many, but would be good to explain this a bit better.

> + *
> + * 		The conversion specifiers supported by *fmt* are similar, but
> + * 		more limited than for printk(). They are **%d**, **%i**,
> + * 		**%u**, **%x**, **%ld**, **%li**, **%lu**, **%lx**, **%lld**,
> + * 		**%lli**, **%llu**, **%llx**, **%p**, **%s**. No modifier (size
> + * 		of field, padding with zeroes, etc.) is available, and the
> + * 		helper will silently fail if it encounters an unknown
> + * 		specifier.

This is not true. bpf_trace_printk will return -EINVAL for unknown specifier.

> + *
> + * 		Also, note that **bpf_trace_printk**\ () is slow, and should
> + * 		only be used for debugging purposes. For passing values to user
> + * 		space, perf events should be preferred.

please mention the giant dmesg warning that people will definitely
notice when they try to use this helper.

> + * 	Return
> + * 		The number of bytes written to the buffer, or a negative error
> + * 		in case of failure.
> + *
> + * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
> + * 	Description
> + * 		Store *len* bytes from address *from* into the packet
> + * 		associated to *skb*, at *offset*. *flags* are a combination of
> + * 		**BPF_F_RECOMPUTE_CSUM** (automatically recompute the
> + * 		checksum for the packet after storing the bytes) and
> + * 		**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + * 		**->swhash** and *skb*\ **->l4hash** to 0).
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_l3_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 size)
> + * 	Description
> + * 		Recompute the IP checksum for the packet associated to *skb*.
> + * 		Computation is incremental, so the helper must know the former
> + * 		value of the header field that was modified (*from*), the new
> + * 		value of this field (*to*), and the number of bytes (2 or 4)
> + * 		for this field, stored in *size*. Alternatively, it is possible
> + * 		to store the difference between the previous and the new values
> + * 		of the header field in *to*, by setting *from* and *size* to 0.
> + * 		For both methods, *offset* indicates the location of the IP
> + * 		checksum within the packet.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_l4_csum_replace(struct sk_buff *skb, u32 offset, u64 from, u64 to, u64 flags)
> + * 	Description
> + * 		Recompute the TCP or UDP checksum for the packet associated to
> + * 		*skb*. Computation is incremental, so the helper must know the
> + * 		former value of the header field that was modified (*from*),
> + * 		the new value of this field (*to*), and the number of bytes (2
> + * 		or 4) for this field, stored on the lowest four bits of
> + * 		*flags*. Alternatively, it is possible to store the difference
> + * 		between the previous and the new values of the header field in
> + * 		*to*, by setting *from* and the four lowest bits of *flags* to
> + * 		0. For both methods, *offset* indicates the location of the IP
> + * 		checksum within the packet. In addition to the size of the
> + * 		field, *flags* can be added (bitwise OR) actual flags. With
> + * 		**BPF_F_MARK_MANGLED_0**, a null checksum is left untouched
> + * 		(unless **BPF_F_MARK_ENFORCE** is added as well), and for
> + * 		updates resulting in a null checksum the value is set to
> + * 		**CSUM_MANGLED_0** instead. Flag **BPF_F_PSEUDO_HDR**
> + * 		indicates the checksum is to be computed against a
> + * 		pseudo-header.
> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)
> + * 	Description
> + * 		This special helper is used to trigger a "tail call", or in
> + * 		other words, to jump into another eBPF program. The contents of
> + * 		eBPF registers and stack are not modified, the new program
> + * 		"inherits" them from the caller. This mechanism allows for

"inherits" is a technically correct, but misleading statement,
since callee program cannot access caller's registers and stack.

> + * 		program chaining, either for raising the maximum number of
> + * 		available eBPF instructions, or to execute given programs in
> + * 		conditional blocks. For security reasons, there is an upper
> + * 		limit to the number of successive tail calls that can be
> + * 		performed.
> + *
> + * 		Upon call of this helper, the program attempts to jump into a
> + * 		program referenced at index *index* in *prog_array_map*, a
> + * 		special map of type **BPF_MAP_TYPE_PROG_ARRAY**, and passes
> + * 		*ctx*, a pointer to the context.
> + *
> + * 		If the call succeeds, the kernel immediately runs the first
> + * 		instruction of the new program. This is not a function call,
> + * 		and it never goes back to the previous program. If the call
> + * 		fails, then the helper has no effect, and the caller continues
> + * 		to run its own instructions. A call can fail if the destination
> + * 		program for the jump does not exist (i.e. *index* is superior
> + * 		to the number of entries in *prog_array_map*), or if the
> + * 		maximum number of tail calls has been reached for this chain of
> + * 		programs. This limit is defined in the kernel by the macro
> + * 		**MAX_TAIL_CALL_CNT** (not accessible to user space), which
> + * 		is currently set to 32.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_clone_redirect(struct sk_buff *skb, u32 ifindex, u64 flags)
> + * 	Description
> + * 		Clone and redirect the packet associated to *skb* to another
> + * 		net device of index *ifindex*. The only flag supported for now
> + * 		is **BPF_F_INGRESS**, which indicates the packet is to be
> + * 		redirected to the ingress interface instead of (by default)
> + * 		egress.

imo the above sentence is prone to misinterpretation.
Can you rephrase it to say that both redirect to ingress and redirect to egress
are supported and flag is used to indicate which path to take ?

> + *
> + * 		A call to this helper is susceptible to change data from the
> + * 		packet. Therefore, at load time, all checks on pointers
> + * 		previously done by the verifier are invalidated and must be
> + * 		performed again.
> + * 	Return
> + * 		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> -- 
> 2.14.1
>