[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55B0E252.2010207@iogearbox.net>
Date: Thu, 23 Jul 2015 14:47:14 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: "Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>,
Alexei Starovoitov <ast@...mgrid.com>
CC: linux-man <linux-man@...r.kernel.org>,
linux-kernel@...r.kernel.org, Silvan Jegen <s.jegen@...il.com>,
Walter Harms <wharms@....de>
Subject: Re: Draft 3 of bpf(2) man page for review
On 07/23/2015 01:23 PM, Michael Kerrisk (man-pages) wrote:
...
>> Btw, a user obviously can close() the map fds if he
>> wants to, but ultimatively they're freed when the program unloads.
>
> Okay. (Not sure if you meant that something should be added to the page.)
I think not necessary.
[...]
>>> The attributes key_size and value_size will be used by the
>>
>> attribute's?
>
> Nope. But I changed this to "The key_size and value_size attributes will be",
> which may read clearer.
Sorry, true, I was a bit confused. :)
[...]
>> The type __u64 is kernel internal, so if there's no strict reason to use it,
>> we should just use what's provided by stdint.h.
>
> Agreed. Done. (By the way, what about all the __u32 and __u64 elements in the
> bpf_attr union?)
I wouldn't change the bpf_attr from the uapi.
Just the provided example code here, I presume people might copy from here when
they build their own library and in userspace uint64_t seems to be more natural.
[...]
>>> * map_update_elem() replaces elements in an non-atomic
>>> fashion; for atomic updates, a hash-table map should be
>>> used instead.
>>
>> This point here is most important, i.e. to not have false user expecations.
>> Maybe it's also worth mentioning that when you have a value_size of sizeof(long),
>> you can however use __sync_fetch_and_add() atomic builtin from the LLVM backend.
>
> I think I'll leave out that detail for the moment.
Ok, I guess we could revisit/clarify that at a later point in time. I'd add
a TODO comment to the source or the like, as this also is related to the 2nd
below use case (aggregation/accounting), where an array is typically used.
>>> Among the uses for array maps are the following:
>>>
>>> * As "global" eBPF variables: an array of 1 element whose
>>> key is (index) 0 and where the value is a collection of
>>> 'global' variables which eBPF programs can use to keep
>>> state between events.
>>>
>>> * Aggregation of tracing events into a fixed set of buck‐
>>> ets.
[...]
>>> * license is a license string, which must be GPL compatible to
>>> call helper functions marked gpl_only.
>>
>> Not strictly. So here, the same rules apply as with kernel modules. I.e. what
>> the kernel checks for are the following license strings:
>>
>> static inline int license_is_gpl_compatible(const char *license)
>> {
>> return (strcmp(license, "GPL") == 0
>> || strcmp(license, "GPL v2") == 0
>> || strcmp(license, "GPL and additional rights") == 0
>> || strcmp(license, "Dual BSD/GPL") == 0
>> || strcmp(license, "Dual MIT/GPL") == 0
>> || strcmp(license, "Dual MPL/GPL") == 0);
>> }
>>
>> With any of them, the eBPF program is declared GPL compatible. Maybe of interest
>> for those that want to use dual licensing of some sort.
>
> So, I'm a little unclear here. What text do you suggest for the page?
Maybe we should mention in addition that the same licensing rules apply as
in case with kernel modules, so also dual licenses could be used.
>>> * log_buf is a pointer to a caller-allocated buffer in which the
>>> in-kernel verifier can store the verification log. This log
>>> is a multi-line string that can be checked by the program
>>> author in order to understand how the verifier came to the
>>> conclusion that the BPF program is unsafe. The format of the
>>> output can change at any time as the verifier evolves.
>>>
>>> * log_size size of the buffer pointed to by log_bug. If the
>>> size of the buffer is not large enough to store all verifier
>>> messages, -1 is returned and errno is set to ENOSPC.
>>>
>>> * log_level verbosity level of the verifier. A value of zero
>>> means that the verifier will not provide a log.
>>
>> Note that the log buffer is optional as mentioned here log_level = 0. The
>> above example code of bpf_prog_load() suggests that it always needs to be
>> provided.
>>
>> I once ran indeed into an issue where the program itself was correct, but
>> it got rejected by the kernel, because my log buffer size was too small, so
>> in tc, we now have it larger as bpf_log_buf[65536] ...
>
> So, I'm not clear. Do you mean that some piece of text here in the page
> should be changed? If so, could elaborate?
I'd maybe only mention in addition that in log_level=0 case, we also must not
provide a log_buf and log_size, otherwise we get EINVAL.
[...]
>> I had to read this twice. ;) Maybe this needs to be reworded slightly.
>>
>> It just means that depending on the program type that the author selects,
>> you might end up with a different subset of helper functions, and a
>> different program input/context. For example tracing does not have the
>> exact same helpers as socket filters (it might have some that can be used
>> by both). Also, the eBPF program input (context) for socket filters is a
>> network packet, wheras for tracing you operate on a set of registers.
>
> Changed. Now we have:
>
> eBPF program types
> The eBPF program type (prog_type) determines the subset of a ker‐
> nel helper functions that the program may call. The program type
s/a//
> also determines dthe program input (context)—the format of struct
s/dthe/the/
> bpf_context (which is the data blob passed into the eBPF program
> as the first argument).
>
> For example, a tracing program does not have the exact same sub‐
> set of helper functions as a socket filter program (though they
> may have some helpers in common). Similarly, the input (context)
> for a tracing program is a set of register values, while for a
> socket filter it is a network packet.
>
> The set of functions available to eBPF programs of a given type
> may increase in the future.
That's fine with me.
[...]
>> I would also make a note about the JIT compiler here, i.e. that it's disabled
>> by default, and can be enabled via:
>>
>> * Normal mode: echo 1 > /proc/sys/net/core/bpf_jit_enable
>>
>> * Debugging mode: echo 2 > /proc/sys/net/core/bpf_jit_enable
>> [opcodes dumped in hex into the kernel log, which can then be disassembled
>
> Here, I assume you mean thet the generated (native) opcodes are dumpeed, right?
Yes.
>> with tools/net/bpf_jit_disasm.c from the kernel tree]
>>
>> When enabled, after a eBPF program gets loaded, it's transparently compiled /
>> translated inside the kernel into machine opcodes for better performance,
>> currently on x86_64, arm64 and s390.
>
> According to Documentation/networking/filter.txt the JIT compiler supports
> many more architectures:
>
> The Linux kernel has a built-in BPF JIT compiler for x86_64,
> SPARC, PowerPC, ARM, ARM64, MIPS and s390 and can be enabled
> through CONFIG_BPF_JIT.
>
> Or am I misunderstanding something?
The others only work for cBPF and have not (yet) be converted over to eBPF.
For the three mentioned above, the kernel internally migrates cBPF into eBPF
instructions and then JITs the eBPF result eventually.
> I added the following:
>
> The kernel contains a just-in-time (JIT) compiler that translates
> eBPF bytecode into native machine code for better performance.
> The JIT compiler is disabled by default, but its operation can be
> controlled by writing one of the following values to
> /proc/sys/net/core/bpf_jit_enable:
>
> 0 Disable JIT compilation (default).
>
> 1 Normal compilation.
>
> 2 Debugging mode. The generated opcodes are dumped in hexadeci‐
> mal into the kernel log. These opcodes can then be disassem‐
> bled using the program tools/net/bpf_jit_disasm.c provided in
> the kernel source tree.
>
>>> SEE ALSO
>>> seccomp(2), socket(7), tc(8), tc-bpf(8)
>>>
>>> Both classic and extended BPF are explained in the kernel source
>>> file Documentation/networking/filter.txt.
>>>
>>
Rest looks good for an initial version!
Thanks,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists