linux-kernel - Re: Draft 3 of bpf(2) man page for review

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55B0EDD7.8020407@gmail.com>
Date:	Thu, 23 Jul 2015 15:36:23 +0200
From:	"Michael Kerrisk (man-pages)" <mtk.manpages@...il.com>
To:	Daniel Borkmann <daniel@...earbox.net>,
	Alexei Starovoitov <ast@...mgrid.com>
CC:	mtk.manpages@...il.com, linux-man <linux-man@...r.kernel.org>,
	linux-kernel@...r.kernel.org, Silvan Jegen <s.jegen@...il.com>,
	Walter Harms <wharms@....de>
Subject: Re: Draft 3 of bpf(2) man page for review

Hi Daniel,

On 07/23/2015 02:47 PM, Daniel Borkmann wrote:
> On 07/23/2015 01:23 PM, Michael Kerrisk (man-pages) wrote:
> ...
>>> Btw, a user obviously can close() the map fds if he
>>> wants to, but ultimatively they're freed when the program unloads.
>>
>> Okay. (Not sure if you meant that something should be added to the page.)
> 
> I think not necessary.

Okay.


> [...]
>>>>                 The attributes key_size and value_size will be used by the
>>>
>>> attribute's?
>>
>> Nope. But I changed this to "The key_size and value_size attributes will be",
>> which may read clearer.
> 
> Sorry, true, I was a bit confused. :)

NP.

> [...]
>>> The type __u64 is kernel internal, so if there's no strict reason to use it,
>>> we should just use what's provided by stdint.h.
>>
>> Agreed. Done. (By the way, what about all the __u32 and __u64 elements in the
>> bpf_attr union?)
> 
> I wouldn't change the bpf_attr from the uapi.

Okay.

> Just the provided example code here, I presume people might copy from here when
> they build their own library and in userspace uint64_t seems to be more natural.

Yup.

> [...]
>>>>                 *  map_update_elem()  replaces  elements  in an non-atomic
>>>>                    fashion; for atomic updates, a hash-table map should be
>>>>                    used instead.
>>>
>>> This point here is most important, i.e. to not have false user expecations.
>>> Maybe it's also worth mentioning that when you have a value_size of sizeof(long),
>>> you can however use __sync_fetch_and_add() atomic builtin from the LLVM backend.
>>
>> I think I'll leave out that detail for the moment.
> 
> Ok, I guess we could revisit/clarify that at a later point in time. I'd add
> a TODO comment to the source or the like, as this also is related to the 2nd
> below use case (aggregation/accounting), where an array is typically used.

Okay. FIXME added.

>>>>                 Among the uses for array maps are the following:
>>>>
>>>>                 *  As "global" eBPF variables: an array of 1 element whose
>>>>                    key is (index) 0 and where the value is a collection of
>>>>                    'global'  variables which eBPF programs can use to keep
>>>>                    state between events.
>>>>
>>>>                 *  Aggregation of tracing events into a fixed set of buck‐
>>>>                    ets.
> 
> [...]
>>>>          *  license is a license string, which must be GPL  compatible  to
>>>>             call helper functions marked gpl_only.
>>>
>>> Not strictly. So here, the same rules apply as with kernel modules. I.e. what
>>> the kernel checks for are the following license strings:
>>>
>>> static inline int license_is_gpl_compatible(const char *license)
>>> {
>>> 	return (strcmp(license, "GPL") == 0
>>> 		|| strcmp(license, "GPL v2") == 0
>>> 		|| strcmp(license, "GPL and additional rights") == 0
>>> 		|| strcmp(license, "Dual BSD/GPL") == 0
>>> 		|| strcmp(license, "Dual MIT/GPL") == 0
>>> 		|| strcmp(license, "Dual MPL/GPL") == 0);
>>> }
>>>
>>> With any of them, the eBPF program is declared GPL compatible. Maybe of interest
>>> for those that want to use dual licensing of some sort.
>>
>> So, I'm a little unclear here. What text do you suggest for the page?
> 
> Maybe we should mention in addition that the same licensing rules apply as
> in case with kernel modules, so also dual licenses could be used.

Done.

>>>>          *  log_buf is a pointer to a caller-allocated buffer in which the
>>>>             in-kernel verifier can store the verification log.   This  log
>>>>             is  a  multi-line  string  that  can be checked by the program
>>>>             author in order to understand how the  verifier  came  to  the
>>>>             conclusion  that the BPF program is unsafe.  The format of the
>>>>             output can change at any time as the verifier evolves.
>>>>
>>>>          *  log_size size of the buffer pointed to  by  log_bug.   If  the
>>>>             size  of  the buffer is not large enough to store all verifier
>>>>             messages, -1 is returned and errno is set to ENOSPC.
>>>>
>>>>          *  log_level verbosity level of the verifier.  A  value  of  zero
>>>>             means that the verifier will not provide a log.
>>>
>>> Note that the log buffer is optional as mentioned here log_level = 0. The
>>> above example code of bpf_prog_load() suggests that it always needs to be
>>> provided.
>>>
>>> I once ran indeed into an issue where the program itself was correct, but
>>> it got rejected by the kernel, because my log buffer size was too small, so
>>> in tc, we now have it larger as bpf_log_buf[65536] ...
>>
>> So, I'm not clear. Do you mean that some piece of text here in the page
>> should be changed? If so, could elaborate?
> 
> I'd maybe only mention in addition that in log_level=0 case, we also must not
> provide a log_buf and log_size, otherwise we get EINVAL.

I changed the text to:

       *  log_level  verbosity  level  of the verifier.  A value of zero
          means that the verifier will not provide a log; in this  case,
          log_buf must be a NULL pointer, and log_size must be zero.

> [...]
>>> I had to read this twice. ;) Maybe this needs to be reworded slightly.
>>>
>>> It just means that depending on the program type that the author selects,
>>> you might end up with a different subset of helper functions, and a
>>> different program input/context. For example tracing does not have the
>>> exact same helpers as socket filters (it might have some that can be used
>>> by both). Also, the eBPF program input (context) for socket filters is a
>>> network packet, wheras for tracing you operate on a set of registers.
>>
>> Changed. Now we have:
>>
>>     eBPF program types
>>         The eBPF program type (prog_type) determines the subset of a ker‐
>>         nel helper functions that the program may call.  The program type
> 
> s/a//

Fixed.

>>         also determines dthe program input (context)—the format of struct
> 
> s/dthe/the/

Fixed.

>>         bpf_context (which is the data blob passed into the eBPF  program
>>         as the first argument).
>>
>>         For  example, a tracing program does not have the exact same sub‐
>>         set of helper functions as a socket filter program  (though  they
>>         may have some helpers in common).  Similarly, the input (context)
>>         for a tracing program is a set of register values,  while  for  a
>>         socket filter it is a network packet.
>>
>>         The  set  of functions available to eBPF programs of a given type
>>         may increase in the future.
> 
> That's fine with me.

Okay.

> [...]
>>> I would also make a note about the JIT compiler here, i.e. that it's disabled
>>> by default, and can be enabled via:
>>>
>>> * Normal mode: echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>
>>> * Debugging mode: echo 2 > /proc/sys/net/core/bpf_jit_enable
>>>     [opcodes dumped in hex into the kernel log, which can then be disassembled
>>
>> Here, I assume you mean thet the generated (native) opcodes are dumpeed, right?
> 
> Yes.
> 
>>>      with tools/net/bpf_jit_disasm.c from the kernel tree]
>>>
>>> When enabled, after a eBPF program gets loaded, it's transparently compiled /
>>> translated inside the kernel into machine opcodes for better performance,
>>> currently on x86_64, arm64 and s390.
>>
>> According to Documentation/networking/filter.txt the JIT compiler supports
>> many more architectures:
>>
>>      The Linux kernel has a built-in BPF JIT compiler for x86_64,
>>      SPARC, PowerPC, ARM, ARM64, MIPS and s390 and can be enabled
>>      through CONFIG_BPF_JIT.
>>
>> Or am I misunderstanding something?
> 
> The others only work for cBPF and have not (yet) be converted over to eBPF.
> 
> For the three mentioned above, the kernel internally migrates cBPF into eBPF
> instructions and then JITs the eBPF result eventually.

Thanks for clearing that up -- I added the following sentence

    JIT compiler for eBPF is currently available for the x86-64, arm64,
    and s390 architectures.

Okay?

> 
>> I added the following:
>>
>>         The kernel contains a just-in-time (JIT) compiler that translates
>>         eBPF  bytecode  into  native machine code for better performance.
>>         The JIT compiler is disabled by default, but its operation can be
>>         controlled   by   writing   one   of   the  following  values  to
>>         /proc/sys/net/core/bpf_jit_enable:
>>
>>         0  Disable JIT compilation (default).
>>
>>         1  Normal compilation.
>>
>>         2  Debugging mode.  The generated opcodes are dumped in hexadeci‐
>>            mal  into the kernel log.  These opcodes can then be disassem‐
>>            bled using the program tools/net/bpf_jit_disasm.c provided  in
>>            the kernel source tree.
>>
>>>> SEE ALSO
>>>>          seccomp(2), socket(7), tc(8), tc-bpf(8)
>>>>
>>>>          Both classic and extended BPF are explained in the kernel  source
>>>>          file Documentation/networking/filter.txt.
>>>>
>>>
> 
> Rest looks good for an initial version!

Yup!

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/