[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <84c86483-9c00-0ace-ca95-26613e25d7a2@fb.com>
Date: Fri, 1 Mar 2019 23:02:41 +0000
From: Yonghong Song <yhs@...com>
To: Daniel Borkmann <daniel@...earbox.net>,
Alexei Starovoitov <ast@...com>
CC: "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"joe@...d.net.nz" <joe@...d.net.nz>,
"john.fastabend@...il.com" <john.fastabend@...il.com>,
"tgraf@...g.ch" <tgraf@...g.ch>, Andrii Nakryiko <andriin@...com>,
"jakub.kicinski@...ronome.com" <jakub.kicinski@...ronome.com>,
"lmb@...udflare.com" <lmb@...udflare.com>
Subject: Re: [PATCH bpf-next v2 1/7] bpf: implement lookup-free direct value
access
On 3/1/19 11:51 AM, Daniel Borkmann wrote:
> On 03/01/2019 06:18 PM, Yonghong Song wrote:
>> On 2/28/19 3:18 PM, Daniel Borkmann wrote:
>>> This generic extension to BPF maps allows for directly loading an
>>> address residing inside a BPF map value as a single BPF ldimm64
>>> instruction.
>>>
>>> The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
>>> is a special src_reg flag for ldimm64 instruction that indicates
>>> that inside the first part of the double insns's imm field is a
>>> file descriptor which the verifier then replaces as a full 64bit
>>> address of the map into both imm parts.
>>>
>>> For the newly added BPF_PSEUDO_MAP_VALUE src_reg flag, the idea
>>> is similar: the first part of the double insns's imm field is
>>> again a file descriptor corresponding to the map, and the second
>>> part of the imm field is an offset. The verifier will then replace
>>> both imm parts with an address that points into the BPF map value
>>> for maps that support this operation. BPF_PSEUDO_MAP_VALUE is a
>>> distinct flag as otherwise with BPF_PSEUDO_MAP_FD we could not
>>> differ offset 0 between load of map pointer versus load of map's
>>> value at offset 0.
>>>
>>> This allows for efficiently retrieving an address to a map value
>>> memory area without having to issue a helper call which needs to
>>> prepare registers according to calling convention, etc, without
>>> needing the extra NULL test, and without having to add the offset
>>> in an additional instruction to the value base pointer.
>>>
>>> The verifier then treats the destination register as PTR_TO_MAP_VALUE
>>> with constant reg->off from the user passed offset from the second
>>> imm field, and guarantees that this is within bounds of the map
>>> value. Any subsequent operations are normally treated as typical
>>> map value handling without anything else needed for verification.
>>>
>>> The two map operations for direct value access have been added to
>>> array map for now. In future other types could be supported as
>>> well depending on the use case. The main use case for this commit
>>> is to allow for BPF loader support for global variables that
>>> reside in .data/.rodata/.bss sections such that we can directly
>>> load the address of them with minimal additional infrastructure
>>> required. Loader support has been added in subsequent commits for
>>> libbpf library.
>>
>> The patch version #1 provides a way to replace the load with
>> immediate (presumably read-only data). This will be good for
>> the use case like below:
>>
>> if (static_variable_kernel_version == V1) {
>> /* code here will work for kernel V1 */
>> ... access helpers available for V1 ...
>> } else if (static_variable_kernel_version == V2) {
>> /* code here will work for kernel V2 */
>> ... access helpers available for V2 ...
>> }
>>
>> The approach here did not replace the map value access with values from
>> e.g., readonly section for which libbpf could provide an interface to
>> fill in data from user.
>>
>> This may require a little more analysis, e.g.,
>> ptr = ld_imm64 from a readonly section
>> ...
>> *(u32 *)ptr;
>> *(u64 *)(ptr + 8);
>> ...
>>
>> Do you think we could do this in kernel verifier or we should
>> push the whole readonly stuff into user space?
>
> And in your case the static_variable_kernel_version would be determined
> at runtime, for example, where you then would want to eliminate all the
> other branches, right? Meaning, you'd need a way to turn this into a imm
> load such that verifier will detect these dead branches and patch them
Yes, the program will be compiled once and deployed to many hosts with
different kernel versions. Different hosts may have different kernel
versions. The static_variable_kernel_version is determined
> out, which it should already be able to do. How would you mark these
> special vars like static_variable_kernel_version such that they have
> special treatment from the rest, some sort of builtin? Potentially one
A libbpf API is needed to assign a particular value to a readonly
section value. For example, a bpf program may look like:
-bash-4.4$ cat g1.c
static volatile const unsigned __kernel_version;
int prog() {
unsigned kernel_ver = __kernel_version;
if (kernel_ver == 411)
return 0;
else if (kernel_ver == 416)
return 1;
return 2;
}
-bash-4.4$ clang -target bpf -O2 -c g1.c
-bash-4.4$ llvm-readelf -r g1.o
Relocation section '.rel.text' at offset 0x178 contains 1 entries:
Offset Info Type Symbol's
Value Symbol's Name
0000000000000000 0000000500000001 R_BPF_64_64
0000000000000000 .rodata
-bash-4.4$ llvm-objdump -d g1.o
g1.o: file format ELF64-BPF
Disassembly of section .text:
0000000000000000 prog:
0: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
r1 = 0 ll
2: 61 11 00 00 00 00 00 00 r1 = *(u32 *)(r1 + 0)
3: b7 02 00 00 01 00 00 00 r2 = 1
4: 15 01 01 00 a0 01 00 00 if r1 == 416 goto +1
<LBB0_2>
5: b7 02 00 00 02 00 00 00 r2 = 2
0000000000000030 LBB0_2:
6: b7 00 00 00 00 00 00 00 r0 = 0
7: 15 01 01 00 9b 01 00 00 if r1 == 411 goto +1
<LBB0_4>
8: bf 20 00 00 00 00 00 00 r0 = r2
0000000000000048 LBB0_4:
9: 95 00 00 00 00 00 00 00 exit
-bash-4.4$ llvm-readelf -S g1.o
There are 9 section headers, starting at offset 0x1f8:
Section Headers:
[Nr] Name Type Address Off Size
ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000
00 0 0 0
[ 1] .strtab STRTAB 0000000000000000 000189 000068
00 0 0 1
[ 2] .text PROGBITS 0000000000000000 000040 000050
00 AX 0 0 8
[ 3] .rel.text REL 0000000000000000 000178 000010
10 8 2 8
[ 4] .rodata PROGBITS 0000000000000000 000090 000004
00 A 0 0 4
[ 5] .BTF PROGBITS 0000000000000000 000094 000019
00 0 0 1
[ 6] .BTF.ext PROGBITS 0000000000000000 0000ad 000020
00 0 0 1
[ 7] .llvm_addrsig LLVM_ADDRSIG 0000000000000000 000188 000001
00 E 8 0 1
[ 8] .symtab SYMTAB 0000000000000000 0000d0 0000a8
18 1 6 8
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
-bash-4.4$ llvm-readelf -s g1.o
Symbol table '.symtab' contains 7 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS g1.c
2: 0000000000000030 0 NOTYPE LOCAL DEFAULT 2 LBB0_2
3: 0000000000000048 0 NOTYPE LOCAL DEFAULT 2 LBB0_4
4: 0000000000000000 4 OBJECT LOCAL DEFAULT 4 __kernel_version
5: 0000000000000000 0 SECTION LOCAL DEFAULT 4 .rodata
6: 0000000000000000 80 FUNC GLOBAL DEFAULT 2 prog
-bash-4.4$
The relocation is for the first insn.
The address is the start of .rodata section, which happens to
match variable __kernel_version (size 4).
The libbpf API can provide a way for user to assign a value to readonly
section. In this particular case, e.g., on HostA, __kernel_version is
assigned to 416, which means the first 4 bytes of .rodata is modified
to have value 416. Considering this is a generic interface, the API
may look like
bpf_object__change_readonly_value(const char *var_name, void *val_buf,
unsigned var_buf_size);
The libbpf will change the value if there is a "var_name" in rodata
section and val_buf_size matches the size in the symbol table.
> could get away with doing this from loader side if it's simple enough,
> though one thing that would be good to avoid is to duplicate all the
> complex branch fixup logic etc that we have in kernel already. Are you
I totally agree that kernel is already able to prune dead codes while
maintaining correct func/line info. We should do that part in kernel.
Let us look at the byte codes,
0000000000000000 prog:
0: r1 = 0 ll
2: r1 = *(u32 *)(r1 + 0)
3: r2 = 1
4: if r1 == 416 goto +1 <LBB0_2>
5: r2 = 2
0000000000000030 LBB0_2:
6: r0 = 0
7: if r1 == 411 goto +1 <LBB0_4>
8: r0 = r2
0000000000000048 LBB0_4:
9: exit
Here, the goal is to let r1 at insn #2 get the constant.
Do you think we can get it from the kernel? In this particular case,
insn #0, get a romap_ptr with addr of rodata section offset 0,
insn #2, load u32 from romap offset 0, the value is already populated,
e.g., 416.
The verifier is path sensitive, will need extra care to
perform such transformation in case it is invalid in different paths.
Maybe slightly extension of verifier is able to do this?
Initially we do not need to handle complicated cases. Most global/static
variable accesses are all like
r1 = #num ll
r1 = *(type *)(r1 + offset)
If there is branch into the middle of the above pair of insns
and r1 is romap_ptr, it is totally safe to replace the second insn
as r1 = constant which can enable later dead code elimination.
If all read only region access are converted to constants,
"r1 = #num ll" ld_imm64 insns can be removed as well.
> thinking to mark these via BTF in some way such that loader does inline
> replacement?
I have not thought about BTF. BTF could provide information about insn
#2 referring to a particular readonly section location. But looks like
verifier is able to track it as well in the above?
Let us first study whether without BTF is okay. If needed, we can go
through BTF path with compiler assistance.
>
> Thanks,
> Daniel
>
Powered by blists - more mailing lists