Message-ID: <5eeb0e5dcb010_8712abba49be5bc91@john-XPS-13-9370.notmuch>
Date: Wed, 17 Jun 2020 23:49:01 -0700
From: John Fastabend <john.fastabend@...il.com>
To: Andrii Nakryiko <andriin@...com>, bpf@...r.kernel.org,
netdev@...r.kernel.org, ast@...com, daniel@...earbox.net
Cc: andrii.nakryiko@...il.com, kernel-team@...com,
Andrii Nakryiko <andriin@...com>
Subject: RE: [PATCH bpf-next 1/2] bpf: switch most helper return values from
32-bit int to 64-bit long
Andrii Nakryiko wrote:
> Switch most of the BPF helper definitions from returning int to long. These
> definitions come from comments in the BPF UAPI header and are used to
> generate bpf_helper_defs.h (under libbpf), which is later included and used from
> BPF programs.
>
> In the actual in-kernel implementation, all the helpers are defined as returning
> u64, but due to historical reasons, most of them are actually declared as
> returning int in the UAPI (usually to return 0 on success and a negative value
> on error).
Could we change the helper side to return the correct types now? Meaning, if the
UAPI claims it's an int, let's actually return the int.
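For concreteness, here is roughly what this boils down to in the generated
bpf_helper_defs.h, using bpf_probe_read_kernel_str (helper 115 in the listings
below) as the example. This is only a sketch based on the patch description;
the exact generated text may differ slightly:

    /* before: generated from the UAPI comment that declares 'int' */
    static int (*bpf_probe_read_kernel_str)(void *dst, __u32 size,
                                            const void *unsafe_ptr) = (void *) 115;

    /* after this patch: the UAPI comment (and thus the generated declaration)
     * says 'long', matching the 64-bit value the kernel actually puts into r0
     */
    static long (*bpf_probe_read_kernel_str)(void *dst, __u32 size,
                                             const void *unsafe_ptr) = (void *) 115;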
>
> This actually causes Clang to quite often generate sub-optimal code, because
> the compiler believes that the return value is 32-bit, and in a lot of cases it
> has to be up-converted (usually with a pair of 32-bit bit shifts) to a 64-bit
> value before it can be used further in BPF code.
>
> Besides just "polluting" the code, these 32-bit shifts quite often cause
> problems for cases in which the return value matters. This is especially the case
> for the family of bpf_probe_read_str() functions. There are a few other similar
> helpers (e.g., bpf_read_branch_records()), in which the return value is used by
> BPF program logic to record variable-length data and process it. For such
> cases, BPF program logic carefully manages offsets within some array or map to
> read variable-length data. For such uses, it's crucial for the BPF verifier to
> track the possible range of register values to prove that all the accesses happen
> within the given memory bounds. Those extraneous zero-extending bit shifts,
> inserted by Clang (and quite often interleaved with other code, which makes
> the issues even more challenging and sometimes requires employing extra
> per-variable compiler barriers), throw off the verifier's logic and make it mark
> registers as having an unknown variable offset. We'll study this pattern a bit
> later below.
With the latest verifier, zext with alu32 support should be implemented as a
MOV insn.
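As an aside, the "per-variable compiler barriers" mentioned above are usually
just an empty inline-asm statement that pins the value in a register, roughly
like this sketch (the macro name is illustrative; the variable names are from
the Case 1 example further down):

    /* empty asm with a "+r" constraint: forces clang to keep 'var' in a
     * register as-is instead of re-deriving it around the bounds check
     */
    #define barrier_var(var) asm volatile("" : "+r"(var))

    len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
    barrier_var(len);
    if (len <= MAX_SZ1) {
            payload += len;
            sample->len1 = len;
    }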
>
> Another common pattern is to check the return of a BPF helper for a non-zero
> value to detect error conditions and attempt alternative actions in such a case. Even in
> this simple and straightforward case, the 32-bit vs BPF's native 64-bit mismatch
> quite often leads to sub-optimal and unnecessary extra code. We'll look at
> this pattern as well.
>
> Clang's BPF target supports two modes of code generation: ALU32, in which it
> is capable of using the lower 32-bit parts of registers, and no-ALU32, in which
> only full 64-bit registers are used. ALU32 mode somewhat mitigates the
> above-described problems, but not in all cases.
A bit curious, do you see users running with no-ALU32 support? I have enabled
it by default now. It seems to generate better code, and with the latest 32-bit
bounds tracking I haven't hit any issues with the verifier.
>
> This patch switches all the cases in which BPF helpers return 0 or negative
> error from returning int to returning long. It is shown below that such a change
> in definition leads to equivalent or better code. No-ALU32 mode benefits more,
> but ALU32 mode doesn't degrade and in some cases still gets improved code generation.
>
> Another class of cases switched from int to long is the bpf_probe_read_str()-like
> helpers, which encode the successful case as a non-negative value, while still
> returning a negative value for errors.
>
> In all such cases, correctness is preserved due to the two's complement
> encoding of negative values and the fact that all helpers return values that
> fit in 32 bits. Two's complement ensures that for negative values the higher
> 32 bits are all ones and, when truncated, leave a valid negative 32-bit value
> with the same value. Non-negative values have their upper 32 bits set to zero
> and similarly preserve their value when the high 32 bits are truncated. This
> means that just casting to int/u32 is correct and efficient (and in ALU32 mode
> doesn't require any extra shifts).
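A quick user-space illustration of that truncation argument (assuming 64-bit
long, as on the BPF target):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
            long err = -14;   /* e.g., -EFAULT returned by a helper */
            long len = 240;   /* e.g., bytes copied by a _str() helper */

            /* negatives have all-ones upper 32 bits, non-negatives all-zeros,
             * so truncating to 32 bits preserves the value in both cases
             */
            assert((uint32_t)((uint64_t)err >> 32) == 0xffffffffu);
            assert((int)err == -14);
            assert((uint32_t)((uint64_t)len >> 32) == 0);
            assert((int)len == 240);
            return 0;
    }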
>
> To minimize the chances of regressions, two code patterns were investigated,
> as mentioned above. For both patterns, BPF assembly was analyzed in
> ALU32/NO-ALU32 compiler modes, both with the current 32-bit int return type and
> the new 64-bit long return type.
>
> Case 1. Variable-length data reading and concatenation. This is quite a
> ubiquitous pattern in tracing/monitoring applications, reading data like a
> process's environment variables, file paths, etc. In such a case, many pieces of
> string-like variable-length data are read into a single big buffer, and at the
> end of the process, only the part of the array containing actual data is sent to
> user-space for further processing. This case is tested in the test_varlen.c
> selftest (in the next patch). The code flow is roughly as follows:
>
>     void *payload = &sample->payload;
>     u64 len;
>
>     len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
>     if (len <= MAX_SZ1) {
>         payload += len;
>         sample->len1 = len;
>     }
>     len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
>     if (len <= MAX_SZ2) {
>         payload += len;
>         sample->len2 = len;
>     }
>     /* and so on */
>     sample->total_len = payload - &sample->payload;
>     /* send over, e.g., perf buffer */
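For context, the snippet assumes declarations roughly like the ones below; the
names mirror the commit message, but the sizes and field types are purely
illustrative and not taken from the patch or test_varlen.c:

    #define MAX_SZ1 256
    #define MAX_SZ2 256

    struct sample {
            int len1;
            int len2;
            int total_len;
            char payload[MAX_SZ1 + MAX_SZ2];
    };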
>
> There could be two variations with slightly different code generated: when len
> is a 64-bit integer and when it is a 32-bit integer. Both variations were analyzed.
> BPF assembly instructions between two successive invocations of
> bpf_probe_read_kernel_str() were used to check for code regressions. Results are
> below, followed by a short analysis. The left side uses helpers with an int return
> type, the right one is after the switch to long.
>
> ALU32 + INT                           ALU32 + LONG
> ===========                           ============
>
> 64-BIT (13 insns):                    64-BIT (10 insns):
> ------------------------------------  ------------------------------------
> 17: call 115                          17: call 115
> 18: if w0 > 256 goto +9 <LBB0_4>      18: if r0 > 256 goto +6 <LBB0_4>
> 19: w1 = w0                           19: r1 = 0 ll
> 20: r1 <<= 32                         21: *(u64 *)(r1 + 0) = r0
> 21: r1 s>>= 32                        22: r6 = 0 ll
What version of clang is this? That is probably a zext in the LLVM IR that the
latest release should lower to just the 'w1 = w0'. I'm guessing (hoping?) you
might not have the latest?
> 22: r2 = 0 ll                         24: r6 += r0
> 24: *(u64 *)(r2 + 0) = r1             00000000000000c8 <LBB0_4>:
> 25: r6 = 0 ll                         25: r1 = r6
> 27: r6 += r1                          26: w2 = 256
> 00000000000000e0 <LBB0_4>:            27: r3 = 0 ll
> 28: r1 = r6                           29: call 115
> 29: w2 = 256
> 30: r3 = 0 ll
> 32: call 115
>
> 32-BIT (11 insns):                    32-BIT (12 insns):
> ------------------------------------  ------------------------------------
> 17: call 115                          17: call 115
> 18: if w0 > 256 goto +7 <LBB1_4>      18: if w0 > 256 goto +8 <LBB1_4>
> 19: r1 = 0 ll                         19: r1 = 0 ll
> 21: *(u32 *)(r1 + 0) = r0             21: *(u32 *)(r1 + 0) = r0
> 22: w1 = w0                           22: r0 <<= 32
> 23: r6 = 0 ll                         23: r0 >>= 32
> 25: r6 += r1                          24: r6 = 0 ll
> 00000000000000d0 <LBB1_4>:            26: r6 += r0
> 26: r1 = r6                           00000000000000d8 <LBB1_4>:
> 27: w2 = 256                          27: r1 = r6
> 28: r3 = 0 ll                         28: w2 = 256
> 30: call 115                          29: r3 = 0 ll
>                                       31: call 115
>
> In ALU32 mode, the variant using a 64-bit length variable clearly wins and
> avoids unnecessary zero-extension bit shifts. In practice, this is even more
> important, because BPF code won't need to do extra checks to "prove" to the
> verifier that payload/len are within bounds.
I bet with the latest clang the shifts are removed. But if not, we probably
should fix clang regardless of whether helpers return longs or ints.
>
> 32-bit len is one instruction longer. Clang decided to do the 64-to-32 casting
> with two bit shifts, instead of the equivalent `w1 = w0` assignment. The former
> uses an extra register. The latter might potentially lose some range information,
> but not for a 32-bit value. So in this case, the verifier infers that r0 is [0, 256]
> after the check at 18:, and shifting 32 bits left/right keeps that range intact.
> We should probably look into Clang's logic and see why it chooses bit shifts
> over sub-register assignments for this.
>
> NO-ALU32 + INT                        NO-ALU32 + LONG
> ==============                        ===============
>
> 64-BIT (14 insns):                    64-BIT (10 insns):
> ------------------------------------  ------------------------------------
> 17: call 115                          17: call 115
> 18: r0 <<= 32                         18: if r0 > 256 goto +6 <LBB0_4>
> 19: r1 = r0                           19: r1 = 0 ll
> 20: r1 >>= 32                         21: *(u64 *)(r1 + 0) = r0
> 21: if r1 > 256 goto +7 <LBB0_4>      22: r6 = 0 ll
> 22: r0 s>>= 32                        24: r6 += r0
> 23: r1 = 0 ll                         00000000000000c8 <LBB0_4>:
> 25: *(u64 *)(r1 + 0) = r0             25: r1 = r6
> 26: r6 = 0 ll                         26: r2 = 256
> 28: r6 += r0                          27: r3 = 0 ll
> 00000000000000e8 <LBB0_4>:            29: call 115
> 29: r1 = r6
> 30: r2 = 256
> 31: r3 = 0 ll
> 33: call 115
>
> 32-BIT (13 insns):                    32-BIT (13 insns):
> ------------------------------------  ------------------------------------
> 17: call 115                          17: call 115
> 18: r1 = r0                           18: r1 = r0
> 19: r1 <<= 32                         19: r1 <<= 32
> 20: r1 >>= 32                         20: r1 >>= 32
> 21: if r1 > 256 goto +6 <LBB1_4>      21: if r1 > 256 goto +6 <LBB1_4>
> 22: r2 = 0 ll                         22: r2 = 0 ll
> 24: *(u32 *)(r2 + 0) = r0             24: *(u32 *)(r2 + 0) = r0
> 25: r6 = 0 ll                         25: r6 = 0 ll
> 27: r6 += r1                          27: r6 += r1
> 00000000000000e0 <LBB1_4>:            00000000000000e0 <LBB1_4>:
> 28: r1 = r6                           28: r1 = r6
> 29: r2 = 256                          29: r2 = 256
> 30: r3 = 0 ll                         30: r3 = 0 ll
> 32: call 115                          32: call 115
>
> In NO-ALU32 mode, for the case of a 64-bit len variable, Clang generates far
> superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
> len, the code is identical.
Right, I can't think of any way clang can avoid it here. OTOH, I fix this
by enabling alu32 ;)
>
> So overall, only the ALU32 32-bit len case is more-or-less equivalent, and the
> difference stems from an internal Clang decision, rather than the compiler lacking
> enough information about types.
>
> Case 2. Let's look at the simpler case of checking the return result of a BPF
> helper for errors. The code is very simple:
>
>     long bla;
>     if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
>         return 1;
>     else
>         return 0;
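For anyone who wants to reproduce the comparison, a minimal self-contained
version of this snippet could look like the sketch below (the section name,
program name, and license line are my additions, not from the patch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("raw_tracepoint/sys_enter")
    int check_helper_ret(void *ctx)
    {
            long bla;

            /* only the zero/non-zero result of the helper matters here */
            if (bpf_probe_read_kernel(&bla, sizeof(bla), (const void *)0))
                    return 1;
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";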
>
> ALU32 + CHECK (9 insns)               ALU32 + CHECK (9 insns)
> ====================================  ====================================
> 0: r1 = r10                           0: r1 = r10
> 1: r1 += -8                           1: r1 += -8
> 2: w2 = 8                             2: w2 = 8
> 3: r3 = 0                             3: r3 = 0
> 4: call 113                           4: call 113
> 5: w1 = w0                            5: r1 = r0
> 6: w0 = 1                             6: w0 = 1
> 7: if w1 != 0 goto +1 <LBB2_2>        7: if r1 != 0 goto +1 <LBB2_2>
> 8: w0 = 0                             8: w0 = 0
> 0000000000000048 <LBB2_2>:            0000000000000048 <LBB2_2>:
> 9: exit                               9: exit
>
> Almost identical code; the only difference is the use of a full register
> assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
> architectures, the new BPF assembly might be slightly less optimal, in theory. But
> one can argue that's not a big issue, given that use of full registers is
> still prevalent (e.g., for parameter passing).
>
> NO-ALU32 + CHECK (11 insns)           NO-ALU32 + CHECK (9 insns)
> ====================================  ====================================
> 0: r1 = r10                           0: r1 = r10
> 1: r1 += -8                           1: r1 += -8
> 2: r2 = 8                             2: r2 = 8
> 3: r3 = 0                             3: r3 = 0
> 4: call 113                           4: call 113
> 5: r1 = r0                            5: r1 = r0
> 6: r1 <<= 32                          6: r0 = 1
> 7: r1 >>= 32                          7: if r1 != 0 goto +1 <LBB2_2>
> 8: r0 = 1                             8: r0 = 0
> 9: if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
> 10: r0 = 0                            9: exit
> 0000000000000058 <LBB2_2>:
> 11: exit
>
> NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
> shifts.
It seems a win for the NO-ALU32 case, but for the +ALU32 case I think it's
the same with the latest clang, although I haven't tried yet. I was actually
considering going the other way and avoiding always returning u64 on
the other side. From a purely aesthetic point of view I prefer the
int type because it seems more clear/standard C. I'm also not so interested
in optimizing the no-alu32 case, but I'm curious if there is a use case for
that?