Message-Id: <1553623539-15474-1-git-send-email-jiong.wang@netronome.com>
Date: Tue, 26 Mar 2019 18:05:23 +0000
From: Jiong Wang <jiong.wang@...ronome.com>
To: alexei.starovoitov@...il.com, daniel@...earbox.net
Cc: bpf@...r.kernel.org, netdev@...r.kernel.org,
oss-drivers@...ronome.com, Jiong Wang <jiong.wang@...ronome.com>,
"David S . Miller" <davem@...emloft.net>,
Paul Burton <paul.burton@...s.com>,
Wang YanQing <udknight@...il.com>,
Zi Shen Lim <zlim.lnx@...il.com>,
Shubham Bansal <illusionist.neo@...il.com>,
"Naveen N . Rao" <naveen.n.rao@...ux.ibm.com>,
Sandipan Das <sandipan@...ux.ibm.com>,
Martin Schwidefsky <schwidefsky@...ibm.com>,
Heiko Carstens <heiko.carstens@...ibm.com>,
Jakub Kicinski <jakub.kicinski@...ronome.com>
Subject: [PATCH/RFC bpf-next 00/16] bpf: eliminate zero extensions for sub-register writes
The eBPF ISA specification requires the high 32 bits to be cleared whenever
the low 32-bit sub-register is written. This applies to the destination
register of ALU32, LD_H/B/W etc. JIT back-ends must guarantee this semantic
when doing code-gen.
The x86-64 and arm64 ISAs have the same semantics, so the corresponding JIT
back-ends don't need to do extra work. However, 32-bit arches (arm, nfp
etc.) and some other 64-bit arches (powerpc, sparc etc.) need an explicit
zero extension sequence to meet this semantic.
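As a rough illustration of that cost, here is a plain user-space sketch
(not any JIT's actual code; the names are made up) of a 32-bit host
modelling a 64-bit BPF register as two native halves, which therefore has
to clear the high half on every sub-register write:

  #include <stdint.h>
  #include <stdio.h>

  /* Toy model of a 64-bit BPF register on a 32-bit host: two native
   * 32-bit halves.  Purely illustrative, not taken from any JIT.
   */
  struct reg64 {
          uint32_t lo;
          uint32_t hi;
  };

  /* An ALU32 write: per the eBPF ISA the high half must also be
   * cleared, so the JIT emits an extra native insn for it, always.
   */
  static void alu32_mov(struct reg64 *dst, uint32_t src)
  {
          dst->lo = src;
          dst->hi = 0;    /* the extra code-gen this series tries to avoid */
  }

  int main(void)
  {
          struct reg64 r1 = { .lo = 0x0, .hi = 0xffffffff };

          alu32_mov(&r1, 0x1234);
          /* prints r1 = 0x0000000000001234 */
          printf("r1 = 0x%08x%08x\n", r1.hi, r1.lo);
          return 0;
  }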
This is important, because for C code like the following:
  u64_value = (u64) u32_value
  ... other uses of u64_value
the compiler can exploit the semantic described above and omit the zero
extension when widening u32_value to u64_value; hardware, the runtime, or
the BPF JIT back-ends are then responsible for guaranteeing it. Some
benchmarks show ~40% of total insns are sub-register writes, meaning ~40%
extra code-gen, and the cost goes up further on arches that require two
shifts for zero extension. All of this is because the JIT back-end has to
emit the extra code-gen for every such instruction, unconditionally.
However, the zero extension is not always necessary: if the u32 value is
never cast into a u64, which is quite common in real-life programs, it is
wasted work. So it would be really good if we could identify the places
where such a cast actually happens and only do zero extension for them,
not for the others. This could save a lot of BPF code-gen.
Algo
====
We could use insn-scan based static analysis to tell whether a
sub-register def doesn't need zero extension. However, with such static
analysis we must make conservative assumptions at branching points, where
multiple uses could be introduced. So any sub-register def that is still
active at a branching point has to be marked as needing zero extension.
This introduces quite a few false alarms, for example ~25% on Cilium
bpf_lxc.
It is far better to use dynamic data-flow tracking, which the verifier
fortunately already has and which can easily be extended to serve the
purpose of this patch set:
- Record the indices of instructions that do a sub-register def (write).
  These indices need to stay with the function state so that path pruning
  and bpf-to-bpf function calls are handled properly. They are kept up to
  date while walking the insns.
- A full register read on an active sub-register def marks the def insn as
  needing zero extension on its dst register.
- A new sub-register write overrides the old one. A new full register
  write frees the register from zero extension on its dst register.
- When propagating register read64 during path pruning, also mark the def
  insns of any still-active sub-register defs for which a read64 is seen
  in the equivalent (pruned) state.
The core patch in this set is patch 4.
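To make the tracking concrete, here is a toy user-space model of the idea
(this is not the verifier code; the state layout and names are simplified
and made up for illustration):

  #include <stdbool.h>
  #include <stdio.h>

  #define MAX_REGS  11
  #define MAX_INSNS 64

  /* Per-frame tracking: for each register, the index of the insn that
   * last did a sub-register (32-bit) def, or -1 if the last def was a
   * full 64-bit write.
   */
  struct frame_state {
          int subreg_def[MAX_REGS];
  };

  static bool needs_zext[MAX_INSNS];      /* result of the analysis */

  static void track_write(struct frame_state *st, int reg, int insn_idx,
                          bool is64)
  {
          /* a 64-bit write frees the register from zero extension,
           * a 32-bit write records/overrides the active sub-register def
           */
          st->subreg_def[reg] = is64 ? -1 : insn_idx;
  }

  static void track_read(struct frame_state *st, int reg, bool is64)
  {
          /* a 64-bit read of a register whose active def was 32-bit
           * means that def insn really needs zero extension
           */
          if (is64 && st->subreg_def[reg] >= 0)
                  needs_zext[st->subreg_def[reg]] = true;
  }

  int main(void)
  {
          struct frame_state st;
          int i;

          for (i = 0; i < MAX_REGS; i++)
                  st.subreg_def[i] = -1;

          track_write(&st, 1, 0, false);  /* insn 0: w1 = w2   (ALU32 def)  */
          track_read(&st, 1, false);      /* insn 1: w3 += w1  (32-bit use) */
          track_write(&st, 4, 2, false);  /* insn 2: w4 = w5   (ALU32 def)  */
          track_read(&st, 4, true);       /* insn 3: r6 += r4  (64-bit use) */

          printf("insn 0 needs zext: %d\n", needs_zext[0]);  /* 0 */
          printf("insn 2 needs zext: %d\n", needs_zext[2]);  /* 1 */
          return 0;
  }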
Benchmark
=========
- I estimate the JITed image could be 25% smaller on average on all the
  affected arches (nfp, arm, x32, riscv, ppc, sparc, s390).
- The implementation is based on the existing register read liveness
  tracking infrastructure, so it is dynamic tracking that covers all
  possible code paths; therefore there shouldn't be any false alarms.
  For Cilium bpf_lxc, there are ~11500 insns in the compiled binary (built
  with the latest LLVM snapshot and -mcpu=v3 -mattr=+alu32 enabled), 4460
  of which have sub-register writes (~40%). Calculated by:
cat dump | grep -P "\tw" | wc -l (ALU32)
cat dump | grep -P "r.*=.*u32" | wc -l (READ_W)
cat dump | grep -P "r.*=.*u16" | wc -l (READ_H)
cat dump | grep -P "r.*=.*u8" | wc -l (READ_B)
  With this patch set enabled, 647 out of those 4460 are identified as
  really needing zero extension on the destination, so it is safe for JIT
  back-ends to eliminate zero extension for all the other instructions,
  which is ~85% of those sub-register write insns, or 33% of total insns.
  That is a significant saving.
  Of the 647 insns marked as needing zero extension, some are setting up
  u64 parameters for helper calls, and the remaining ones are those whose
  sub-register defs really do have 64-bit reads.
Testing
=======
A couple of unit tests have been added, including cases for path pruning
etc. It would be great if people on the affected host arches could apply
this patch set and then run the bpf selftests under sub-register code-gen
mode; test_progs, which contains quite a few C programs, would provide
very good coverage.
ToDo
====
- eBPF doesn't have a zero extension instruction, so an lshift plus rshift
  pair is used instead, meaning two native insns, while JIT back-ends
  could use a single truncate insn if they understood that the lshift +
  rshift pair is just doing zero extension (see the sketch after this
  list).
  There are three approaches to fix this:
  - Make a minor change to the back-end JIT hook and also pass aux_insn
    information to the back-ends, so they have per-insn information and
    can do the zero extension for the marked insns themselves using the
    most efficient native insn.
  - Introduce a zero extension insn for eBPF. Then the verifier could
    insert the new zext insn instead of lshift + rshift, and zext could
    be JITed more efficiently.
    NOTE: the existing MOV64 can't be used as the zero extension insn,
          because once the zext optimization is enabled, MOV64 doesn't
          necessarily clear the high 32 bits.
  - Otherwise, JIT back-ends need to do a peephole pass to catch the
    lshift + rshift pair and turn it into a native zext.
- In this set, for all JIT back-ends except NFP, I have only enabled the
  optimization for ALU32 instructions, though it could easily be enabled
  for load instructions as well. I could do a follow-up patch set for
  this.
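For reference, a small stand-alone C demonstration (semantics only, not
JIT code) of what the inserted lshift + rshift pair computes, versus the
single truncate a smarter back-end could use:

  #include <stdint.h>
  #include <stdio.h>

  /* What the verifier-inserted pair does on a 64-bit register:
   *     r1 <<= 32
   *     r1 >>= 32   (logical shift)
   */
  static uint64_t zext_via_shifts(uint64_t r)
  {
          r <<= 32;
          r >>= 32;
          return r;
  }

  /* Equivalent single truncate, which most ISAs can do in one insn. */
  static uint64_t zext_via_trunc(uint64_t r)
  {
          return (uint32_t)r;
  }

  int main(void)
  {
          uint64_t r1 = 0xdeadbeefcafebabeULL;

          /* both print 0x00000000cafebabe */
          printf("shifts: 0x%016llx\n",
                 (unsigned long long)zext_via_shifts(r1));
          printf("trunc : 0x%016llx\n",
                 (unsigned long long)zext_via_trunc(r1));
          return 0;
  }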
Reviews
=======
- Fixed the missing handling of callee-saved registers for bpf-to-bpf
  calls; sub-register defs therefore moved to the frame state. (Jakub Kicinski)
- Removed redundant "cross_reg". (Jakub Kicinski)
- Various coding styles & grammar fixes. (Jakub Kicinski, Quentin Monnet)
Cc: David S. Miller <davem@...emloft.net>
Cc: Paul Burton <paul.burton@...s.com>
Cc: Wang YanQing <udknight@...il.com>
Cc: Zi Shen Lim <zlim.lnx@...il.com>
Cc: Shubham Bansal <illusionist.neo@...il.com>
Cc: Naveen N. Rao <naveen.n.rao@...ux.ibm.com>
Cc: Sandipan Das <sandipan@...ux.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@...ibm.com>
Cc: Heiko Carstens <heiko.carstens@...ibm.com>
Cc: Jakub Kicinski <jakub.kicinski@...ronome.com>
Jiong Wang (16):
bpf: turn "enum bpf_reg_liveness" into bit representation
bpf: refactor propagate_live implementation
bpf: split read liveness into REG_LIVE_READ64 and REG_LIVE_READ32
bpf: mark sub-register writes that really need zero extension to high
bits
bpf: reduce false alarm by refining "enum bpf_arg_type"
bpf: new sysctl "bpf_jit_32bit_opt"
bpf: insert explicit zero extension instructions when
bpf_jit_32bit_opt is true
arm: bpf: eliminate zero extension code-gen
powerpc: bpf: eliminate zero extension code-gen
s390: bpf: eliminate zero extension code-gen
sparc: bpf: eliminate zero extension code-gen
x32: bpf: eliminate zero extension code-gen
riscv: bpf: eliminate zero extension code-gen
nfp: bpf: eliminate zero extension code-gen
selftests: bpf: new field "xlated_insns" for insn scan test after
verification
selftests: bpf: unit testcases for zero extension insertion pass
Documentation/sysctl/net.txt | 15 +
arch/arm/net/bpf_jit_32.c | 22 +-
arch/powerpc/net/bpf_jit_comp64.c | 7 +-
arch/riscv/net/bpf_jit_comp.c | 32 +-
arch/s390/net/bpf_jit_comp.c | 13 +-
arch/sparc/net/bpf_jit_comp_64.c | 8 +-
arch/x86/net/bpf_jit_comp32.c | 32 +-
drivers/net/ethernet/netronome/nfp/bpf/jit.c | 119 ++--
drivers/net/ethernet/netronome/nfp/bpf/main.h | 2 +
drivers/net/ethernet/netronome/nfp/bpf/verifier.c | 12 +
include/linux/bpf.h | 4 +
include/linux/bpf_verifier.h | 23 +-
include/linux/filter.h | 2 +
kernel/bpf/core.c | 18 +-
kernel/bpf/helpers.c | 2 +-
kernel/bpf/verifier.c | 216 +++++--
net/core/filter.c | 28 +-
net/core/sysctl_net_core.c | 9 +
tools/testing/selftests/bpf/test_verifier.c | 220 +++++++-
tools/testing/selftests/bpf/verifier/zext.c | 651 ++++++++++++++++++++++
20 files changed, 1286 insertions(+), 149 deletions(-)
create mode 100644 tools/testing/selftests/bpf/verifier/zext.c
--
2.7.4