Message-ID: <930f735f-cb35-41b9-9ab9-b3ee558b27ee@linux.dev>
Date: Sun, 18 Jan 2026 22:20:07 -0800
From: Yonghong Song <yonghong.song@...ux.dev>
To: Qiliang Yuan <realwujing@...il.com>, eddyz87@...il.com
Cc: andrii@...nel.org, ast@...nel.org, bpf@...r.kernel.org,
daniel@...earbox.net, haoluo@...gle.com, jolsa@...nel.org,
kpsingh@...nel.org, linux-kernel@...r.kernel.org, martin.lau@...ux.dev,
sdf@...ichev.me, song@...nel.org, yuanql9@...natelecom.cn
Subject: Re: [PATCH v3] bpf/verifier: optimize precision backtracking by
skipping precise bits
On 1/17/26 2:09 AM, Qiliang Yuan wrote:
> Optimize __mark_chain_precision() by skipping registers or stack slots
> that are already marked as precise. This prevents redundant history
> walks when multiple verification paths converge on the same state.
>
> Centralizing this check in __mark_chain_precision() improves efficiency
> for all entry points (mark_chain_precision, propagate_precision) and
> simplifies call-site logic.
>
> Performance Results (Extreme Stress Test on 32-core system):
> Under system-wide saturation (32 parallel veristat instances) using a
> high-stress backtracking payload (~290k insns, 34k states), this
> optimization demonstrated significant micro-architectural gains:
>
> - Total Retired Instructions: -82.2 Billion (-1.94%)
> - Total CPU Cycles: -161.3 Billion (-3.11%)
> - Avg. Insns per Verify: -17.2 Million (-2.84%)
> - Page Faults: -39.90% (Significant reduction in memory pressure)
>
> The massive reduction in page faults suggests that avoiding redundant
> backtracking significantly lowers memory subsystem churn during deep
> state history walks.
>
> Verified that total instruction and state counts (per veristat) remain
> identical across all tests, confirming logic equivalence.
>
> Suggested-by: Eduard Zingerman <eddyz87@...il.com>
> Signed-off-by: Qiliang Yuan <realwujing@...il.com>
> ---
> On Fri, Jan 16, 2026 at 11:27 PM Eduard Zingerman <eddyz87@...il.com> wrote:
>> As I said before, this is a useful change.
>>
>>> 4. bpf/verifier: optimize precision backtracking by skipping precise bits
>>> (https://lore.kernel.org/all/20260115152037.449362-1-realwujing@gmail.com/)
>>> Following your suggestion to refactor the logic into the core engine for
>>> better coverage and clarity, I have provided a v2 version of this patch here:
>>> (https://lore.kernel.org/all/20260116045839.23743-1-realwujing@gmail.com/)
>>> This v2 version specifically addresses your feedback by centralizing the
>>> logic and includes a comprehensive performance comparison (veristat results)
>>> in the commit log. It reduces the complexity of redundant backtracking
>>> requests from O(D) (where D is history depth) to O(1) by utilizing the
>>> 'precise' flag to skip already-processed states.
>> Same as with #1: using veristat duration metric, especially for such
>> small programs, is not a reasonable performance analysis.
> Link: https://lore.kernel.org/all/75807149f7de7a106db0ccda88e5d4439b94a1e7.camel@gmail.com/
>
> Hi Eduard,
>
> Acknowledged. To provide a more robust performance analysis, I have moved away
> from veristat duration and instead used hardware performance counters (perf stat)
> under system-wide saturation with a custom backtracking stress test. This
> demonstrates the optimization's hardware-level efficiency (retired instructions
> and page faults) more reliably.
>
> Best regards,
> Qiliang
>
> Test case (backtrack_stress.c):
> #include "vmlinux.h"
> #include <bpf/bpf_helpers.h>
>
> struct {
> 	__uint(type, BPF_MAP_TYPE_ARRAY);
> 	__uint(max_entries, 1);
> 	__type(key, __u32);
> 	__type(value, __u64);
> } dummy_map SEC(".maps");
>
> SEC("tc")
> int backtrack_stress(struct __sk_buff *skb)
> {
> 	__u32 key = 0;
> 	__u64 *val = bpf_map_lookup_elem(&dummy_map, &key);
> 	if (!val) return 0;
> 	__u64 x = *val;
>
> 	/* 1. Create a deep dependency chain to fill history for 'x' */
> 	x += 1; x *= 2; x -= 1; x ^= 0x55;
> 	x += 1; x *= 2; x -= 1; x ^= 0xAA;
> 	x += 1; x *= 2; x -= 1; x ^= 0x55;
> 	x += 1; x *= 2; x -= 1; x ^= 0xAA;
>
> 	/* 2. Create many states via conditional branches */
> #define CHECK_X(n) if (x == n) { x += 1; } if (x == n + 1) { x -= 1; }
> #define CHECK_X10(n) CHECK_X(n) CHECK_X(n+2) CHECK_X(n+4) CHECK_X(n+6) CHECK_X(n+8) \
> 	CHECK_X(n+10) CHECK_X(n+12) CHECK_X(n+14) CHECK_X(n+16) CHECK_X(n+18)
> #define CHECK_X100(n) CHECK_X10(n) CHECK_X10(n+20) CHECK_X10(n+40) CHECK_X10(n+60) CHECK_X10(n+80) \
> 	CHECK_X10(n+100) CHECK_X10(n+120) CHECK_X10(n+140) CHECK_X10(n+160) CHECK_X10(n+180)
>
> 	CHECK_X100(0)
> 	CHECK_X100(200)
> 	CHECK_X100(400)
> 	CHECK_X100(600)
> 	CHECK_X100(800)
> 	CHECK_X100(1000)
>
> 	/* 3. Trigger mark_chain_precision() multiple times on 'x' */
> #pragma clang loop unroll(full)
> 	for (int i = 0; i < 500; i++) {
> 		if (x == (2000 + i)) {
> 			x += 1;
> 		}
> 	}
>
> 	return x;
> }
>
> char _license[] SEC("license") = "GPL";
>
> How to Test:
> -----------
> 1. Compile the BPF program (from kernel root):
> clang -O2 -target bpf \
> -I./tools/testing/selftests/bpf/ \
> -I./tools/lib/ \
> -I./tools/include/uapi/ \
> -I./tools/testing/selftests/bpf/include \
> -c backtrack_stress.c -o backtrack_stress.bpf.o
>
> 2. System-wide saturation profiling (32 cores):
> # Start perf in background
> sudo perf stat -a -- sleep 60 &
> # Start 32 parallel loops of veristat
> for i in {1..32}; do (while true; do ./veristat backtrack_stress.bpf.o > /dev/null; done &); done
>
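> 3. (Optional) Confirm logic equivalence between the two kernels. A minimal
>    sketch, assuming veristat's CSV output (-o csv) and compare (-C) modes
>    are available in this tree; the .csv file names are illustrative:
>    # On the baseline kernel:
>    ./veristat -o csv backtrack_stress.bpf.o > base.csv
>    # On the patched kernel:
>    ./veristat -o csv backtrack_stress.bpf.o > patched.csv
>    # The Insns and States columns should show no difference:
>    ./veristat -C base.csv patched.csv
>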
> Raw Performance Data:
> ---------------------
> Baseline (6.19.0-rc5-baseline, git commit 944aacb68baf):
> File Program Verdict Duration (us) Insns States Program size Jited size
> ---------------------- ---------------- ------- ------------- ------ ------ ------------ ----------
> backtrack_stress.bpf.o backtrack_stress success 197924 289939 34331 5437 28809
> ---------------------- ---------------- ------- ------------- ------ ------ ------------ ----------
>
> 1,388,149 context-switches # 722.5 cs/sec cs_per_second
> 1,921,399.69 msec cpu-clock # 32.0 CPUs CPUs_utilized
> 25,113 cpu-migrations # 13.1 migrations/sec migrations_per_second
> 8,108,516 page-faults # 4220.1 faults/sec page_faults_per_second
> 97,445,724,421 branch-misses # 8.1 % branch_miss_rate (50.07%)
> 903,852,287,721 branches # 470.4 M/sec branch_frequency (66.76%)
> 5,190,519,089,751 cpu-cycles # 2.7 GHz cycles_frequency (66.81%)
> 4,230,500,391,043 instructions # 0.8 instructions insn_per_cycle (66.76%)
> 1,853,856,616,836 stalled-cycles-frontend # 0.36 frontend_cycles_idle (66.52%)
>
> 60.031936126 seconds time elapsed
>
> Patched (6.19.0-rc5-optimized):
> File Program Verdict Duration (us) Insns States Program size Jited size
> ---------------------- ---------------- ------- ------------- ------ ------ ------------ ----------
> backtrack_stress.bpf.o backtrack_stress success 214600 289939 34331 5437 28809
> ---------------------- ---------------- ------- ------------- ------ ------ ------------ ----------
>
> 1,433,270 context-switches # 745.9 cs/sec cs_per_second
> 1,921,604.54 msec cpu-clock # 32.0 CPUs CPUs_utilized
> 22,795 cpu-migrations # 11.9 migrations/sec migrations_per_second
> 4,873,895 page-faults # 2536.4 faults/sec page_faults_per_second
> 97,038,959,375 branch-misses # 8.1 % branch_miss_rate (50.07%)
> 890,170,312,491 branches # 463.2 M/sec branch_frequency (66.76%)
> 5,029,192,994,167 cpu-cycles # 2.6 GHz cycles_frequency (66.81%)
> 4,148,237,426,723 instructions # 0.8 instructions insn_per_cycle (66.77%)
> 1,818,457,318,301 stalled-cycles-frontend # 0.36 frontend_cycles_idle (66.51%)
>
> 60.032523872 seconds time elapsed
>
> kernel/bpf/verifier.c | 27 +++++++++++++++++++++++++--
> 1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 3135643d5695..250f1dc0298e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4765,14 +4765,37 @@ static int __mark_chain_precision(struct bpf_verifier_env *env,
>  	 * slot, but don't set precise flag in current state, as precision
>  	 * tracking in the current state is unnecessary.
>  	 */
> -	func = st->frame[bt->frame];
>  	if (regno >= 0) {
> -		reg = &func->regs[regno];
> +		reg = &st->frame[bt->frame]->regs[regno];
>  		if (reg->type != SCALAR_VALUE) {
>  			verifier_bug(env, "backtracking misuse");
>  			return -EFAULT;
>  		}
> +		if (reg->precise)
> +			return 0;
>  		bt_set_reg(bt, regno);
> +	} else {
> +		for (fr = bt->frame; fr >= 0; fr--) {
> +			u32 reg_mask = bt_frame_reg_mask(bt, fr);
> +			u64 stack_mask = bt_frame_stack_mask(bt, fr);
> +			DECLARE_BITMAP(mask, 64);
> +
> +			func = st->frame[fr];
> +			if (reg_mask) {
> +				bitmap_from_u64(mask, reg_mask);
> +				for_each_set_bit(i, mask, 32) {
> +					if (func->regs[i].precise)
> +						bt_clear_frame_reg(bt, fr, i);
> +				}
> +			}
> +			if (stack_mask) {
> +				bitmap_from_u64(mask, stack_mask);
> +				for_each_set_bit(i, mask, 64) {
> +					if (func->stack[i].spilled_ptr.precise)
> +						bt_clear_frame_slot(bt, fr, i);
> +				}
> +			}
> +		}
>  	}
>
>  	if (bt_empty(bt))
Here is another data point. I applied this patch to the latest bpf-next
and added printk()'s like below:
...
+		if (reg->precise) {
+			printk("%s %s %d\n", __FILE__, __func__, __LINE__);
+			return 0;
+		}
...
+		for (fr = bt->frame; fr >= 0; fr--) {
+			u32 reg_mask = bt_frame_reg_mask(bt, fr);
+			u64 stack_mask = bt_frame_stack_mask(bt, fr);
+			DECLARE_BITMAP(mask, 64);
+
+			func = st->frame[fr];
+			if (reg_mask) {
+				bitmap_from_u64(mask, reg_mask);
+				for_each_set_bit(i, mask, 32) {
+					if (func->regs[i].precise) {
+						printk("%s %s %d\n", __FILE__, __func__, __LINE__);
+						bt_clear_frame_reg(bt, fr, i);
+					}
+				}
+			}
+			if (stack_mask) {
+				bitmap_from_u64(mask, stack_mask);
+				for_each_set_bit(i, mask, 64) {
+					if (func->stack[i].spilled_ptr.precise) {
+						printk("%s %s %d\n", __FILE__, __func__, __LINE__);
+						bt_clear_frame_slot(bt, fr, i);
+					}
+				}
+			}
I ran through the bpf selftests and didn't see a single printk dump from the above.
So it looks like this patch is a no-op for me.
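
A rough way to reproduce this check (a sketch only; it assumes the instrumented
kernel is booted and the selftests are built under tools/testing/selftests/bpf):

cd tools/testing/selftests/bpf
sudo ./test_progs > /dev/null 2>&1
# the grep pattern matches __func__ in the printk format used above
sudo dmesg | grep -c '__mark_chain_precision'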