netdev - BPF sk_lookup v5 - TCP SYN and UDP 0-len flood benchmarks

Message-ID: <87lficrm2v.fsf@cloudflare.com>
Date:   Tue, 18 Aug 2020 17:49:12 +0200
From:   Jakub Sitnicki <jakub@...udflare.com>
To:     bpf@...r.kernel.org
Cc:     netdev@...r.kernel.org, kernel-team@...udflare.com,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Andrii Nakryiko <andriin@...com>,
        Lorenz Bauer <lmb@...udflare.com>,
        Marek Majkowski <marek@...udflare.com>,
        Martin KaFai Lau <kafai@...com>, Yonghong Song <yhs@...com>
Subject: BPF sk_lookup v5 - TCP SYN and UDP 0-len flood benchmarks

I got around to re-running the flood benchmarks, mainly to confirm that
the introduction of the static key had the desired effect: users not
attaching BPF sk_lookup programs won't notice a performance hit in
Linux v5.9.

I also wanted to check for any unexpected bottlenecks when a BPF
sk_lookup program is attached, like the struct in6_addr copying that
turned out to be a bad idea in v1.

The test setup has already been covered in the cover letter for v1 of
the series, so I'm not going to repeat it here. Please take a look at
"Performance considerations" in [0].

The BPF program [1] used during the benchmarks has been updated to work
with the BPF sk_lookup uAPI in v5.

RX pps and CPU cycles events were recorded in 3 configurations:

 1. 5.8-rc7 w/o this BPF sk_lookup patch series (baseline),
 2. 5.8-rc7 with patches applied, but no SK_LOOKUP program attached,
 3. 5.8-rc7 with patches applied, and SK_LOOKUP program attached;
    BPF program [1] does a lookup in an LPM_TRIE map with 200 entries.
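
For readers unfamiliar with what a longest-prefix-match lookup returns,
here is a minimal userspace sketch in C. It is not the kernel's LPM_TRIE
implementation and not the code from [1]; it uses a plain linear scan,
and struct prefix_entry, prefix_mask() and lpm_lookup() are hypothetical
names made up for illustration. It only shows the matching rule: among
all prefixes that cover the address, the longest one wins.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: IPv4 prefix in host byte order plus a value. */
struct prefix_entry {
	uint32_t addr;
	uint8_t  prefixlen;	/* 0..32 */
	int      value;
};

static uint32_t prefix_mask(uint8_t prefixlen)
{
	/* Shifting a 32-bit value by 32 is undefined, so handle /0 apart. */
	return prefixlen == 0 ? 0 : ~(uint32_t)0 << (32 - prefixlen);
}

/* Linear-scan longest-prefix match; returns NULL when nothing covers addr. */
static const struct prefix_entry *
lpm_lookup(const struct prefix_entry *tbl, size_t n, uint32_t addr)
{
	const struct prefix_entry *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].prefixlen);

		if ((addr & mask) != (tbl[i].addr & mask))
			continue;	/* prefix does not cover addr */
		if (!best || tbl[i].prefixlen > best->prefixlen)
			best = &tbl[i];	/* keep the most specific match */
	}
	return best;
}
```

With entries for 10.0.0.0/8 and 10.1.0.0/16, a lookup of 10.1.2.3 would
pick the /16 entry, while 10.2.0.1 falls back to the /8 entry. A trie
gives the same answer without scanning every entry.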

RX pps measured with `ifpps -d <dev> -t 1000 --csv --loop` for 60 sec.

| tcp4 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 899,875 ± 1.0%         |        - |
| no SK_LOOKUP prog attached   | 889,798 ± 0.6%         |    -1.1% |
| with SK_LOOKUP prog attached | 868,885 ± 1.4%         |    -3.4% |

| tcp6 SYN flood               | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 823,364 ± 0.6%         |        - |
| no SK_LOOKUP prog attached   | 832,667 ± 0.7%         |     1.1% |
| with SK_LOOKUP prog attached | 820,505 ± 0.4%         |    -0.3% |

| udp4 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 2,486,313 ± 1.2%       |        - |
| no SK_LOOKUP prog attached   | 2,486,932 ± 0.4%       |     0.0% |
| with SK_LOOKUP prog attached | 2,340,425 ± 1.6%       |    -5.9% |

| udp6 0-len flood             | rx pps (mean ± sstdev) | Δ rx pps |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   | 2,505,270 ± 1.3%       |        - |
| no SK_LOOKUP prog attached   | 2,522,286 ± 1.3%       |     0.7% |
| with SK_LOOKUP prog attached | 2,418,737 ± 1.3%       |    -3.5% |
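
The Δ rx pps column is consistent with the relative change of the mean
against the baseline mean. A minimal sketch of that arithmetic in C,
assuming this is how the column was derived (rel_delta_pct() is a
hypothetical helper, not part of ifpps or of the benchmark scripts):

```c
/* Relative change of a measured mean against the baseline mean, in
 * percent. Negative values mean fewer packets per second than baseline. */
static double rel_delta_pct(double baseline, double measured)
{
	return (measured - baseline) / baseline * 100.0;
}
```

For the tcp4 row with the program attached,
rel_delta_pct(899875, 868885) is about -3.44, which rounds to the -3.4%
shown in the table.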

cpu-cycles measured with `perf record -F 999 --cpu 1-4 -g -- sleep 60`.

|                              |      cpu-cycles events |          |
| tcp4 SYN flood               | __inet_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  1.31% |        - |
| no SK_LOOKUP prog attached   |                  1.24% |    -0.1% |
| with SK_LOOKUP prog attached |                  2.59% |     1.3% |

|                              |      cpu-cycles events |          |
| tcp6 SYN flood               |  inet6_lookup_listener | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  1.28% |        - |
| no SK_LOOKUP prog attached   |                  1.22% |    -0.1% |
| with SK_LOOKUP prog attached |                  3.15% |     1.4% |

|                              |      cpu-cycles events |          |
| udp4 0-len flood             |      __udp4_lib_lookup | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  3.70% |        - |
| no SK_LOOKUP prog attached   |                  4.13% |     0.4% |
| with SK_LOOKUP prog attached |                  7.55% |     3.9% |

|                              |      cpu-cycles events |          |
| udp6 0-len flood             |      __udp6_lib_lookup | Δ events |
|------------------------------+------------------------+----------|
| 5.8-rc7 vanilla (baseline)   |                  4.94% |        - |
| no SK_LOOKUP prog attached   |                  4.32% |    -0.6% |
| with SK_LOOKUP prog attached |                  8.07% |     3.1% |

A couple of comments:

1. udp6 outperformed udp4 in our setup. The likely suspect is
   CONFIG_IP_FIB_TRIE_STATS which put fib_table_lookup at the top of
   perf report when it comes to cpu-cycles w/o counting children. It
   should have been disabled.

2. When a BPF sk_lookup program is attached, the hot spot remains
   copying data to populate the BPF context object before each program
   run.

   For example, snippet from perf annotate for __udp4_lib_lookup:

---8<---
         :                      rcu_read_lock();
         :                      run_array = rcu_dereference(net->bpf.run_array[NETNS_BPF_SK_LOOKUP]);
    0.01 :   ffffffff817f8624:       mov    0xd68(%r12),%rsi
         :                      if (run_array) {
    0.00 :   ffffffff817f862c:       test   %rsi,%rsi
    0.00 :   ffffffff817f862f:       je     ffffffff817f87a9 <__udp4_lib_lookup+0x2c9>
         :                      struct bpf_sk_lookup_kern ctx = {
    1.05 :   ffffffff817f8635:       xor    %eax,%eax
    0.00 :   ffffffff817f8637:       mov    $0x6,%ecx
    0.01 :   ffffffff817f863c:       movl   $0x110002,0x40(%rsp)
    0.00 :   ffffffff817f8644:       lea    0x48(%rsp),%rdi
   18.76 :   ffffffff817f8649:       rep stos %rax,%es:(%rdi)
    1.12 :   ffffffff817f864c:       mov    0xc(%rsp),%eax
    0.00 :   ffffffff817f8650:       mov    %ebp,0x48(%rsp)
    0.00 :   ffffffff817f8654:       mov    %eax,0x44(%rsp)
    0.00 :   ffffffff817f8658:       movzwl 0x10(%rsp),%eax
    1.21 :   ffffffff817f865d:       mov    %ax,0x60(%rsp)
    0.00 :   ffffffff817f8662:       movzwl 0x20(%rsp),%eax
    0.00 :   ffffffff817f8667:       mov    %ax,0x62(%rsp)
         :                      .sport          = sport,
         :                      .dport          = dport,
         :                      };
         :                      u32 act;
         :
         :                      act = BPF_PROG_SK_LOOKUP_RUN_ARRAY(run_array, ctx, BPF_PROG_RUN);
--->8---

   Looking at the RX pps drop, this is not something we're concerned
   with ATM. The overhead will drown in cycles burned in iptables, which
   were intentionally unloaded for the benchmark.

   If someone has an idea how to tune it, though, I'm all ears.
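
   To illustrate where the `rep stos` in the listing above comes from:
   in C, a designated initializer implicitly zero-initializes every
   member that is not named, so the compiler has to clear the rest of
   the object before storing the named fields. In the listing, %ecx is
   set to 6 and `rep stos %rax` then writes six 8-byte zero words, i.e.
   48 bytes. The sketch below uses a hypothetical stand-in struct
   (lookup_ctx_sketch; the field names and layout are assumptions, not
   the kernel's struct bpf_sk_lookup_kern) to show the same pattern in
   userspace. Whether a given compiler emits `rep stos` for it depends
   on the compiler and flags.

```c
#include <stdint.h>

/* Hypothetical stand-in for the context object; layout is an assumption. */
struct lookup_ctx_sketch {
	uint16_t family;
	uint16_t protocol;
	uint32_t saddr4;
	uint32_t daddr4;
	uint8_t  saddr6[16];
	uint8_t  daddr6[16];
	uint16_t sport;
	uint16_t dport;
	void    *selected;
};

static struct lookup_ctx_sketch
make_ctx(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
{
	/* saddr6, daddr6 and selected are not named below, so the language
	 * requires them to be zero-initialized on every call. */
	struct lookup_ctx_sketch ctx = {
		.family   = 2,	/* AF_INET */
		.protocol = 17,	/* IPPROTO_UDP */
		.saddr4   = saddr,
		.daddr4   = daddr,
		.sport    = sport,
		.dport    = dport,
	};

	return ctx;
}
```

   The constant 0x110002 stored in the listing matches this pattern:
   0x0002 is AF_INET and 0x11 is IPPROTO_UDP, written as one 32-bit
   store.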

Thanks,
-jkbs

[0] https://lore.kernel.org/bpf/20200506125514.1020829-1-jakub@cloudflare.com/
[1] https://github.com/majek/inet-tool/blob/master/ebpf/inet-kern.c