Message-ID: <CABRcYmLAzhG=o2wcBNBtFP34Aj3+eYsEMtMREDT7SqNzBc9-qw@mail.gmail.com>
Date:   Fri, 30 Jun 2023 19:20:42 +0200
From:   Florent Revest <revest@...omium.org>
To:     Puranjay Mohan <puranjay12@...il.com>
Cc:     ast@...nel.org, daniel@...earbox.net, andrii@...nel.org,
        martin.lau@...ux.dev, song@...nel.org, catalin.marinas@....com,
        mark.rutland@....com, bpf@...r.kernel.org, kpsingh@...nel.org,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH bpf-next v4 0/3] bpf, arm64: use BPF prog pack allocator
 in BPF JIT

On Mon, Jun 26, 2023 at 10:58 AM Puranjay Mohan <puranjay12@...il.com> wrote:
>
> BPF programs currently consume one page each on ARM64. On systems with many
> BPF programs, this adds significant pressure on the instruction TLB, and high
> iTLB pressure usually slows down the whole system.
>
> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue.
> It packs multiple BPF programs into a single huge page. It is currently only
> enabled for the x86_64 BPF JIT.
>
> This patch series enables the BPF prog pack allocator for the ARM64 BPF JIT.
>
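For readers unfamiliar with the prog pack API, the rough shape of how an arch
JIT uses it (modelled on the existing x86_64 integration) looks like the
fragment below. Prototypes are abbreviated, error handling is omitted, and the
variable/helper names are illustrative rather than the literal arm64 patch:

    struct bpf_binary_header *ro_header, *rw_header;
    u8 *ro_image, *rw_image;

    /* Carve the program out of a shared huge page: ro_header/ro_image point
     * into the read-only-execute pack, rw_header/rw_image into a temporary
     * writable buffer that the JIT emits into. */
    ro_header = bpf_jit_binary_pack_alloc(image_size, &ro_image, alignment,
                                          &rw_header, &rw_image,
                                          jit_fill_hole);

    /* ... emit the JITed instructions into rw_image ... */

    /* Switch the prog over to the read-only image: this copies the finished
     * instructions from the RW buffer into the ROX region through
     * bpf_arch_text_copy() (which this series would back with the new
     * aarch64_insn_copy() helper) and then frees the temporary RW copy. */
    bpf_jit_binary_pack_finalize(prog, ro_header, rw_header);
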
> ====================================================
> Performance Analysis of prog pack allocator on ARM64
> ====================================================
>
> To test the performance of the BPF prog pack allocator on ARM64, a stresser
> tool[2] was built. This tool loads 8 BPF programs on the system and triggers
> 5 of them in an infinite loop by doing system calls.
>
> The runner script starts 20 instances of the above, loading 8*20=160 BPF
> programs on the system, 5*20=100 of which are constantly triggered.
>
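The stresser in [2] is the reference; purely as an illustration of the
"trigger" side, a minimal loop like the hypothetical sketch below is enough to
keep syscall-attached BPF programs (and therefore their JITed images) hot:

    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
            /* Each getpid() entry/exit can fire BPF programs attached to the
             * syscall tracepoints, repeatedly executing their JITed code. */
            for (;;)
                    syscall(SYS_getpid);
    }
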
> In this environment, we build Python-3.8.4 with gcc-12.2.0 and collect the
> iTLB-related metrics for the compilation.
>
> The source code[3] is configured with the following command:
> ./configure --enable-optimizations --with-ensurepip=install
>
> Then the runner script is executed with the following command:
> ./run.sh "perf stat -e ITLB_WALK,L1I_TLB,INST_RETIRED,iTLB-load-misses -a make -j32"
>
> This builds Python while 160 BPF programs are loaded, with 100 of them being
> constantly triggered, and measures the iTLB-related metrics.
>
> The output of this command, before and after enabling the BPF prog pack
> allocator, is discussed below.
>
> The tests were run on qemu-system-aarch64 with 32 cpus, 4G memory, -machine virt,
> -cpu host, and -enable-kvm.
>
> Results
> -------
>
> Before enabling prog pack allocator:
> ------------------------------------
>
> Performance counter stats for 'system wide':
>
>          333278635      ITLB_WALK
>      6762692976558      L1I_TLB
>     25359571423901      INST_RETIRED
>        15824054789      iTLB-load-misses
>
>      189.029769053 seconds time elapsed
>
> After enabling prog pack allocator:
> -----------------------------------
>
> Performance counter stats for 'system wide':
>
>          190333544      ITLB_WALK
>      6712712386528      L1I_TLB
>     25278233304411      INST_RETIRED
>         5716757866      iTLB-load-misses
>
>      185.392650561 seconds time elapsed
>
> Improvements in metrics
> -----------------------
>
> Compilation time                             ---> 1.92% faster
> iTLB-load-misses/sec (Less is better)        ---> 63.16% decrease
> ITLB_WALK/1000 INST_RETIRED (Less is better) ---> 42.71% decrease
> ITLB_WALK/L1I_TLB (Less is better)           ---> 42.47% decrease
>
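These improvement figures follow directly from the raw counters above; a quick
re-derivation (values rounded):

    Compilation time:
        (189.030 - 185.393) / 189.030                       ~= 1.92% faster
    iTLB-load-misses/sec:
        before: 15824054789 / 189.030 ~= 83.7M/s
        after:   5716757866 / 185.393 ~= 30.8M/s            ~= 63.2% decrease
    ITLB_WALK per 1000 INST_RETIRED:
        before: 333278635 / 25359571423901 * 1000 ~= 0.0131
        after:  190333544 / 25278233304411 * 1000 ~= 0.0075 ~= 42.7% decrease
    ITLB_WALK / L1I_TLB:
        before: 333278635 / 6762692976558 ~= 4.93e-5
        after:  190333544 / 6712712386528 ~= 2.84e-5        ~= 42.5% decrease
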
> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/
> [2] https://github.com/puranjaymohan/BPF-Allocator-Bench
> [3] https://www.python.org/ftp/python/3.8.4/Python-3.8.4.tgz
>
> Changes in V3 => V4: Changes only in the 3rd patch
> 1. Fix the I-cache maintenance: Clean the data cache and invalidate the i-Cache
>    only *after* the instructions have been copied to the ROX region.
>
> Changes in V2 => V3: Changes only in the 3rd patch
> 1. Set prog = orig_prog; in the failure path of the
>    bpf_jit_binary_pack_finalize() call.
> 2. Add comments explaining the usage of the offsets in the exception table.
>
> Changes in V1 => V2:
> 1. Make the naming consistent in the 3rd patch:
>    ro_image and image
>    ro_header and header
>    ro_image_ptr and image_ptr
> 2. Use the names dst/src in place of addr/opcode in the second patch.
> 3. Add Acked-by: Song Liu <song@...nel.org> to the 1st and 2nd patches.
>
> Puranjay Mohan (3):
>   bpf: make bpf_prog_pack allocator portable
>   arm64: patching: Add aarch64_insn_copy()
>   bpf, arm64: use bpf_jit_binary_pack_alloc
>
>  arch/arm64/include/asm/patching.h |   1 +
>  arch/arm64/kernel/patching.c      |  39 ++++++++
>  arch/arm64/net/bpf_jit_comp.c     | 145 +++++++++++++++++++++++++-----
>  kernel/bpf/core.c                 |   8 +-
>  4 files changed, 165 insertions(+), 28 deletions(-)
>
> --
> 2.40.1
>
>

FWIW

Acked-by: Florent Revest <revest@...omium.org>

Thanks for this, Puranjay!
