Message-ID: <6908562f-4a99-44ea-bffb-19f33fcffe83@linux.dev>
Date: Mon, 5 Jan 2026 17:06:49 -0800
From: Ihor Solodrai <ihor.solodrai@...ux.dev>
To: Nathan Chancellor <nathan@...nel.org>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman
<eddyz87@...il.com>, Yonghong Song <yonghong.song@...ux.dev>,
bpf@...r.kernel.org, linux-kernel@...r.kernel.org, llvm@...ts.linux.dev
Subject: Re: [PATCH bpf-next] scripts/gen-btf.sh: Disable LTO when generating
initial .o file

On 1/5/26 3:46 PM, Nathan Chancellor wrote:
> On Mon, Jan 05, 2026 at 02:01:36PM -0800, Ihor Solodrai wrote:
>> Hi Nathan, thank you for the patch.
>>
>> I'm starting to think it wasn't a good idea to do
>>
>> echo "" | ${CC} ...
>>
>> here, given the number of associated bugs.
>
> Yeah, I was wondering if a lack of KBUILD_CPPFLAGS would also be a
> problem since that contains the endianness flag for some targets. I
> cannot imagine any more issues than that but I can understand wanting to
> back out of it.
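
For reference, if we were to keep the ${CC} approach, I suppose the
endianness issue could be addressed by also passing KBUILD_CPPFLAGS,
something like (untested):

  echo "" | ${CC} ${CLANG_FLAGS} ${KBUILD_CPPFLAGS} ${KBUILD_CFLAGS} \
          -c -x c -o ${btf_data} -
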
>
>> Before gen-btf.sh was introduced, the .btf.o binary was generated with this [1]:
>>
>> ${OBJCOPY} --only-section=.BTF --set-section-flags .BTF=alloc,readonly \
>> --strip-all ${1} "${btf_data}" 2>/dev/null
>>
>> I changed to ${CC} on the assumption that it's a quicker operation than
>> stripping the entire vmlinux. But maybe it's not worth it and we should
>> change back to --strip-all? wdyt?
>
> That certainly seems more robust to me. I see the logic but with
> '--only-section' and no glob, I would expect that to be a rather quick
> operation but I am running out of time today to test and benchmark such
> a change. I will try to do it tomorrow unless someone beats me to it.

I got curious and did a little experiment. Basically, I ran perf stat
on this part of gen-btf.sh:

  echo "" | ${CC} ${CLANG_FLAGS} ${KBUILD_CFLAGS} -c -x c -o ${btf_data} -
  ${OBJCOPY} --add-section .BTF=${ELF_FILE}.BTF \
          --set-section-flags .BTF=alloc,readonly ${btf_data}
  ${OBJCOPY} --only-section=.BTF --strip-all ${btf_data}

and, for comparison, the same thing with the ${CC} command replaced by:

  ${OBJCOPY} --strip-all "${ELF_FILE}" ${btf_data} 2>/dev/null
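
Each gen-btf.o_*.sh variant was just the commands above wrapped in a
standalone script with the variables hard-coded; roughly (a sketch,
variable values are illustrative, not the exact harness):

  #!/bin/sh
  # Sketch of the harness: CLANG_FLAGS and KBUILD_CFLAGS were dumped
  # from a real build, and CC/OBJCOPY were varied per experiment
  # (clang/gcc, llvm-objcopy/objcopy).
  CC=clang
  OBJCOPY=llvm-objcopy
  ELF_FILE=.tmp_vmlinux1
  btf_data=.btf.vmlinux.bin.o

  echo "" | ${CC} ${CLANG_FLAGS} ${KBUILD_CFLAGS} -c -x c -o ${btf_data} -
  ${OBJCOPY} --add-section .BTF=${ELF_FILE}.BTF \
          --set-section-flags .BTF=alloc,readonly ${btf_data}
  ${OBJCOPY} --only-section=.BTF --strip-all ${btf_data}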

TL;DR is that using ${CC} is:

* about 1.5x faster than GNU objcopy --strip-all .tmp_vmlinux1
* about 16x (!) faster than llvm-objcopy --strip-all .tmp_vmlinux1
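
(Taking the mean elapsed times from the perf output below, with the
clang run as the baseline: 0.123415s / 0.080237s ~ 1.54 for GNU
objcopy, and 1.35200s / 0.080237s ~ 16.9 for llvm-objcopy; the gcc
run was slightly faster still.)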

With the obvious caveats that this is a particular machine
(Threadripper PRO 3975WX), toolchain, etc.:

* clang version 21.1.7
* gcc (GCC) 15.2.1 20251211

This is bpf-next (a069190b590e) with a BPF CI-like kconfig.
Pasting the perf stat output below.

# llvm-objcopy --strip-all

$ perf stat -r 31 -- ./gen-btf.o_strip.sh

 Performance counter stats for './gen-btf.o_strip.sh' (31 runs):

    1,300,945,256  task-clock:u                # 0.962 CPUs utilized ( +- 0.10% )
                0  context-switches:u          # 0.000 /sec
                0  cpu-migrations:u            # 0.000 /sec
          327,311  page-faults:u               # 251.595 K/sec ( +- 0.00% )
    1,532,927,570  instructions:u              # 1.33 insn per cycle
                                               # 0.03 stalled cycles per insn ( +- 0.00% )
    1,155,639,083  cycles:u                    # 0.888 GHz ( +- 0.18% )
       53,144,866  stalled-cycles-frontend:u   # 4.60% frontend cycles idle ( +- 0.99% )
      297,229,466  branches:u                  # 228.472 M/sec ( +- 0.00% )
          903,337  branch-misses:u             # 0.30% of all branches ( +- 0.02% )

          1.35200 +- 0.00137 seconds time elapsed ( +- 0.10% )

# GNU objcopy --strip-all

$ perf stat -r 31 -- ./gen-btf.o_strip.sh

 Performance counter stats for './gen-btf.o_strip.sh' (31 runs):

      119,747,488  task-clock:u                # 0.970 CPUs utilized ( +- 0.41% )
                0  context-switches:u          # 0.000 /sec
                0  cpu-migrations:u            # 0.000 /sec
            9,186  page-faults:u               # 76.711 K/sec ( +- 0.01% )
      132,651,881  instructions:u              # 1.68 insn per cycle
                                               # 0.08 stalled cycles per insn ( +- 0.00% )
       79,191,259  cycles:u                    # 0.661 GHz ( +- 1.06% )
       10,136,981  stalled-cycles-frontend:u   # 12.80% frontend cycles idle ( +- 2.58% )
       28,422,807  branches:u                  # 237.356 M/sec ( +- 0.00% )
          354,981  branch-misses:u             # 1.25% of all branches ( +- 0.02% )

         0.123415 +- 0.000564 seconds time elapsed ( +- 0.46% )
# echo "" | clang ...
$ perf stat -r 31 -- ./gen-btf.o_llvm.sh
Performance counter stats for './gen-btf.o_llvm.sh' (31 runs):
62,107,490 task-clock:u # 0.774 CPUs utilized ( +- 0.31% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
9,755 page-faults:u # 157.066 K/sec ( +- 0.01% )
88,196,854 instructions:u # 1.18 insn per cycle
# 0.19 stalled cycles per insn ( +- 0.00% )
74,944,793 cycles:u # 1.207 GHz ( +- 0.50% )
16,494,448 stalled-cycles-frontend:u # 22.01% frontend cycles idle ( +- 0.48% )
17,914,949 branches:u # 288.451 M/sec ( +- 0.00% )
459,548 branch-misses:u # 2.57% of all branches ( +- 0.10% )
0.080237 +- 0.000313 seconds time elapsed ( +- 0.39% )
# echo "" | gcc ...
$ perf stat -r 31 -- ./gen-btf.o_gnu.sh
Performance counter stats for './gen-btf.o_gnu.sh' (31 runs):
53,683,797 task-clock:u # 0.770 CPUs utilized ( +- 0.33% )
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
8,390 page-faults:u # 156.286 K/sec ( +- 0.01% )
69,398,474 instructions:u # 1.22 insn per cycle
# 0.17 stalled cycles per insn ( +- 0.00% )
56,763,954 cycles:u # 1.057 GHz ( +- 0.39% )
12,103,546 stalled-cycles-frontend:u # 21.32% frontend cycles idle ( +- 0.47% )
14,064,366 branches:u # 261.985 M/sec ( +- 0.00% )
347,383 branch-misses:u # 2.47% of all branches ( +- 0.09% )
0.069735 +- 0.000253 seconds time elapsed ( +- 0.36% )
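
For completeness, changing back would mean replacing the three commands
above with the single pre-gen-btf.sh objcopy invocation quoted earlier,
i.e. something like (with ${1} spelled as ${ELF_FILE}):

  ${OBJCOPY} --only-section=.BTF --set-section-flags .BTF=alloc,readonly \
          --strip-all "${ELF_FILE}" ${btf_data} 2>/dev/null
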
>
> Cheers,
> Nathan