[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <a40bd241-6b31-1542-ff81-f4f615268d84@huawei.com>
Date: Thu, 13 Oct 2022 10:07:12 +0800
From: Xu Kuohai <xukuohai@...wei.com>
To: Daniel Borkmann <daniel@...earbox.net>,
Xu Kuohai <xukuohai@...weicloud.com>, <bpf@...r.kernel.org>,
<linux-arm-kernel@...ts.infradead.org>,
<linux-kernel@...r.kernel.org>
CC: Jean-Philippe Brucker <jean-philippe@...aro.org>,
Alexei Starovoitov <ast@...nel.org>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...gle.com>,
Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Zi Shen Lim <zlim.lnx@...il.com>,
Catalin Marinas <catalin.marinas@....com>,
Will Deacon <will@...nel.org>, <mark.rutland@....com>,
<revest@...omium.org>
Subject: Re: [RESEND PATCH bpf-next 1/2] bpf, arm64: Jit BPF_CALL to direct
call when possible
On 9/27/2022 10:01 PM, Xu Kuohai wrote:
> On 9/27/2022 4:29 AM, Daniel Borkmann wrote:
>> [ +Mark/Florent ]
>>
>> On 9/19/22 11:21 AM, Xu Kuohai wrote:
>>> From: Xu Kuohai <xukuohai@...wei.com>
>>>
>>> Currently BPF_CALL is always jited to indirect call, but when target is
>>> in the range of direct call, BPF_CALL can be jited to direct call.
>>>
>>> For example, the following BPF_CALL
>>>
>>> call __htab_map_lookup_elem
>>>
>>> is always jited to an indirect call:
>>>
>>> mov x10, #0xffffffffffff18f4
>>> movk x10, #0x821, lsl #16
>>> movk x10, #0x8000, lsl #32
>>> blr x10
>>>
>>> When the target is in the range of direct call, it can be jited to:
>>>
>>> bl 0xfffffffffd33bc98
>>>
>>> This patch does such jit when possible.
>>>
>>> 1. First pass, get the maximum jited image size. Since the jited image
>>> memory is not allocated yet, the distance between jited BPF_CALL
>>> instructon and call target is unknown, so jit all BPF_CALL to indirect
>>> call to get the maximum image size.
>>>
>>> 2. Allocate image memory with the size caculated in step 1.
>>>
>>> 3. Second pass, determine the jited address and size for every bpf instruction.
>>> Since image memory is now allocated and there is only one jit method for
>>> bpf instructions other than BPF_CALL, so the jited address for the first
>>> BPF_CALL is determined, so the distance to call target is determined, so
>>> the first BPF_CALL is determined to be jited to direct or indirect call,
>>> so the jited image size after the first BPF_CALL is determined. By analogy,
>>> the jited addresses and sizes for all subsequent BPF instructions are
>>> determined.
>>>
>>> 4. Last pass, generate the final image. The jump offset of jump instruction
>>> whose target is within the jited image is determined in this pass, since
>>> the target instruction address may be changed in step 3.
>>
>> Wouldn't this require similar convergence process like in x86-64 JIT? You state
>> the jump instructions are placed in step 4 because step 3 could have changed their
>> offsets, but then after step 4, couldn't also again the offsets have changed for
>> the target addresses from 3 again in some corner cases (given emit_a64_mov_i() is
>> used also in jump encoding)?
>>
>
> IIUC, the reason why there is a convergence process on x86 is that x86's jmp
> instruction length varies with the size of immediate part, so after immediate
> part is adjusted, the instruction length may change accordingly, and consequently
> cause the positions of subsequent instructions to change, which in turn causes
> the distance between instructions to change. However, arm64's instruction size
> is fixed to 4 bytes and does not change with immediate part changes. So adjusting
> the immediate part of arm64 jump instruction does not result in a change in
> instruction length or position.
>
> For BPF_CALL, arguments passed to emit_call() and emit_a64_mov_i() (if called)
> do not change in pass 3 and 4, so the jited result does not change. This is also
> true for other non-BPF_JMP instructions.
>
> So no convergence is required on arm64.
>
Hi Daniel,
I think I should make it more clear.
Please take a look at the following code snippet, which jits BPF_JMP instructions
to arm64 instructions.
The code can be divided into two parts: the part where instruction offset jmp_offset
is used and the part where jmp_offset is not used.
1. Lines 963-966 and lines 990-1028 use jmp_offset. We can see that no matter what
value of jmp_offset is, the jited result is emitted either at line 965 or at
line 1027, which is exactly one arm64 instruction, that is, the jited size is
always 4 bytes.
2. The other lines don't use jmp_offset. We can see that the input arguments,
including arguments passed to emit_a64_mov_i and emit_call, do not change in
pass 3 and pass 4, so the jited result also do not change.
961 /* JUMP off */
962 case BPF_JMP | BPF_JA:
963 jmp_offset = bpf2a64_offset(i, off, ctx);
964 check_imm26(jmp_offset);
965 emit(A64_B(jmp_offset), ctx);
966 break;
967 /* IF (dst COND src) JUMP off */
968 case BPF_JMP | BPF_JEQ | BPF_X:
969 case BPF_JMP | BPF_JGT | BPF_X:
970 case BPF_JMP | BPF_JLT | BPF_X:
971 case BPF_JMP | BPF_JGE | BPF_X:
972 case BPF_JMP | BPF_JLE | BPF_X:
973 case BPF_JMP | BPF_JNE | BPF_X:
974 case BPF_JMP | BPF_JSGT | BPF_X:
975 case BPF_JMP | BPF_JSLT | BPF_X:
976 case BPF_JMP | BPF_JSGE | BPF_X:
977 case BPF_JMP | BPF_JSLE | BPF_X:
978 case BPF_JMP32 | BPF_JEQ | BPF_X:
979 case BPF_JMP32 | BPF_JGT | BPF_X:
980 case BPF_JMP32 | BPF_JLT | BPF_X:
981 case BPF_JMP32 | BPF_JGE | BPF_X:
982 case BPF_JMP32 | BPF_JLE | BPF_X:
983 case BPF_JMP32 | BPF_JNE | BPF_X:
984 case BPF_JMP32 | BPF_JSGT | BPF_X:
985 case BPF_JMP32 | BPF_JSLT | BPF_X:
986 case BPF_JMP32 | BPF_JSGE | BPF_X:
987 case BPF_JMP32 | BPF_JSLE | BPF_X:
988 emit(A64_CMP(is64, dst, src), ctx);
989 emit_cond_jmp:
990 jmp_offset = bpf2a64_offset(i, off, ctx);
991 check_imm19(jmp_offset);
992 switch (BPF_OP(code)) {
993 case BPF_JEQ:
994 jmp_cond = A64_COND_EQ;
995 break;
996 case BPF_JGT:
997 jmp_cond = A64_COND_HI;
998 break;
999 case BPF_JLT:
1000 jmp_cond = A64_COND_CC;
1001 break;
1002 case BPF_JGE:
1003 jmp_cond = A64_COND_CS;
1004 break;
1005 case BPF_JLE:
1006 jmp_cond = A64_COND_LS;
1007 break;
1008 case BPF_JSET:
1009 case BPF_JNE:
1010 jmp_cond = A64_COND_NE;
1011 break;
1012 case BPF_JSGT:
1013 jmp_cond = A64_COND_GT;
1014 break;
1015 case BPF_JSLT:
1016 jmp_cond = A64_COND_LT;
1017 break;
1018 case BPF_JSGE:
1019 jmp_cond = A64_COND_GE;
1020 break;
1021 case BPF_JSLE:
1022 jmp_cond = A64_COND_LE;
1023 break;
1024 default:
1025 return -EFAULT;
1026 }
1027 emit(A64_B_(jmp_cond, jmp_offset), ctx);
1028 break;
1029 case BPF_JMP | BPF_JSET | BPF_X:
1030 case BPF_JMP32 | BPF_JSET | BPF_X:
1031 emit(A64_TST(is64, dst, src), ctx);
1032 goto emit_cond_jmp;
1033 /* IF (dst COND imm) JUMP off */
1034 case BPF_JMP | BPF_JEQ | BPF_K:
1035 case BPF_JMP | BPF_JGT | BPF_K:
1036 case BPF_JMP | BPF_JLT | BPF_K:
1037 case BPF_JMP | BPF_JGE | BPF_K:
1038 case BPF_JMP | BPF_JLE | BPF_K:
1039 case BPF_JMP | BPF_JNE | BPF_K:
1040 case BPF_JMP | BPF_JSGT | BPF_K:
1041 case BPF_JMP | BPF_JSLT | BPF_K:
1042 case BPF_JMP | BPF_JSGE | BPF_K:
1043 case BPF_JMP | BPF_JSLE | BPF_K:
1044 case BPF_JMP32 | BPF_JEQ | BPF_K:
1045 case BPF_JMP32 | BPF_JGT | BPF_K:
1046 case BPF_JMP32 | BPF_JLT | BPF_K:
1047 case BPF_JMP32 | BPF_JGE | BPF_K:
1048 case BPF_JMP32 | BPF_JLE | BPF_K:
1049 case BPF_JMP32 | BPF_JNE | BPF_K:
1050 case BPF_JMP32 | BPF_JSGT | BPF_K:
1051 case BPF_JMP32 | BPF_JSLT | BPF_K:
1052 case BPF_JMP32 | BPF_JSGE | BPF_K:
1053 case BPF_JMP32 | BPF_JSLE | BPF_K:
1054 if (is_addsub_imm(imm)) {
1055 emit(A64_CMP_I(is64, dst, imm), ctx);
1056 } else if (is_addsub_imm(-imm)) {
1057 emit(A64_CMN_I(is64, dst, -imm), ctx);
1058 } else {
1059 emit_a64_mov_i(is64, tmp, imm, ctx);
1060 emit(A64_CMP(is64, dst, tmp), ctx);
1061 }
1062 goto emit_cond_jmp;
1063 case BPF_JMP | BPF_JSET | BPF_K:
1064 case BPF_JMP32 | BPF_JSET | BPF_K:
1065 a64_insn = A64_TST_I(is64, dst, imm);
1066 if (a64_insn != AARCH64_BREAK_FAULT) {
1067 emit(a64_insn, ctx);
1068 } else {
1069 emit_a64_mov_i(is64, tmp, imm, ctx);
1070 emit(A64_TST(is64, dst, tmp), ctx);
1071 }
1072 goto emit_cond_jmp;
1073 /* function call */
1074 case BPF_JMP | BPF_CALL:
1075 {
1076 const u8 r0 = bpf2a64[BPF_REG_0];
1077 bool func_addr_fixed;
1078 u64 func_addr;
1079
1080 ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
1081 &func_addr, &func_addr_fixed);
1082 if (ret < 0)
1083 return ret;
1084 emit_call(func_addr, ctx);
1085 emit(A64_MOV(1, r0, A64_R(0)), ctx);
1086 break;
1087 }
1088 /* tail call */
1089 case BPF_JMP | BPF_TAIL_CALL:
1090 if (emit_bpf_tail_call(ctx))
1091 return -EFAULT;
1092 break;
1093 /* function return */
1094 case BPF_JMP | BPF_EXIT:
1095 /* Optimization: when last instruction is EXIT,
1096 simply fallthrough to epilogue. */
1097 if (i == ctx->prog->len - 1)
1098 break;
1099 jmp_offset = epilogue_offset(ctx);
1100 check_imm26(jmp_offset);
1101 emit(A64_B(jmp_offset), ctx);
1102 break;
In fact, what happens in step 3 and step 4 is almost the same as what happened in
pass 1 and pass 2 before this series, where there is no convergence either.
>>> Tested with test_bpf.ko and some arm64 working selftests, nothing failed.
>
> [...]
>
> .
Powered by blists - more mailing lists