[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fcec81dd-3bb9-7dcf-139d-847538b6ad20@fb.com>
Date: Wed, 3 Nov 2021 10:29:28 -0700
From: Yonghong Song <yhs@...com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Joe Burton <jevburton.kernel@...il.com>
CC: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <kafai@...com>,
Song Liu <songliubraving@...com>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>, <linux-kernel@...r.kernel.org>,
<netdev@...r.kernel.org>, <bpf@...r.kernel.org>,
Petar Penkov <ppenkov@...gle.com>,
Stanislav Fomichev <sdf@...gle.com>,
Joe Burton <jevburton@...gle.com>
Subject: Re: [RFC PATCH v3 0/3] Introduce BPF map tracing capability
On 11/2/21 5:12 PM, Alexei Starovoitov wrote:
> On Tue, Nov 02, 2021 at 02:14:29AM +0000, Joe Burton wrote:
>> From: Joe Burton <jevburton@...gle.com>
>>
>> This is the third version of a patch series implementing map tracing.
>>
>> Map tracing enables executing BPF programs upon BPF map updates. This
>> might be useful to perform upgrades of stateful programs; e.g., tracing
>> programs can propagate changes to maps that occur during an upgrade
>> operation.
>>
>> This version uses trampoline hooks to provide the capability.
>> fentry/fexit/fmod_ret programs can attach to two new functions:
>> int bpf_map_trace_update_elem(struct bpf_map* map, void* key,
>> void* val, u32 flags);
>> int bpf_map_trace_delete_elem(struct bpf_map* map, void* key);
>>
>> These hooks work as intended for the following map types:
>> BPF_MAP_TYPE_ARRAY
>> BPF_MAP_TYPE_PERCPU_ARRAY
>> BPF_MAP_TYPE_HASH
>> BPF_MAP_TYPE_PERCPU_HASH
>> BPF_MAP_TYPE_LRU_HASH
>> BPF_MAP_TYPE_LRU_PERCPU_HASH
>>
>> The only guarantee about the semantics of these hooks is that they execute
>> before the operation takes place. We cannot call them with locks held
>> because the hooked program might try to acquire the same locks. Thus they
>> may be invoked in situations where the traced map is not ultimately
>> updated.
>>
>> The original proposal suggested exposing a function for each
>> (map type) x (access type). The problem I encountered is that e.g.
>> percpu hashtables use a custom function for some access types
>> (htab_percpu_map_update_elem) but a common function for others
>> (htab_map_delete_elem). Thus a userspace application would have to
>> maintain a unique list of functions to attach to for each map type;
>> moreover, this list could change across kernel versions. Map tracing is
>> easier to use with fewer functions, at the cost of tracing programs
>> being triggered more times.
>
> Good point about htab_percpu.
> The patches look good to me.
> Few minor bits:
> - pls don't use #pragma once.
> There was a discussion not too long ago about it and the conclusion
> was that let's not use it.
> It slipped into few selftest/bpf, but let's not introduce more users.
> - noinline is not needed in prototype.
> - bpf_probe_read is deprecated. Pls use bpf_probe_read_kernel.
>
> and thanks for detailed patch 3.
>
>> To prevent the compiler from optimizing out the calls to my tracing
>> functions, I use the asm("") trick described in gcc's
>> __attribute__((noinline)) documentation. Experimentally, this trick
>> works with clang as well.
>
> I think noinline is enough. I don't think you need that asm in there.
I tried a simple program using clang lto and the optimization
(optimizing away the call itself) doesn't happen.
[$ ~/tmp2] cat t1.c
__attribute__((noinline)) int foo() {
return 0;
}
[$ ~/tmp2] cat t2.c
extern int foo();
int main() {
return foo();
}
[$ ~/tmp2] cat run.sh
clang -flto=full -O2 t1.c t2.c -c
clang -flto=full -fuse-ld=lld -O2 t1.o t2.o -o a.out
[$ ~/tmp2] ./run.sh
[$ ~/tmp2] llvm-objdump -d a.out
...
0000000000201750 <foo>:
201750: 31 c0 xorl %eax, %eax
201752: c3 retq
201753: cc int3
201754: cc int3
201755: cc int3
201756: cc int3
201757: cc int3
201758: cc int3
201759: cc int3
20175a: cc int3
20175b: cc int3
20175c: cc int3
20175d: cc int3
20175e: cc int3
20175f: cc int3
0000000000201760 <main>:
201760: e9 eb ff ff ff jmp 0x201750 <foo>
I remember that even if a call is marked as noinline, the compiler might
still poke into the call to find some information for some optimization.
But I guess probably the callsite will be kept. Otherwise, it will be
considered as "inlining".
Joe, did you hit any issues, esp. with gcc lto?
>
> In parallel let's figure out how to do:
> SEC("fentry/bpf_map_trace_update_elem")
> int BPF_PROG(copy_on_write__update,
> struct bpf_map *map,
> struct allow_reads_key__old *key,
> void *value, u64 map_flags)
>
> It kinda sucks that bpf_probe_read_kernel is necessary to read key/values.
> It would be much nicer to be able to specify the exact struct for the key and
> access it directly.
> The verifier does this already for map iterator.
> It's 'void *' on the kernel side while iterator prog can cast this pointer
> to specific 'struct key *' and access it directly.
> See bpf_iter_reg->ctx_arg_info and btf_ctx_access().
>
> For fentry into bpf_map_trace_update_elem it's a bit more challenging,
> since it will be called for all maps and there is no way to statically
> check that specific_map->key_size is within prog->aux->max_rdonly_access.
>
> May be we can do a dynamic cast helper (simlar to those that cast sockets)
> that will check for key_size at run-time?
> Another alternative is to allow 'void *' -> PTR_TO_BTF_ID conversion
> and let inlined probe_read do the job.
>
Powered by blists - more mailing lists