linux-kernel - Re: [RFC PATCH bpf-next 4/6] bpf: Add bpf runtime hooks for tracking runtime acquire/release

Message-ID: <CAP01T76m7OP_u8C1hJMrpVqJGf77W00DE9qB-8Yq6Cd-BMQ=7g@mail.gmail.com>
Date: Sat, 1 Mar 2025 02:23:16 +0100
From: Kumar Kartikeya Dwivedi <memxor@...il.com>
To: Juntong Deng <juntong.deng@...look.com>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, John Fastabend <john.fastabend@...il.com>, 
	Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau <martin.lau@...ux.dev>, Eddy Z <eddyz87@...il.com>, 
	Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>, KP Singh <kpsingh@...nel.org>, 
	Stanislav Fomichev <sdf@...ichev.me>, Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>, 
	snorcht@...il.com, bpf <bpf@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH bpf-next 4/6] bpf: Add bpf runtime hooks for tracking
 runtime acquire/release

On Fri, 28 Feb 2025 at 20:00, Juntong Deng <juntong.deng@...look.com> wrote:
>
> On 2025/2/28 03:34, Alexei Starovoitov wrote:
> > On Thu, Feb 27, 2025 at 1:55 PM Juntong Deng <juntong.deng@...look.com> wrote:
> >>
> >> I have an idea, though not sure if it is helpful.
> >>
> >> (This idea is for the previous problem of holding references too long)
> >>
> >> My idea is to add a new KF_FLAG, like KF_ACQUIRE_EPHEMERAL, as a
> >> special reference that can only be held for a short time.
> >>
> >> When a bpf program holds such a reference, the bpf program will not be
> >> allowed to enter any new logic with uncertain runtime, such as bpf_loop
> >> and the bpf open coded iterator.
> >>
> >> (If the bpf program is already in a loop, then no problem, as long as
> >> the bpf program doesn't enter a new nested loop, since the bpf verifier
> >> guarantees that references must be released in the loop body)
> >>
> >> In addition, such references can only be acquired and released between a
> >> limited number of instructions, e.g., 300 instructions.
> >
> > Not much can be done with few instructions.
> > Number of insns is a coarse indicator of time. If there are calls
> > they can take a non-trivial amount of time.
>
> Yes, you are right, limiting the number of instructions is not
> a good idea.
>
> > People didn't like CRIB as a concept. Holding a _regular_ file refcnt for
> > the duration of the program is not a problem.
> > Holding special files might be, since they're not supposed to be held.
> > Like, is it safe to get_file() userfaultfd ? It needs in-depth
> > analysis and your patch didn't provide any confidence that
> > such analysis was done.
> >
>
> I understand, I will try to analyze it in depth.
>
> > Speaking of more in-depth analysis of the problem.
> > In the cover letter you mentioned bpf_throw and exceptions as
> > one of the ways to terminate the program, but there was another
> > proposal:
> > https://lpc.events/event/17/contributions/1610/
> >
> > aka accelerated execution or fast-execute.
> > After the talk at LPC there were more discussions and follow ups.
> >
> > Roughly the idea is the following,
> > during verification determine all kfuncs, helpers that
> > can be "sped up" and replace them with faster alternatives.
> > Like bpf_map_lookup_elem() can return NULL in the fast-execution version.
> > All KF_ACQUIRE | KF_RET_NULL can return NULL too.
> > bpf_loop() can end sooner.
> > bpf_*_iter_next() can return NULL,
> > etc
> >
> > Then at verification time create such a fast-execute
> > version of the program with 1-1 mapping of IPs / instructions.
> > When a prog needs to be cancelled replace return IP
> > to IP in fast-execute version.
> > Since all regs are the same, continuing in the fast-execute
> > version will release all currently held resources
> > and no need to have either run-time (like this patch set)
> > or exception style (resource descriptor collection of resources)
> > bookkeeping to release.
> > The program itself is going to release whatever it acquired.
> > bpf_throw does manual stack unwind right now.
> > No need for that either. Fast-execute will return back
> > all the way to the kernel hook via normal execution path.
> >
> > Instead of patching return IP in the stack,
> > we can text_poke_bp() the code of the original bpf prog to
> > jump to the fast-execute version at corresponding IP/insn.
> >
> > The key insight is that cancellation doesn't mean
> > that the prog stops completely. It continues, but with
> > an intent to finish as quickly as possible.
> > In practice it might be faster to do that
> > than walk your acquired hash table and call destructors.
> >
> > Another important bit is that control flow is unchanged.
> > Introducing new edge in a graph is tricky and error prone.
> >
> > All details need to be figured out, but so far it looks
> > to be the cleanest and least intrusive solution to program
> > cancellation.
> > Would you be interested in helping us design/implement it?
>
> This is an amazing idea.
>
> I am very interested in this.
>
> But I think we may not need a fast-execute version of the bpf program
> with 1-1 mapping.
>
> Since we are going to modify the code of the bpf program through
> text_poke_bp, we can directly modify all relevant CALL instructions in
> the bpf program, just like the BPF runtime hook does.

Cloning the text allows you to not make the modifications globally
visible, in case we want to support cancellations local to a CPU.
So there is a material difference.

You can argue for and against local/global cancellations, therefore it
seems we should not bind early to one specific choice and keep options
open.
It is tied to how one views BPF program execution: whether a single
execution of the program constitutes an isolated invocation, or
whether all invocations running in parallel should be affected by a
cancellation event.
The answer may lie in how the cancellation was triggered.
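
To make the local/global distinction concrete, here is a rough userspace
sketch (not kernel code). It assumes cancellation is modelled as "acquire
starts returning NULL", as in the fast-execute proposal quoted above. All
names in it (prog_image, cpu_image, cancel_local, cancel_global, run_prog)
are hypothetical, and a boolean stands in for the patched CALL
instructions. A local cancellation points one CPU at a private clone; a
global one changes the shared text that every CPU runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2 /* hypothetical CPU count for the sketch */

/* One copy of the program text; `cancelled` stands in for patched CALLs. */
struct prog_image {
	bool cancelled;
};

static struct prog_image shared_image;           /* text all CPUs run by default */
static struct prog_image clone_image = { true }; /* private fast-execute clone */
static struct prog_image *cpu_image[NR_CPUS] = { &shared_image, &shared_image };

static int live_refs; /* references currently held by the program */
static int object;    /* dummy resource handed out by acquire() */

/* Acquire helper: returns NULL once the image this CPU runs is cancelled. */
static void *acquire(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (cpu_image[cpu]->cancelled)
		return NULL;
	live_refs++;
	return &object;
}

static void release(void *ref)
{
	(void)ref;
	live_refs--;
}

/* Local cancellation: only this CPU switches to the clone. */
static void cancel_local(int cpu)
{
	cpu_image[cpu] = &clone_image;
}

/* Global cancellation: modify the shared text, visible to every CPU on it. */
static void cancel_global(void)
{
	shared_image.cancelled = true;
}

/*
 * The "program": hold an outer reference, then acquire/release an inner
 * one per iteration. cancel_at simulates a cancellation event arriving at
 * that iteration (negative means never). Returns completed iterations.
 */
static int run_prog(int cpu, int max_iters, int cancel_at)
{
	void *outer = acquire(cpu);
	int done = 0;

	if (!outer)
		return 0;
	for (int i = 0; i < max_iters; i++) {
		void *inner;

		if (i == cancel_at)
			cancel_local(cpu);
		inner = acquire(cpu);
		if (!inner) /* the NULL check the program must have anyway */
			break;
		done++;
		release(inner);
	}
	release(outer); /* normal exit path drops what is still held */
	return done;
}
```

Control flow is unchanged in both cases: the program leaves through its
existing NULL check and releases the outer reference itself, so no
separate bookkeeping is needed. Only the scope of the effect differs.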

Here's an anecdote:
At least when I was (or am) using this, and when I have assertions in
the program (to prove some verification property, some runtime
condition, or simply for my program logic), it was better if the
cancellation was local (and triggered synchronously on a throw). In
comparison, when I did cancellations on page faults into arena/heap
loads, the program was clearly broken, so it seemed better to rip it
out (but in my case I still chose to do that as a separate step, to
not mess with parallel invocations of the program that may still be
functioning correctly).

Unlike user space which has a clear boundary against the kernel, BPF
programs have side effects and can influence the kernel's control
flow, so "crashing" them has a semantic implication for the kernel.

>
> For example, when we need to cancel the execution of a bpf program,
> we can "live patch" the bpf program and replace the target address
> in all CALL instructions that call KF_ACQUIRE and bpf_*_iter_next()
> with the address of a stub function that always returns NULL.
>
> During the JIT process, we can record the locations of all CALL
> instructions that may potentially be "live patched".
>
> This seems not difficult to do. The location (ip) of the CALL
> instruction can be obtained by image + addrs[i - 1].
>
> BPF_CALL ip = ffffffffc00195f1, kfunc name = bpf_task_from_pid
> bpf_task_from_pid return address = ffffffffc00195f6
>
> I did a simple experiment to verify the feasibility of this method.
> In the above results, the return address of bpf_task_from_pid is
> the location after the CALL instruction (ip), which means that the
> ip recorded during the JIT process is correct.
>
> After I complete a full proof of concept, I will send out the patch
> series and let's see what happens.
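
For reference, the address arithmetic in the experiment above can be
sketched in plain userspace C. This is only an illustration under stated
assumptions: it assumes an x86-64 near CALL (opcode 0xE8 followed by a
32-bit relative displacement, 5 bytes in total, which matches the
...95f1 -> ...95f6 result), and it assumes addrs[i - 1] is the offset at
which the CALL bytes begin. The helper names are hypothetical, and a byte
buffer stands in for JITed text; real patching would go through
text_poke_bp() rather than memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define X86_CALL_OPCODE 0xE8 /* near CALL with rel32 operand */
#define X86_CALL_LEN    5    /* 1 opcode byte + 4 displacement bytes */

/* Location of the CALL for BPF insn i, per "image + addrs[i - 1]". */
static uint64_t call_ip(uint64_t image, const uint32_t *addrs, int i)
{
	return image + addrs[i - 1];
}

/* The return address is the first byte after the CALL instruction. */
static uint64_t call_return_address(uint64_t ip)
{
	return ip + X86_CALL_LEN;
}

/* rel32 is encoded relative to the return address, not to ip itself. */
static int32_t call_rel32(uint64_t ip, uint64_t target)
{
	int64_t delta = (int64_t)(target - call_return_address(ip));

	return (int32_t)delta;
}

/* Point an existing CALL at a new target (e.g. a stub returning NULL). */
static void retarget_call(uint8_t *insn, uint64_t ip, uint64_t target)
{
	int32_t rel = call_rel32(ip, target);

	assert(insn[0] == X86_CALL_OPCODE);
	memcpy(insn + 1, &rel, sizeof(rel));
}

/* Recover the absolute target from the encoded displacement. */
static uint64_t decode_call_target(const uint8_t *insn, uint64_t ip)
{
	int32_t rel;

	memcpy(&rel, insn + 1, sizeof(rel));
	return call_return_address(ip) + (int64_t)rel;
}
```

With a hypothetical image base of 0xffffffffc0019000 and addrs[i - 1] ==
0x5f1, call_ip() gives ffffffffc00195f1 and call_return_address() gives
ffffffffc00195f6, the two values reported above.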

We should also think about whether removing the exceptions support makes sense.
Since it's not complete upstream (in terms of releasing held resources), it
hasn't found much use (except whatever I tried to use it for).
There would be some exotic use cases (like using it to prove to the
verifier some precondition on some kernel resource), but that wouldn't
be a justification to keep it around.

One of the original use cases was asserting that a map return value is not NULL.
The most pressing case is already solved by making the verifier
smarter for array maps.

As such there may not be much value, so it might be better to just
drop that code altogether and simplify the verifier if this approach
seems viable and lands.
Since it's all exposed through kfuncs, there's no UAPI constraint.


>
> But it may take some time as I am busy with my university
> stuff recently.