Open Source and information security mailing list archives
 
Message-ID: <CAADnVQ+--s_zGdRg4VHv3H317dCrx_+nEGH7FNYzdywkdh3n-A@mail.gmail.com>
Date: Thu, 12 Jun 2025 17:25:07 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Pavel Begunkov <asml.silence@...il.com>, Andrii Nakryiko <andrii@...nel.org>
Cc: io-uring@...r.kernel.org, Martin KaFai Lau <martin.lau@...ux.dev>, 
	bpf <bpf@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC v2 5/5] io_uring/bpf: add basic kfunc helpers

On Thu, Jun 12, 2025 at 6:25 AM Pavel Begunkov <asml.silence@...il.com> wrote:
>
> On 6/12/25 03:47, Alexei Starovoitov wrote:
> > On Fri, Jun 6, 2025 at 6:58 AM Pavel Begunkov <asml.silence@...il.com> wrote:
> ...
> >> +__bpf_kfunc
> >> +struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
> >> +{
> >> +       struct io_rings *rings = ctx->rings;
> >> +       unsigned int mask = ctx->cq_entries - 1;
> >> +       unsigned head = rings->cq.head;
> >> +       struct io_uring_cqe *cqe;
> >> +
> >> +       /* TODO CQE32 */
> >> +       if (head == rings->cq.tail)
> >> +               return NULL;
> >> +
> >> +       cqe = &rings->cqes[head & mask];
> >> +       rings->cq.head++;
> >> +       return cqe;
> >> +}
> >> +
> >> +__bpf_kfunc_end_defs();
> >> +
> >> +BTF_KFUNCS_START(io_uring_kfunc_set)
> >> +BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, 0);
> >> +BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL);
> >> +BTF_KFUNCS_END(io_uring_kfunc_set)
> >
> > This is not safe in general.
> > The verifier doesn't enforce argument safety here.
> > As a minimum you need to add KF_TRUSTED_ARGS flag to all kfunc.
> > And once you do that you'll see that the verifier
> > doesn't recognize the cqe returned from bpf_io_uring_get_cqe*()
> > as trusted.
>
> Thanks, will add it. If I read it right, without the flag the
> program can, for example, create a struct io_ring_ctx on stack,
> fill it with nonsense and pass to kfuncs. Is that right?

No. The verifier will only allow a pointer to struct io_ring_ctx
to be passed, but it may not be fully trusted.

The verifier has 3 types of pointers to kernel structures:
1. ptr_to_btf_id
2. ptr_to_btf_id | trusted
3. ptr_to_btf_id | untrusted

The 1st was added long ago for tracing and gradually got adopted
for non-tracing needs, but it has a footgun: all pointer walks
keep the ptr_to_btf_id type. Following pointers is fine in some
cases, but not in all. Hence the 2nd variant was added; there a
foo->bar dereference needs to be explicitly allowed
instead of being allowed by default as with the 1st kind.

All loads through 1 and 3 are implemented as probe_read_kernel,
while loads from 2 are direct loads.

So kfuncs without KF_TRUSTED_ARGS with struct io_ring_ctx *ctx
argument are likely fine and safe, since it's impossible
to get this io_ring_ctx pointer by dereferencing some other pointer.
But it's better to tighten safety from the start.
We recommend KF_TRUSTED_ARGS for all kfuncs, and
eventually it will be the default.
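Applied to the registration table in the quoted patch, that would look like (sketch, untested):

```c
/* The quoted BTF_KFUNCS table with KF_TRUSTED_ARGS OR'ed into every
 * entry, as recommended above. */
BTF_KFUNCS_START(io_uring_kfunc_set)
BTF_ID_FLAGS(func, bpf_io_uring_submit_sqes, KF_SLEEPABLE | KF_TRUSTED_ARGS);
BTF_ID_FLAGS(func, bpf_io_uring_post_cqe, KF_SLEEPABLE | KF_TRUSTED_ARGS);
BTF_ID_FLAGS(func, bpf_io_uring_queue_sqe, KF_SLEEPABLE | KF_TRUSTED_ARGS);
BTF_ID_FLAGS(func, bpf_io_uring_get_cqe, KF_TRUSTED_ARGS);
BTF_ID_FLAGS(func, bpf_io_uring_extract_next_cqe, KF_RET_NULL | KF_TRUSTED_ARGS);
BTF_KFUNCS_END(io_uring_kfunc_set)
```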

> > Looking at your example:
> > https://github.com/axboe/liburing/commit/706237127f03e15b4cc9c7c31c16d34dbff37cdc
> > it doesn't care about contents of cqe and doesn't pass it further.
> > So sort-of ok-ish right now,
> > but if you need to pass cqe to another kfunc
> > you would need to add an open coded iterator for cqe-s
> > with appropriate KF_ITER* flags
> > or maybe add acquire/release semantics for cqe.
> > Like, get_cqe will be KF_ACQUIRE, and you'd need
> > matching KF_RELEASE kfunc,
> > so that 'cqe' is not lost.
> > Then 'cqe' will be trusted and you can pass it as actual 'cqe'
> > into another kfunc.
> > Without KF_ACQUIRE the verifier sees that get_cqe*() kfuncs
> > return 'struct io_uring_cqe *' and it's ok for tracing
> > or passing into kfuncs like bpf_io_uring_queue_sqe()
> > that don't care about a particular type,
> > but not ok for full tracking of objects.
>
> I don't need type safety for SQEs / CQEs, they're supposed to be simple
> memory blobs containing userspace data only. SQ / CQ are shared with
> userspace, and the kfuncs can leak the content of passed CQE / SQE to
> userspace. But I'd like to find a way to reject programs stashing
> kernel pointers / data into them.

That's impossible.
If you're worried about a bpf prog exposing kernel addresses
to user space then abort the whole thing.
CAP_PERFMON is required for the majority of bpf progs.

>
> BPF_PROG(name, struct io_ring_ctx *io_ring)
> {
>      struct io_uring_cqe *cqe = ...;
>      cqe->user_data = io_ring;
>      cqe->res = io_ring->private_field;
> }
>
> And I mentioned in the message, I rather want to get rid of half of the
> kfuncs, and give BPF direct access to the SQ/CQ instead. Schematically
> it should look like this:
>
> BPF_PROG(name, struct io_ring_ctx *ring)
> {
>      struct io_uring_sqe *sqes = get_SQ(ring);
>
>      sqes[ring->sq_tail].opcode = OP_NOP;
>      bpf_kfunc_submit_sqes(ring, 1);
>
>      struct io_uring_cqe *cqes = get_CQ(ring);
>      print_cqe(&cqes[ring->cq_head]);
> }
>
> I hacked up RET_PTR_TO_MEM for kfuncs, the diff is below, but it'd be
> great to get rid of the constness of the size argument. I need to
> digest arenas first as conceptually they look very close.

arena is a special memory region where every byte is writeable
by user space.

>
> > For next revision please post all selftest, examples,
> > and bpf progs on the list,
> > so people don't need to search github.
>
> Did the link in the cover letter not work for you? I'm confused
> since it's all in a branch in my tree, but you linked to the same
> patches but in Jens' tree, and I have zero clue what they're
> doing there or how you found them.

External links can disappear. That's not good for reviewers or
for keeping the history of the conversation.

>
> diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> index 9494e4289605..400a06a74b5d 100644
> --- a/io_uring/bpf.c
> +++ b/io_uring/bpf.c
> @@ -2,6 +2,7 @@
>   #include <linux/bpf_verifier.h>
>
>   #include "io_uring.h"
> +#include "memmap.h"
>   #include "bpf.h"
>   #include "register.h"
>
> @@ -72,6 +73,14 @@ struct io_uring_cqe *bpf_io_uring_extract_next_cqe(struct io_ring_ctx *ctx)
>         return cqe;
>   }
>
> +__bpf_kfunc
> +void *bpf_io_uring_get_region(struct io_ring_ctx *ctx, u64 size__retsz)
> +{
> +       if (size__retsz > ((u64)ctx->ring_region.nr_pages << PAGE_SHIFT))
> +               return NULL;
> +       return io_region_get_ptr(&ctx->ring_region);
> +}

and the bpf prog should be able to read/write anything in the
[ctx->ring_region->ptr, ..ptr + size] region?

Populating (creating) dynptr is probably better.
See bpf_dynptr_from*()

but what is the lifetime of that memory ?
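One possible shape, just to make the suggestion concrete (hypothetical sketch: bpf_io_uring_region_dynptr is a made-up name, a region-backed dynptr would likely need its own dynptr type rather than BPF_DYNPTR_TYPE_LOCAL, size truncation is glossed over, and the lifetime question above remains open):

```c
/* Hypothetical kfunc: expose the ring region to BPF as a dynptr so
 * the verifier can bounds-check reads/writes instead of the prog
 * holding a raw pointer. Illustrative only, not a real API. */
__bpf_kfunc
int bpf_io_uring_region_dynptr(struct io_ring_ctx *ctx,
			       struct bpf_dynptr *dynptr__uninit)
{
	struct bpf_dynptr_kern *kptr = (struct bpf_dynptr_kern *)dynptr__uninit;
	u64 size = (u64)ctx->ring_region.nr_pages << PAGE_SHIFT;
	void *ptr = io_region_get_ptr(&ctx->ring_region);

	if (!ptr) {
		bpf_dynptr_set_null(kptr);
		return -EFAULT;
	}
	/* BPF_DYNPTR_TYPE_LOCAL is a placeholder type here */
	bpf_dynptr_init(kptr, ptr, BPF_DYNPTR_TYPE_LOCAL, 0, size);
	return 0;
}
```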
