linux-kernel - Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user

Message-ID: <CAADnVQJ68X6NPYtEbQPXPM4pH1ZPg5iSrYi8c3EanL51SAW7zQ@mail.gmail.com>
Date: Mon, 8 Apr 2024 11:24:19 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Marco Elver <elver@...gle.com>
Cc: Andrii Nakryiko <andrii.nakryiko@...il.com>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, 
	Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>, 
	Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>, 
	KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>, 
	Jiri Olsa <jolsa@...nel.org>, Dmitry Vyukov <dvyukov@...gle.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, bpf <bpf@...r.kernel.org>, 
	"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>, linux-trace-kernel@...r.kernel.org, 
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <elver@...gle.com> wrote:
>
> On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <andrii.nakryiko@...il.com> wrote:
> >
> > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <elver@...gle.com> wrote:
> > >
> > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > <alexei.starovoitov@...il.com> wrote:
> [...]
> > > > and the tasks can use mmaped array shared across all or unique to each
> > > > process.
> > > > And both bpf and user space can read/write them with a single instruction.
> > >
> > > That's BPF_F_MMAPABLE, right?
> > >
> > > That does not work because the mmapped region is global. Our requirements are:

This does not sound like "requirements", but like a description of the
proposed solution.
Please share the actual use case.
This "tracing prog" sounds more like a ghost scheduler that
wants to interact with known user processes.

> > >
> > > 1. Single tracing BPF program.
> > >
> > > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > > thread just registers the same process-wide region).  No sharing
> > > between processes.
> > >
> > > 3. From #2 it follows: exec unregisters the registered memory region;
> > > fork gets a cloned region.
> > >
> > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > > not be able to use the bpf syscall.
> > >
> > > The reason for #2 is that each user space process also writes to the
> > > memory region (read by the BPF program to make updates depending on
> > > what state it finds), and having shared state between processes
> > > doesn't work here.
> > >
> > > Is there any reasonable BPF facility that can do this today? (If
> > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > > happy camper.)
> >
> > You could simulate something like this with multi-element ARRAY +
> > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> > processes, so it's not an exact fit.
>
> Right, for production use this is infeasible.

Last I heard, the ghost agent and a few important tasks can mmap a bpf
array and share it with the bpf prog.
So it is quite feasible.

>
> > But what seems to be much closer is using BPF task-local storage, if
> > we support mmap()'ing its memory into user-space. We've had previous
> > discussions on how to achieve this (the simplest being that
> > mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> > local storage). You won't get automatic cloning (you'd have to do that
> > from the BPF program on fork/exec tracepoint, for example), and within
> > the process you'd probably want to have just one thread (main?) to
> > mmap() initially and just share the pointer across all relevant
> > threads.
>
> In the way you imagine it, would that allow all threads sharing the
> same memory, despite it being task-local? Presumably each task's local
> storage would be mapped to just point to the same memory?
>
> > But this is a more generic building block, IMO. This relying
> > on BPF map also means pinning is possible and all the other BPF map
> > abstraction benefits.
>
> Deployment-wise it will make things harder because unprivileged
> processes still have to somehow get the map's shared fd somehow to
> mmap() it. Not unsolvable, and in general what you describe looks
> interesting, but I currently can't see how it will be simpler.

A bpf map can be pinned into bpffs for any unpriv process to access.
Then any task can bpf_obj_get it and mmap it.
If you have few such tasks then a bpf array will do.
If you have millions of tasks then use a bpf arena, which is a sparse array.
Use the pid as an index, or some other per-task id.
Both the bpf prog and all tasks can read/write such shared memory
with normal load/store instructions.
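
A rough user-space sketch of the array variant, under stated assumptions:
the pin path, struct task_slot and the helper names below are made up for
illustration, the map is assumed to be a BPF_MAP_TYPE_ARRAY created with
BPF_F_MMAPABLE and already pinned by whoever loaded the prog, and the
per-task id is assumed to be smaller than the number of slots. It uses the
raw bpf(2) syscall so it does not depend on libbpf.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/bpf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-task slot; it has to match the value layout that
 * the bpf prog declares for the array map. */
struct task_slot {
	uint64_t state;
};

/* Open a map pinned in bpffs via the raw bpf(2) syscall (BPF_OBJ_GET). */
static int pinned_map_fd(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	return (int)syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}

/* mmap() all slots of the pinned array. Returns NULL on any failure. */
static struct task_slot *map_slots(const char *path, size_t nr_slots)
{
	struct task_slot *base;
	int fd = pinned_map_fd(path);

	if (fd < 0)
		return NULL;
	base = mmap(NULL, nr_slots * sizeof(*base), PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	close(fd); /* the mapping stays valid after the fd is closed */
	return base == MAP_FAILED ? NULL : base;
}

/* Pick a task's slot by its per-task id; NULL if the id is out of range. */
static struct task_slot *slot_for(struct task_slot *base, size_t nr_slots,
				  size_t id)
{
	assert(base != NULL);
	return id < nr_slots ? &base[id] : NULL;
}
```

A task would call map_slots() once on a path such as
"/sys/fs/bpf/task_slots" (hypothetical), pick its slot with slot_for(),
and from then on update slot->state with a plain store. A raw pid only
works as the id when it is known to stay below the number of slots.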

> In absence of all that, is a safer "bpf_probe_write_user()" like I
> proposed in this patch ("bpf_probe_write_user_registered()") at all
> appealing?

To be honest, another "probe" variant is not appealing.
It's pretty much bpf_probe_write_user without pr_warn_ratelimited.
The main issue with bpf_probe_read/write_user() is their non-determinism.
They will error when memory is swapped out.
These helpers are ok-ish for observability when consumers understand
that some events might be lost, but for 24/7 production use
losing reads becomes a problem that a bpf prog cannot mitigate.
What is a bpf prog supposed to do when this safer bpf_probe_write_user errors?
Use some other mechanism to communicate with user space?
A mechanism with such builtin randomness in behavior is a footgun for
bpf users.
We have bpf_copy_from_user*() helpers that don't have this non-determinism.
We can introduce bpf_copy_to_user(), but it will be usable only
from sleepable bpf progs.
It sounds like you need it somewhere where the scheduler makes decisions,
so I suspect a bpf array or arena is a better fit.

Or something that extends bpf local storage map.
See long discussion:
https://lore.kernel.org/bpf/45878586-cc5f-435f-83fb-9a3c39824550@linux.dev/

I still like the idea of letting user tasks register memory in a
bpf local storage map: the kernel will pin such pages,
and then the bpf prog can read/write these regions directly.
In the bpf prog it will be:
ptr = bpf_task_storage_get(&map, task, ...);
if (ptr) { *ptr = ... }
and direct read/write into the same memory from user space.