linux-kernel - Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzZCj=3hevf+Je=oed9Nisctotp_CX00NrLaO6_7+-0LSQ@mail.gmail.com>
Date: Fri, 5 Apr 2024 13:28:32 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Marco Elver <elver@...gle.com>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Andrii Nakryiko <andrii@...nel.org>, 
	Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>, Song Liu <song@...nel.org>, 
	Yonghong Song <yonghong.song@...ux.dev>, John Fastabend <john.fastabend@...il.com>, 
	KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>, 
	Jiri Olsa <jolsa@...nel.org>, Dmitry Vyukov <dvyukov@...gle.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, bpf <bpf@...r.kernel.org>, 
	"open list:DOCUMENTATION" <linux-doc@...r.kernel.org>, linux-trace-kernel@...r.kernel.org, 
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <elver@...gle.com> wrote:
>
> On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> <alexei.starovoitov@...il.com> wrote:
> >
> > On Thu, Apr 4, 2024 at 12:02 PM Marco Elver <elver@...gle.com> wrote:
> > >
> > > With all the known caveats, tracing BPF programs may directly write to
> > > user-space memory with the bpf_probe_write_user() helper. Memory safety
> > > is an obvious problem when using this helper, since it is too easy to
> > > overwrite memory across all running processes that user space did not
> > > expect to be touched (neither the verifier nor the kernel knows what may
> > > be touched). While it is possible to come up with mechanisms to safely
> > > communicate to the BPF program which memory region may be written to,
> > > there are no built-in guarantees of safety. For this reason, the helper
> > > produces a warning in the kernel log, and in newer kernels it is
> > > possible to disallow use of the helper since 51e1bb9eeaf7 ("bpf: Add
> > > lockdown check for probe_write_user helper").
> >
> > So is it a fix or a feature?
>
> Feature. The above paragraph is just an intro. Remove it?
>
> > > Nevertheless, direct user-space memory writes from BPF programs can be
> > > useful to efficiently manipulate and communicate with cooperating user
> > > space processes.
> >
> > But there are many different ways for bpf to communicate with user space:
> > perf ringbuf, bpf ringbug, various maps including mmap-ed array and arena.
> > The commit log doesn't explain why we need another one.
> >
> > > For example, one of our use cases are for events that happen relatively
> > > frequently in the kernel (e.g. specific scheduler events), but a set of
> > > user space threads want to check for such events in very hot code paths
> > > to make more optimal decisions (the cost of such a check can be no more
> > > than a load and compare). The types of events and heuristics used may
> > > change based on system environment and application, and a BPF program
> > > provides the best trade-offs in terms of performance and deployment.
> >
> > and the tasks can use mmaped array shared across all or unique to each
> > process.
> > And both bpf and user space can read/write them with a single instruction.
>
> That's BPF_F_MMAPABLE, right?
>
> That does not work because the mmapped region is global. Our requirements are:
>
> 1. Single tracing BPF program.
>
> 2. Per-process (per VM) memory region (here it's per-thread, but each
> thread just registers the same process-wide region).  No sharing
> between processes.
>
> 3. From #2 it follows: exec unregisters the registered memory region;
> fork gets a cloned region.
>
> 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> not be able to use the bpf syscall.
>
> The reason for #2 is that each user space process also writes to the
> memory region (read by the BPF program to make updates depending on
> what state it finds), and having shared state between processes
> doesn't work here.
>
> Is there any reasonable BPF facility that can do this today? (If
> BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> happy camper.)

You could simulate something like this with multi-element ARRAY +
BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
processes, so it's not an exact fit.

But what seems to be much closer is using BPF task-local storage, if
we support mmap()'ing its memory into user-space. We've had previous
discussions on how to achieve this (the simplest being that
mmap(task_local_map_fd, ...) maps current thread's part of BPF task
local storage). You won't get automatic cloning (you'd have to do that
from the BPF program on fork/exec tracepoint, for example), and within
the process you'd probably want to have just one thread (main?) to
mmap() initially and just share the pointer across all relevant
threads. But this is a more generic building block, IMO. This relying
on BPF map also means pinning is possible and all the other BPF map
abstraction benefits.


>
> bpf_probe_write_user() can, but safety is not built in, along with
> getting fork + exec right is brittle.
>
> > > To achieve better safety, introduce tagged user writable regions, that
> > > must explicitly be registered before tracing BPF programs may use them:
> > >
> > >  1. The prctl() option PR_BPF_REGISTER_WRITABLE allows any user space
> > >     process (that is allowed to use prctl()) to register tagged writable
> > >     memory regions for the current thread.
> > >
> > >  2. Conversely, the prctl() option PR_BPF_UNREGISTER_WRITABLE allows a
> > >     user space process to unregister a writable memory region that was
> > >     previously registered from the current thread. This must be done
> > >     before freeing memory if the thread that registered a region is
> > >     still running.
> > >
> > >  3. Tracing BPF programs may write to any registered region in the
> > >     current thread with bpf_probe_write_user_registered(). If the memory
> > >     region has been tagged with a non-zero value, the BPF program must
> > >     provide a matching tag.
> > >
> > > Admin capabilities are still required to attach BPF programs that use
> > > the new bpf_probe_write_user_registered() helper.
> >
> > We stopped adding new helpers ~2 years ago.
> > Only new kfuncs are allowed.
>
> Sure.
>
> > >
> > > With this interface, user space threads are guaranteed that no writes
> > > happen to regions that they did not explicitly register. Tagging can be
> > > used to associate additional semantics with the memory region.
> > >
> > > A note on tag allocation: Since such programs can only be installed by
> > > the local system administrator, tag allocation may be done by the system
> > > administrator. For example, by providing headers with tag definitions,
> > > or a central service to distribute tags to the BPF program loader and to
> > > user applications.
> >
> > Not clear how that's achieved in practice.
> > To do prctl(REGSISTER, ... tag)
> > the process will just pass this u32 tag.
> > There is no way for the admin or other process to enforce certain
> > tag usage.
> > Since there is no way to enforce extra tag seems like a weak
> > protection against something? What in particular?
>
> The main goal is to a) avoid accidental writes into areas the user
> space program doesn't want writing to, along with 2) weakly
> associating "type" via a tag. As soon as the user space program does
> prctl(REGISTER, tag) it wants writes to happen. It's the user space
> program's fault if it uses random tags, because in practice, we know
> exactly which BPF programs are running in production and which tags
> they service. The programs we do _care_ about are reviewed and do use
> the right tags. The mechanism is not for protecting BPF programs or
> what data they communicate, but for the benefit of the user space
> program itself to bound the memory that could be touched in its own
> address space (use the wrong region/tag .. tough luck).