linux-kernel - Re: [RFC PATCH v3 28/37] WIP: bpf: Add fuse_ops struct

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+PiJmTQ-mpSywa-P6aW5VSUQu-49XtxNDMZP4Ks84UtVjDsNA@mail.gmail.com>
Date:   Tue, 2 May 2023 18:53:25 -0700
From:   Daniel Rosenberg <drosen@...gle.com>
To:     Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc:     Miklos Szeredi <miklos@...redi.hu>, bpf@...r.kernel.org,
        Alexei Starovoitov <ast@...nel.org>,
        Amir Goldstein <amir73il@...il.com>,
        linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-unionfs@...r.kernel.org,
        Daniel Borkmann <daniel@...earbox.net>,
        John Fastabend <john.fastabend@...il.com>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <martin.lau@...ux.dev>,
        Song Liu <song@...nel.org>, Yonghong Song <yhs@...com>,
        KP Singh <kpsingh@...nel.org>,
        Stanislav Fomichev <sdf@...gle.com>,
        Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
        Shuah Khan <shuah@...nel.org>,
        Jonathan Corbet <corbet@....net>,
        Joanne Koong <joannelkoong@...il.com>,
        Mykola Lysenko <mykolal@...com>, kernel-team@...roid.com
Subject: Re: [RFC PATCH v3 28/37] WIP: bpf: Add fuse_ops struct_op programs

On Wed, Apr 26, 2023 at 9:18 PM Andrii Nakryiko
<andrii.nakryiko@...il.com> wrote:
>
> Have you considered grouping this huge amount of callbacks into a
> smaller set of more generic callbacks where each callback would get
> enum argument specifying what sort of operation it is called for? This
> has many advantages, starting from not having to deal with struct_ops
> limits, ending with not needing to instantiate dozens of individual
> BPF programs.
>
> E.g., for a lot of operations the difference between pre- and
> post-filter is in having in argument as read-only and maybe having
> extra out argument for post-filter. One way to unify such post/pre
> filters into one callback would be to record whether in has to be
> read-only  or read-write and not allow to create r/w dynptr for the
> former case. Pass bool or enum specifying if it's post or pre filter.
> For that optional out argument, you can simulate effectively the same
> by always supplying it, but making sure that out parameter is
> read-only and zero-sized, for example.
>
> That would cut the number of callbacks in two, which I'd say still is
> not great :) I think it would be better still to have even larger
> groups of callbacks for whole families of operations with the same (or
> "unifiable") interface (domain experts like you would need to do an
> analysis here to see what makes sense to group, of course).
>
> We'll probably touch on that tomorrow at BPF office hours, but I
> wanted to point this out beforehand, so that you have time to think
> about it.
>

The meta info struct we pass in includes the opcode which contains
whether it is a prefilter or postfilter, although I guess that may be
less accessible to the verifier than a separate bool. In the v1
version, we handled all op codes in a single program, although I think
we were running into some slowdowns when we had every opcode in a
giant switch statement, plus we were incurring the cost of the bpf
program even when we didn't need to do anything in it. The struct_op
version lets us entirely skip calling the bpf for opcodes we don't
need to handle.

Many of the arguments we pass currently are structs. If they were all
dynptrs, we could set the output related ones to empty/readonly, but
that removes one of the other strengths of the struct_op setup, where
we can actually label the inputs as the structs they are instead of a
void* equivalent. There are definitely some cases where we could
easily merge opcode callbacks, like FUSE_FSYNCDIR/FUSE_FSYNC and
FUSE_OPEN/FUSE_OPENDIR. I set them up as separate since it's easy to
assign the same program to both callbacks in the case where you want
both to be handled the same, while maintaining flexibility to handle
them separately.

> +void bpf_fuse_get_rw_dynptr(struct fuse_buffer *buffer, struct bpf_dynptr_kern *dynptr__uninit, u64 size, bool copy)
>
> not clear why size is passed from outside instead of instantiating
> dynptr with buffer->size? See [0] for bpf_dynptr_adjust and
> bpf_dynptr_clone that allow you to adjust buffer as necessary.
>
> As for the copy parameter, can you elaborate on the idea behind it?
>
>   [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=741584&state=*
>

We're storing these buffers as fuse_buffers initially because of the
additional metadata we're carrying. Some fields have variable lengths,
or are backed by const data. For instance, names. If you wanted to
alter the name we use on the lower filesystem, you cannot change it
directly since it's being backed by the dentry name. If you wanted to
adjust something, like perhaps adding an extension, you would pass
bpf_fuse_get_rw_dynptr the size you'd want for the new buffer, and
copy=true to get the preexisting data. Fuse_buffer tracks that data
was allocated so Fuse can clean up after the call. Additionally, say
you wanted to trim half the data returned by an xattr for some reason.
You would give it a size less than the buffer size to inform fuse that
it should ignore the second half of the data. That part could be
handled by bpf_dynptr_adjust if we didn't also need to handle the
allocation case.
Say you wanted to have the lower file name be the hash of the one you
created. In that case, you could get bpf_fuse_get_ro_dynptr to get
access to compute the hash, and then bpf_fuse_get_rw_dynptr to get a
buffer to write the hash to. Since the data is not directly related to
the original data, there would be no benefit to getting a copy.

I initially intended for bpf_fuse_get_ro_dynptr/bpf_fuse_get_rw_dynptr
to be called at most once for each field, but that may be too
restrictive. At the moment, if you make two calls that require
reallocating, any pointers to the old buffer would be invalid. This is
not the case for the original name, as we aren't touching the original
source. There are two possible approaches here. I could either
refcount the buffer and have a put kfunc, or I could invalidate old
dynpointers when bpf_fuse_get_rw_dynptr is called, similar to what
skb/xdp do. I'm leaning towards the latter to disallow having many
allocations active at once by calling bpf_fuse_get_rw_dynptr for
increasing sizes, though I could also just disallow reallocating a
buffer that already was reallocated.

The new dynptr helpers are pretty exciting since they'll make it much
easier to deal with chunks of data, which we may end up doing in
read/write filters. I haven't fully set those up since I was waiting
to see what the dynptr helpers ended up looking like.


> > +void bpf_fuse_get_ro_dynptr(const struct fuse_buffer *buffer, struct bpf_dynptr_kern *dynptr__uninit)
>
> these kfuncs probably should be more consistently named as
> bpf_dynptr_from_fuse_buffer_{ro,rw}() ?
>
Yeah, that fits in much better with the skb/xdp functions.

> > +
> > +uint32_t bpf_fuse_return_len(struct fuse_buffer *buffer)
> > +{
> > +       return buffer->size;
>
> you should be able to get this with bpf_dynptr_size() (once you create
> it from fuse_buffer).
>

Yes, this might be unnecessary. I added it while testing kfuncs, and
had intended to use it with a fuse_buffer strncmp before I saw that
there's now a bpf_strncmp :) I had tried using it with
bpf_dynptr_slice, but that requires a known constant at verification
time, which may make using it in real cases a bit difficult...
bpf_strncmp also has some restrictions around the second string being
a fixed map, or something like that.

>
> you probably should be fine with just using bpf_tracing_func_proto as is
>
> > +       .is_valid_access        = bpf_fuse_is_valid_access,
>
> similarly, why custom no-op callback?
>

Those are largely carried over from iterations when I was less sure
what I would need. A lot of the work I was doing in the v1 code is
handled by default with the struct_op setup now, or is otherwise
unnecessary. This area in particular needs a lot of cleanup.

> > +static int bpf_fuse_init(struct btf *btf)
> > +{
> > +       s32 type_id;
> > +
> > +       type_id = btf_find_by_name_kind(btf, "fuse_buffer", BTF_KIND_STRUCT);
> > +       if (type_id < 0)
> > +               return -EINVAL;
> > +       fuse_buffer_struct_type = btf_type_by_id(btf, type_id);
> > +
>
> see BTF_ID and BTF_ID_LIST uses for how to get ID for your custom
> well-known type
>
Thanks, I'll look into those.