netdev - Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210402152743.dbadpgcmrgjt4eca@apollo>
Date:   Fri, 2 Apr 2021 20:57:43 +0530
From:   Kumar Kartikeya Dwivedi <memxor@...il.com>
To:     Daniel Borkmann <daniel@...earbox.net>
Cc:     Andrii Nakryiko <andrii.nakryiko@...il.com>,
        bpf <bpf@...r.kernel.org>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>, Shuah Khan <shuah@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        open list <linux-kernel@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>,
        "open list:KERNEL SELFTEST FRAMEWORK" 
        <linux-kselftest@...r.kernel.org>
Subject: Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

On Fri, Apr 02, 2021 at 05:49:29AM IST, Daniel Borkmann wrote:
> On 3/31/21 11:44 AM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Mar 31, 2021 at 02:55:47AM IST, Daniel Borkmann wrote:
> > > Do we even need the _block variant? I would rather prefer to take the chance
> > > and make it as simple as possible, and only iff really needed extend with
> > > other APIs, for example:
> >
> > The block variant can be dropped, I'll use the TC_BLOCK/TC_DEV alternative which
> > sets parent_id/ifindex properly.
> >
> > >    bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS});
> > >
> > > Internally, this will create the sch_clsact qdisc & cls_bpf filter instance
> > > iff not present yet, and attach to a default prio 1 handle 1, and _always_ in
> > > direct-action mode. This is /as simple as it gets/ and we don't need to bother
> > > users with more complex tc/cls_bpf internals unless desired. For example,
> > > extended APIs could add prio/parent so that multi-prog can be attached to a
> > > single cls_bpf instance, but even that could be a second step, imho.
> >
> > I am not opposed to clsact qdisc setup if INGRESS/EGRESS is supplied (not sure
> > how others feel about it).
>
> What speaks against it? It would be 100% clear from API side where the prog is
> being attached. Same as with tc cmdline where you specify 'ingress'/'egress'.
>

Ok, I will add the qdisc setup in the next revision.

> > We could make direct_action mode default, and similarly choose prio
>
> To be honest, I wouldn't even support a mode from the lib/API side where direct_action
> is not set. It should always be forced to true. Everything else is rather broken
> setup-wise, imho, since it won't scale. We added direct_action a bit later to the
> kernel than original cls_bpf, but if I would do it again today, I'd make it the
> only available option. I don't see a reasonable use-case where you have it to false.
>

I'm all for doing that, but in some sense that also speaks against SCHED_ACT
support. Currently, you can load SCHED_ACT programs using this series, but not
really bind them to classifier. I left that option open to a future patch, it
would just reuse the existing tc_act_add_action helper (also why I kept it in
its own function). Maybe we need to reconsider that, if direct action is the
only recommended way going forward (to discourage people from using SCHED_ACT),
or just add opts to do all the setup in low level API, instead of leaving it
incomplete.

> > as 1 by default instead of letting the kernel do it. Then you can just pass in
> > NULL for bpf_tc_cls_opts and be close to what you're proposing. For protocol we
> > can choose ETH_P_ALL by default too if the user doesn't set it.
>
> Same here with ETH_P_ALL, I'm not sure anyone uses anything other than ETH_P_ALL,
> so yes, that should be default.

Ack.

>
> > With these modifications, the equivalent would look like
> > 	bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, INGRESS), NULL, &id);
>
> Few things compared to bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
>
> 1) nit, but why even 'cls' in the name. I think we shouldn't expose such old-days
>    tc semantics to a user. Just bpf_tc_attach() is cleaner/simpler to understand.

Since it would make it clear this is for SCHED_CLS progs, likewise bpf_tc_act_*
is for SCHED_ACT progs. Not opposed to changing the name.

> 2) What's the 'TC_DEV(ifindex, INGRESS)' macro doing exactly? Looks unnecessary,
>    why not regular args to the API?

It is very easy to support BLOCK (I know it's not really popular here, but I
think if supporting it just requires adding a macro, then we can go ahead). So
the user can use TC_BLOCK(block_idx) instead of remembering ifindex is to be set
to TCM_IFINDEX_MAGIC_BLOCK and parent_id to actual block index. It will just
expand to:

#define TC_BLOCK(block_idx) TCM_IFINDEX_MAGIC_BLOCK, (block_idx)

TC_DEV macro can be dropped, since user can directly pass ifindex and parent_id.

> 3) Exposed bpf_tc_attach() API could internally call a bpf_tc_attach_opts() API
>    with preset defaults, and the latter could have all the custom bits if the user
>    needs to go beyond the simple API, so from your bpf_tc_cls_attach() I'd also
>    drop the NULL.

Ok, this is probably better (but maybe we can do this for the high-level
bpf_program__attach that returns a bpf_link * instead of introducing yet
another function).

> 4) For the simple API I'd likely also drop the id (you could have a query API if
>    needed).
>

This would be fine, because it's not a fast path or anything, but right now we
return the id using the netlink response, otherwise for query we have to open
the socket, prepare the msg, send and recv again. So it's a minor optimization.

However, there's one other problem. In an earlier version of this series, I
didn't keep the id/index out parameters (to act as handle to the newly attached
filter/action). This lead to problems on query. Suppose a user doesn't properly
fill the opts during query (e.g. in case of filters). This means the netlink
dump includes all filters matching filled in attributes. If the prog_id for all
of these is same (e.g. all have same bpf classifier prog attached to them), it
becomes impossible to determine which one is the filter user asked for. It is
not possible to enforce filling in all kinds of attributes since some can be
left out and assigned by default in the kernel (priority, chain_index etc.). So
returning the newly created filter's id turned out to be the best option. This
is also used to stash filter related information in bpf_link to properly release
it later.

The same problem happens with actions, where we look up using the prog_id, we
multiple actions with different index can match on same prog_id. It is not
possible to determine which index corresponds to last loaded action.

So unless there's a better idea on how to deal with this, a query API won't work
for the case where same bpf prog is attached more than once. Returning the
id/index during attach seemed better than all other options we considered.

--
Kartikeya