netdev - Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAM0EoMmWM5ZBtRjPSjd8AU2SkzYxHGgdEixBxGpoWvkh0dbCWg@mail.gmail.com>
Date:   Fri, 7 Oct 2022 11:36:33 -0400
From:   Jamal Hadi Salim <jhs@...atatu.com>
To:     Daniel Borkmann <daniel@...earbox.net>
Cc:     bpf@...r.kernel.org, razor@...ckwall.org, ast@...nel.org,
        andrii@...nel.org, martin.lau@...ux.dev, john.fastabend@...il.com,
        joannelkoong@...il.com, memxor@...il.com, toke@...hat.com,
        joe@...ium.io, netdev@...r.kernel.org,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc
 BPF programs

On Thu, Oct 6, 2022 at 4:49 PM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> Hi Jamal,
>
> On 10/5/22 9:04 PM, Jamal Hadi Salim wrote:
> [...]

>
> Yes and no ;) In the specific example I gave there was an application bug that
> led to this race of one evicting the other, so it was not intentional and also
> not triggered on all the nodes in the cluster, but aside from the example, the
> issue is generic one for tc BPF users. Not fd ownership, but ownership of BPF
> link solves this as it does similarly for other existing BPF infra which is one
> of the motivations as outlined in patch 2 to align this for tc BPF, too.

Makes sense. I can see how noone would evict you with this; but it is still
a race for whoever gets installed first, no? i.e you still need an
arbitration scheme.
And if you have a good arbitration scheme you may not need the changes.

> > IIUC,  this is an issue of resource contention. Both users who have
> > root access think they should be prio 1. Kubernetes has no controls for this?
> > For debugging, wouldnt listening to netlink events have caught this?
> > I may be misunderstanding - but if both users took advantage of this
> > feature seems the root cause is still unresolved i.e  whoever gets there first
> > becomes the owner of the highest prio?
>
> This is independent of K8s core; system applications for observability, runtime
> enforcement, networking, etc can be deployed as Pods via kube-system namespace into
> the cluster and live in the host netns. These are typically developed independently
> by different groups of people. So it all depends on the use cases these applications
> solve, e.g. if you try to deploy two /main/ CNI plugins which both want to provide
> cluster networking, it won't fly and this is also generally understood by cluster
> operators, but there can be other applications also attaching to tc BPF for more
> specialized functions (f.e. observing traffic flows, setting EDT tstamp for subsequent
> fq, etc) and interoperability can be provided to a certain degree with prio settings &
> unspec combo to continue the pipeline. Netlink events would at best only allow to
> observe the rig being pulled from underneath us, but not prevent it compared to tc
> BPF links, and given the rise of BPF projects we see in K8s space, it's becoming
> more crucial to avoid accidental outage just from deploying a new Pod into a
> running cluster given tc BPF layer becomes more occupied.

I got it i think: seems like the granularity of resource control is
much higher then.
Most certainly you want protection against wild-west approach where everyone
wants to have the highest priority.


> > Other comments on just this patch (I will pay attention in detail later):
> > My two qualms:
> > 1) Was bastardizing all things TC_ACT_XXX necessary?
> > Maybe you could create #define somewhere visible which refers
> > to the TC_ACT_XXX?
>
> Optional as mentioned in the other thread. It was suggested having enums which
> become visible via vmlinux BTF as opposed to defines, so my thought was to lower
> barrier for new developers by making the naming and supported subset more obvious
> similar/closer to XDP case. I didn't want to pull in new header, but I can move it
> to pkt_cls.h.
>

I dont think those values will ever change - but putting them in the
same location
will make it easier to find.

> > 2) Why is xtc_run before tc_run()?
>
> It needs to be first in the list because its the only hook point that has an
> 'ownership' model in tc BPF layer. If its first we can unequivocally know its
> owner and ensure its never skipped/bypassed/removed by another BPF program either
> intentionally or due to users bugs/errors. If we put it after other hooks like cls_bpf
> we loose the statement because those hooks might 'steal', remove, alter the skb before
> the BPF link ones are executed.

I understand - its a generic problem in shared systems which from your
description
it seems kubernetes takes to another level.

> Other option is to make this completely flexible, to
> the point that Stan made, that is, tcf_classify() is just callback from the array at
> a fixed position and it's completely up to the user where to add from this layer,
> but we went with former approach.

I am going to read the the thread again. If you make it user definable where
tcf_classify() as opposed to some privilege that you are first in the code path
because you already planted your flag already, then we're all happy.
Let 1000 flowers bloom.

It's a contentious issue Daniel. You are fixing it only for ebpf - to
be precise only
for new users of ebpf who migrate to new interface and not for users
who are still
using the existing hooks. I havent looked closely, would it not have
worked to pass
the link info via some TLV to current tc code? That feels like it would be more
compatible with older code assuming the infra code in user space can
hide things,
so if someone doesnt specify their prio through something like bpftool
or tc then a
default of prio 0 gets sent and the kernel provides whatever that
reserved space it
uses today. And if they get clever they can specify a prio and it is a
race of who gets
there first.
I think this idea of having some object for ownership is great and i am hoping
it can be extended in general for tc; but we are going to need more granularity
for access control other than just delete (or create); example would it make
sense that permissions to add or delete table/filter/map entries could
be controled
this way? I'd be willing to commit resources if this was going to be done for tc
in general.

That aside:
We dont have this problem when it comes to hardware offloading because such
systems have very strict admin of control: there's typically a daemon in charge;
which by itself is naive in the sense someone with root could go underneath you
and do things - hence my interest in not just ownership but also access control.

cheers,
jamal