lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <CAM0EoMmWM5ZBtRjPSjd8AU2SkzYxHGgdEixBxGpoWvkh0dbCWg@mail.gmail.com> Date: Fri, 7 Oct 2022 11:36:33 -0400 From: Jamal Hadi Salim <jhs@...atatu.com> To: Daniel Borkmann <daniel@...earbox.net> Cc: bpf@...r.kernel.org, razor@...ckwall.org, ast@...nel.org, andrii@...nel.org, martin.lau@...ux.dev, john.fastabend@...il.com, joannelkoong@...il.com, memxor@...il.com, toke@...hat.com, joe@...ium.io, netdev@...r.kernel.org, Cong Wang <xiyou.wangcong@...il.com>, Jiri Pirko <jiri@...nulli.us> Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc BPF programs On Thu, Oct 6, 2022 at 4:49 PM Daniel Borkmann <daniel@...earbox.net> wrote: > > Hi Jamal, > > On 10/5/22 9:04 PM, Jamal Hadi Salim wrote: > [...] > > Yes and no ;) In the specific example I gave there was an application bug that > led to this race of one evicting the other, so it was not intentional and also > not triggered on all the nodes in the cluster, but aside from the example, the > issue is generic one for tc BPF users. Not fd ownership, but ownership of BPF > link solves this as it does similarly for other existing BPF infra which is one > of the motivations as outlined in patch 2 to align this for tc BPF, too. Makes sense. I can see how noone would evict you with this; but it is still a race for whoever gets installed first, no? i.e you still need an arbitration scheme. And if you have a good arbitration scheme you may not need the changes. > > IIUC, this is an issue of resource contention. Both users who have > > root access think they should be prio 1. Kubernetes has no controls for this? > > For debugging, wouldnt listening to netlink events have caught this? > > I may be misunderstanding - but if both users took advantage of this > > feature seems the root cause is still unresolved i.e whoever gets there first > > becomes the owner of the highest prio? > > This is independent of K8s core; system applications for observability, runtime > enforcement, networking, etc can be deployed as Pods via kube-system namespace into > the cluster and live in the host netns. These are typically developed independently > by different groups of people. So it all depends on the use cases these applications > solve, e.g. if you try to deploy two /main/ CNI plugins which both want to provide > cluster networking, it won't fly and this is also generally understood by cluster > operators, but there can be other applications also attaching to tc BPF for more > specialized functions (f.e. observing traffic flows, setting EDT tstamp for subsequent > fq, etc) and interoperability can be provided to a certain degree with prio settings & > unspec combo to continue the pipeline. Netlink events would at best only allow to > observe the rig being pulled from underneath us, but not prevent it compared to tc > BPF links, and given the rise of BPF projects we see in K8s space, it's becoming > more crucial to avoid accidental outage just from deploying a new Pod into a > running cluster given tc BPF layer becomes more occupied. I got it i think: seems like the granularity of resource control is much higher then. Most certainly you want protection against wild-west approach where everyone wants to have the highest priority. > > Other comments on just this patch (I will pay attention in detail later): > > My two qualms: > > 1) Was bastardizing all things TC_ACT_XXX necessary? > > Maybe you could create #define somewhere visible which refers > > to the TC_ACT_XXX? > > Optional as mentioned in the other thread. It was suggested having enums which > become visible via vmlinux BTF as opposed to defines, so my thought was to lower > barrier for new developers by making the naming and supported subset more obvious > similar/closer to XDP case. I didn't want to pull in new header, but I can move it > to pkt_cls.h. > I dont think those values will ever change - but putting them in the same location will make it easier to find. > > 2) Why is xtc_run before tc_run()? > > It needs to be first in the list because its the only hook point that has an > 'ownership' model in tc BPF layer. If its first we can unequivocally know its > owner and ensure its never skipped/bypassed/removed by another BPF program either > intentionally or due to users bugs/errors. If we put it after other hooks like cls_bpf > we loose the statement because those hooks might 'steal', remove, alter the skb before > the BPF link ones are executed. I understand - its a generic problem in shared systems which from your description it seems kubernetes takes to another level. > Other option is to make this completely flexible, to > the point that Stan made, that is, tcf_classify() is just callback from the array at > a fixed position and it's completely up to the user where to add from this layer, > but we went with former approach. I am going to read the the thread again. If you make it user definable where tcf_classify() as opposed to some privilege that you are first in the code path because you already planted your flag already, then we're all happy. Let 1000 flowers bloom. It's a contentious issue Daniel. You are fixing it only for ebpf - to be precise only for new users of ebpf who migrate to new interface and not for users who are still using the existing hooks. I havent looked closely, would it not have worked to pass the link info via some TLV to current tc code? That feels like it would be more compatible with older code assuming the infra code in user space can hide things, so if someone doesnt specify their prio through something like bpftool or tc then a default of prio 0 gets sent and the kernel provides whatever that reserved space it uses today. And if they get clever they can specify a prio and it is a race of who gets there first. I think this idea of having some object for ownership is great and i am hoping it can be extended in general for tc; but we are going to need more granularity for access control other than just delete (or create); example would it make sense that permissions to add or delete table/filter/map entries could be controled this way? I'd be willing to commit resources if this was going to be done for tc in general. That aside: We dont have this problem when it comes to hardware offloading because such systems have very strict admin of control: there's typically a daemon in charge; which by itself is naive in the sense someone with root could go underneath you and do things - hence my interest in not just ownership but also access control. cheers, jamal
Powered by blists - more mailing lists