Message-ID: <768c85a3-d307-28bb-761f-9cda9b414aae@iogearbox.net>
Date: Wed, 5 Oct 2022 20:21:20 +0200
From: Daniel Borkmann <daniel@...earbox.net>
To: sdf@...gle.com
Cc: bpf@...r.kernel.org, razor@...ckwall.org, ast@...nel.org,
andrii@...nel.org, martin.lau@...ux.dev, john.fastabend@...il.com,
joannelkoong@...il.com, memxor@...il.com, toke@...hat.com,
joe@...ium.io, netdev@...r.kernel.org
Subject: Re: [PATCH bpf-next 01/10] bpf: Add initial fd-based API to attach tc
BPF programs
On 10/5/22 7:56 PM, sdf@...gle.com wrote:
> On 10/05, Daniel Borkmann wrote:
>> On 10/5/22 2:55 AM, sdf@...gle.com wrote:
>> > On 10/05, Daniel Borkmann wrote:
>> [...]
>> >
>> > The series looks exciting, haven't had a chance to look deeply, will try
>> > to find some time this week.
>
>> Great, thanks!
>
>> > We've chatted briefly about priority during the talk, let's maybe discuss
>> > it here more?
>> >
>> > I, as a user, still really have no clue about what priority to use.
>> > We have this problem at tc, and we'll seemingly have the same problem
>> > here? I guess it's even more relevant in k8s because internally at G we
>> > can control the users.
>> >
>> > Is it worth at least trying to provide some default bands / guidance?
>> >
>> > For example, having SEC('tc/ingress') receive attach_priority=124 by
>> > default? Maybe we can even have something like 'tc/ingress_first' get
>> > attach_priority=1 and 'tc/ingress_last' with attach_priority=254?
>> > (the names are arbitrary, we can do something better)
>> >
>> > ingress_first/ingress_last can be used by some monitoring jobs. The rest
>> > can use default 124. If somebody really needs a custom priority, then they
>> > can manually use something around 124/2 if they need to trigger before the
>> > 'default' priority or 124+124/2 if they want to trigger after?
>> >
>> > Thoughts? Is it worth it? Do we care?
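To make the suggestion above concrete for the discussion, here is a rough
sketch of how a loader could map such section names to default priorities.
The section names and the 1/124/254 values are just the arbitrary ones
proposed above, and the tc_default_priority() helper is purely hypothetical;
none of this exists in libbpf today:

#include <string.h>

/* Hypothetical mapping, mirroring the suggestion above. */
static int tc_default_priority(const char *sec_name)
{
	if (strcmp(sec_name, "tc/ingress_first") == 0)
		return 1;	/* e.g. monitoring that wants to run first */
	if (strcmp(sec_name, "tc/ingress_last") == 0)
		return 254;	/* e.g. monitoring that wants to run last */
	return 124;		/* default band for everything else */
}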
>
>> I think guidance is needed, yes; I can add a few paragraphs to the libbpf
>> header file where we also have the tc BPF link API. I also had a brief
>> discussion around this with developers from datadog, since they use the
>> infra via tc BPF as well. Overall, it's a hard problem, and I don't think
>> there's a good generic solution. The '*_last' is implied by prio=0, so that
>> the kernel auto-allocates it, and for libbpf we could add an API for it
>> where the user does not need to specify a prio explicitly. The 'appending'
>> is reasonable to me: if an application explicitly requests to be added as
>> first (and e.g. enforces policy through tc BPF), but some other 3rd party
>> application prepends itself as first, it can bypass the former, which makes
>> it too easy to shoot yourself in the foot. Overall, the issue in tc land is
>> that ordering matters: skb packet data could be mangled (e.g. IPs NATed),
>> skb fields can be mangled, and we can have redirect actions (dev A vs. B).
>> The only way I'd see this being possible is if the verifier could somehow
>> annotate the prog when it didn't observe any writes to the skb and no
>> redirect was in play. Then you've kind of replicated constraints similar to
>> tracing, where the attachment can say that ordering doesn't matter if all
>> the progs are in the same style. Otherwise, explicit cooperation is needed,
>> as is the case today with the rest of tc (or as Toke did in libxdp) with
>> multi-attach. In the specific case I mentioned at LPC, it can be made to
>> work given that one of the two is only observing traffic at that layer,
>> e.g. it could get prepended if there is a guarantee that all return codes
>> are TC_ACT_UNSPEC so that there is no bypass and then you'll see all
>> traffic, or appended to see only traffic which made it past the policy. So
>> it all depends on the applications installing programs, but solving it
>> generically is not possible given ordering and conflicting actions. So,
>> imho, an _append() API for libbpf can be added along with guidance for
>> developers on when to use _append() vs. an explicit prio.
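For reference, a minimal sketch of how priority is passed with today's
netlink-based libbpf tc API (bpf_tc_hook_create/bpf_tc_attach): the caller
either picks an explicit prio or passes 0 to let the kernel choose one, and
libbpf reports back the value that ended up being used. This only
illustrates the existing interface, not the new fd-based link API from this
series, and the 0-means-auto behaviour there does not necessarily match the
'_last' semantics described above:

#include <errno.h>
#include <bpf/libbpf.h>

/* Attach prog_fd to clsact ingress on ifindex without choosing a prio. */
static int attach_auto_prio(int ifindex, int prog_fd)
{
	LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = ifindex,
		    .attach_point = BPF_TC_INGRESS);
	LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = prog_fd,
		    .priority = 0 /* 0: kernel picks; or set explicitly */);
	int err;

	err = bpf_tc_hook_create(&hook);
	if (err && err != -EEXIST)
		return err;

	err = bpf_tc_attach(&hook, &opts);
	if (err)
		return err;

	/* libbpf fills in the priority that was actually used */
	return opts.priority;
}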
>
> Agreed, it's a hard problem to solve, especially from the kernel side.
> Ideally, as Toke mentions on the side thread, there should be some kind
> of system daemon or some other place where the ordering is described.
> But let's start with at least some guidance on the current prio.
>
> Might it also be a good idea to narrow down the prio range to 0-65k for
> now? Maybe in the future we'll have some special PRIO_MONITORING_BEFORE_ALL
> and PRIO_MONITORING_AFTER_ALL that trigger regardless of TC_ACT_UNSPEC?
> I agree with Toke that it's another problem with the current action-based
> chains that's worth solving somehow (compared to, say, cgroup programs).
Makes sense, I'll restrict the range so there's headroom for future
extensions; the mentioned 0-65k range looks very reasonable to me.
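Roughly along these lines, just as a sketch (the field and constant names
are placeholders, not what the next revision will actually use):

#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical range check on the attach path. */
#define TC_BPF_PRIO_MAX	65535

static int tc_bpf_check_prio(u32 prio)
{
	/* prio 0 keeps its auto-allocation meaning; everything above the
	 * cap is rejected so the upper range stays free for future
	 * extensions.
	 */
	if (prio > TC_BPF_PRIO_MAX)
		return -EINVAL;
	return 0;
}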
Thanks,
Daniel