Message-Id: <YrYj3LPaHV7thgJW@google.com>
Date: Fri, 24 Jun 2022 13:51:40 -0700
From: sdf@...gle.com
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org,
Cong Wang <cong.wang@...edance.com>,
"Toke Høiland-Jørgensen" <toke@...hat.com>,
Jamal Hadi Salim <jhs@...atatu.com>,
Jiri Pirko <jiri@...nulli.us>
Subject: Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc
On 06/01, Cong Wang wrote:
> From: Cong Wang <cong.wang@...edance.com>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.
> There are a few use cases:
> 1. Allow customizing Qdiscs in an easier way, so that people don't
> have to write a complete Qdisc kernel module just to experiment
> with a new queuing theory.
> 2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
> runs before enqueue, so it is impossible to adjust those "tokens" after
> packets get dropped during enqueue. With an eBPF Qdisc, this is easily
> solved with a map shared between clsact and sch_bpf (see the sketch
> right after this list).
> 3. Replace qevents, as now the user gains much more control over the
> skb and queues.
> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> chains and blocks to reuse TC filters, but they are too complicated
> to understand. With the eBPF helper bpf_skb_tc_classify(), we can
> invoke TC filters on _any_ Qdisc (even on a different netdev) to do
> the classification.
> 5. Potentially pave the way for ingress to queue packets, although the
> current implementation is still only for egress.
> 6. Possibly pave the way for handling the TCP protocol in TC, as the
> rbtree is already used by TCP to handle retransmissions.
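> To make use case 2 concrete, here is a minimal, hypothetical sketch:
> the shared map layout, the sch_bpf attach point name, and its return
> codes are all made up for illustration; only the clsact side below
> uses today's stable API.
>
> #include <linux/bpf.h>
> #include <linux/pkt_cls.h>
> #include <bpf/bpf_helpers.h>
>
> struct edt_state {
>         __u64 tokens;   /* token bucket credit, in bytes */
>         __u64 qlen;     /* packets currently held by sch_bpf */
> };
>
> struct {
>         __uint(type, BPF_MAP_TYPE_ARRAY);
>         __uint(max_entries, 1);
>         __type(key, __u32);
>         __type(value, struct edt_state);
>         __uint(pinning, LIBBPF_PIN_BY_NAME);  /* shared via bpffs */
> } edt_map SEC(".maps");
>
> /* clsact egress: charge the bucket when stamping the departure time. */
> SEC("tc")
> int edt_charge(struct __sk_buff *skb)
> {
>         __u32 key = 0;
>         struct edt_state *st = bpf_map_lookup_elem(&edt_map, &key);
>
>         if (!st)
>                 return TC_ACT_OK;
>         __sync_fetch_and_sub(&st->tokens, (__u64)skb->len);
>         /* ... derive skb->tstamp from st->tokens here ... */
>         return TC_ACT_OK;
> }
>
> /* sch_bpf enqueue (attach point and return codes are hypothetical):
>  * refund the tokens whenever the packet is dropped instead of queued,
>  * so the clsact side does not over-throttle.
>  */
> SEC("sch_bpf/enqueue")
> int edt_enqueue(struct __sk_buff *skb)
> {
>         __u32 key = 0;
>         struct edt_state *st = bpf_map_lookup_elem(&edt_map, &key);
>
>         if (!st)
>                 return 0;                     /* drop */
>         if (st->qlen >= 1024) {
>                 __sync_fetch_and_add(&st->tokens, (__u64)skb->len);
>                 return 0;                     /* drop, tokens refunded */
>         }
>         __sync_fetch_and_add(&st->qlen, 1);
>         return 1;                             /* queued */
> }
>
> char _license[] SEC("license") = "GPL";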
> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdiscs as we can, whether in
> tree or out of tree. This is why I gave up on PIFO, which has
> serious limitations on programmability.
> Here is a summary of design decisions I made:
> 1. Avoid eBPF struct_ops, as it would be really hard to program
> a Qdisc with that approach: literally all of struct Qdisc_ops
> and struct Qdisc_class_ops would need to be implemented, which is
> almost as hard as programming a Qdisc kernel module.
> 2. Introduce an skb map, which will allow other eBPF programs to store
> skbs too.
> a) As eBPF maps are not directly visible to the kernel, we have to
> dump the stats via the eBPF map APIs instead of netlink.
> b) User space is not allowed to read entire packets; only the
> __sk_buff itself is readable, because we don't have such a use case
> yet, and it would require a different API to read the data, as map
> values have a fixed length.
> c) Two eBPF helpers are introduced for skb map operations:
> bpf_skb_map_push() and bpf_skb_map_pop(); normal map updates are
> not allowed (see the FIFO sketch after this list).
> d) Multi-queue support is implemented via map-in-map, in a similar
> push/pop fashion.
> e) Use the netdevice notifier to reset the packets inside skb map upon
> NETDEV_DOWN event.
> 3. Integrate with the existing TC infra. For example, if the user
> doesn't want to implement her own filters (e.g. a flow dissector),
> she should be able to reuse the existing TC filters. Another helper,
> bpf_skb_tc_classify(), is introduced for this purpose.
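> As a taste of what design decisions 2 and 3 could look like in
> practice, here is a trivial FIFO sketch on top of the skb map. The
> helper names bpf_skb_map_push()/bpf_skb_map_pop() and
> bpf_skb_tc_classify() come from this cover letter, but the map type
> name, helper signatures, attach point names and return codes below
> are all guesses, not the actual API:
>
> #include <linux/bpf.h>
> #include <bpf/bpf_helpers.h>
>
> struct {
>         __uint(type, BPF_MAP_TYPE_SKB_MAP);   /* guessed type name */
>         __uint(max_entries, 1024);
> } fifo SEC(".maps");
>
> SEC("sch_bpf/enqueue")                        /* guessed attach point */
> int fifo_enqueue(struct __sk_buff *skb)
> {
>         /* A classful variant could call bpf_skb_tc_classify() here
>          * to pick a per-class map with existing TC filters.
>          */
>         /* Guessed signature: push the skb with a rank; a constant
>          * rank degenerates to FIFO order; non-zero return means the
>          * map is full.
>          */
>         if (bpf_skb_map_push(&fifo, skb, 0))
>                 return 0;                     /* guessed: drop */
>         return 1;                             /* guessed: queued */
> }
>
> SEC("sch_bpf/dequeue")                        /* guessed attach point */
> int fifo_dequeue(struct __sk_buff *skb)
> {
>         /* Guessed semantics: pop the lowest-ranked skb and hand it
>          * back to the kernel for transmission.
>          */
>         return bpf_skb_map_pop(&fifo);
> }
>
> char _license[] SEC("license") = "GPL";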
> Any high-level feedback is welcome. Please kindly do not review any
> coding details until the RFC tag is removed.
> TODO:
> 1. actually test it
Can you try to implement some existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic could be reimplemented;
I feel like this patch series (even as an RFC) should also have a good
example to show that a bpf qdisc is on par and can be used to at least
implement existing policies. fq/fq_codel/cake are good candidates.