Message-ID: <Y0BhnccTKfXORS5E@google.com>
Date: Fri, 7 Oct 2022 10:27:57 -0700
From: sdf@...gle.com
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, yangpeihao@...u.edu.cn, toke@...hat.com,
jhs@...atatu.com, jiri@...nulli.us, bpf@...r.kernel.org,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc
On 10/05, Cong Wang wrote:
> From: Cong Wang <cong.wang@...edance.com>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.
> There are a few use cases:
> 1. Allow customizing Qdisc's in an easier way, so that people don't
> have to write a complete Qdisc kernel module just to experiment
> with some new queuing theory.
> 2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
> runs before enqueue, so it is impossible to adjust those "tokens" after
> packets get dropped in enqueue. With an eBPF Qdisc, this is easy to
> solve with a shared map between clsact and sch_bpf.
> 3. Replace qevents, as now the user gains much more control over the
> skb and queues.
> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> chain and block to reuse the TC filters, but they are too complicated
> to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
> TC filters on _any_ Qdisc (even on a different netdev) to do the
> classification.
> 5. Potentially pave a way for ingress to queue packets, although
> current implementation is still only for egress.
> 6. Possibly pave a way for handling the TCP protocol in TC, as rbtree
> itself is already used by TCP to handle retransmission.
> 7. Potentially pave a way for implementing an eBPF based CPU scheduler,
> because we already have task kptr and CFS is clearly based on rbtree
> too.
> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, whether
> in tree or out of tree. This is why I gave up on PIFO, which has
> serious limitations on programmability. More importantly, with
> rbtree, it is easy to implement PIFO logic too.
> Here is a summary of design decisions I made:
[..]
> 1. Avoid eBPF struct_ops, as it would be really hard to program
> a Qdisc with this approach: literally all of struct Qdisc_ops
> and struct Qdisc_class_ops would need to be implemented. This is
> almost as hard as programming a Qdisc kernel module.
Can you expand more on what exactly is 'hard'? We mostly need to
override enqueue/dequeue, and that's it?
Does it make sense to at least do, say, a sch_struct_ops? Maybe we can
have some higher-level struct_ops with only enqueue/dequeue for bpf
to override?
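To make the suggestion concrete, here is a rough sketch of what the bpf
side might look like. Purely hypothetical: struct bpf_qdisc_ops, the
section names and the return conventions are all made up.

/* Hypothetical sketch of a minimal "sch_struct_ops": bpf overrides
 * only enqueue/dequeue, the kernel keeps the rest of Qdisc_ops.
 * Neither struct bpf_qdisc_ops nor these section names exist today.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* hypothetical ops table the kernel would expose */
struct bpf_qdisc_ops {
	int (*enqueue)(struct sk_buff *skb);
	struct sk_buff *(*dequeue)(void);
};

SEC("struct_ops/fifo_enqueue")
int BPF_PROG(fifo_enqueue, struct sk_buff *skb)
{
	/* stash the skb somewhere (e.g. an skb-kptr map), or drop it */
	return 0;
}

SEC("struct_ops/fifo_dequeue")
struct sk_buff *BPF_PROG(fifo_dequeue)
{
	/* return the next skb to transmit, NULL when the queue is empty */
	return NULL;
}

SEC(".struct_ops")
struct bpf_qdisc_ops minimal_fifo_ops = {
	.enqueue = (void *)fifo_enqueue,
	.dequeue = (void *)fifo_dequeue,
};

char LICENSE[] SEC("license") = "GPL";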
> 2. Introduce a generic rbtree map, which supports kptr, so that skb's
> can be stored inside.
There have been several rbtree proposals. Do we really need an rbtree
here (ignoring performance)? Can we demonstrate these new qdisc
capabilities with a hashtable?
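E.g., something along these lines (rough sketch; it assumes the
skb-kptr support from this series, and struct skb_slot plus the key
layout are made up for illustration):

/* Rough sketch: a plain hashtable keyed by rank/flow id, holding the
 * skb as a kptr in the value.  Relies on this series' skb-kptr
 * support; the exact kptr flavor is glossed over here.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct skb_slot {
	struct sk_buff __kptr *skb;	/* skb held by reference */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);		/* rank or flow id */
	__type(value, struct skb_slot);
} queue SEC(".maps");

If the rbtree only buys dequeue ordering, it would help to spell out in
the cover letter why a hashtable plus ordering logic in bpf isn't enough
for the initial series.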
> a) As eBPF maps are not directly visible to the kernel, we have to
> dump the stats via eBPF map APIs instead of netlink.
> b) The user-space is not allowed to read the entire packet, only
> __sk_buff itself is readable, because we don't have such a use case
> yet and it would require a different API to read the data, as map
> values have fixed length.
[..]
> c) Multi-queue support is implemented via map-in-map, in a similar
> push/pop fashion, based on rbtree too.
I took a brief look at sch_bpf.c and I don't see how that would work.
Can you expand more on that (+locking)?
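For example, is the intended layout roughly the following? (A guess on
my side; BPF_MAP_TYPE_RBTREE is the map type from patch 1 of this
series, and the key/value types here are placeholders.)

/* Guess: an outer array-of-maps with one inner rbtree map per
 * band/queue.  Key/value types are placeholders since I'm not sure
 * what the rbtree map expects.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct rbtree_band {
	__uint(type, BPF_MAP_TYPE_RBTREE);	/* introduced in this series */
	__uint(max_entries, 4096);
	__type(key, __u64);			/* rank? */
	__type(value, __u64);			/* presumably holds an skb kptr */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 16);		/* one rbtree per band */
	__type(key, __u32);
	__array(values, struct rbtree_band);
} bands SEC(".maps");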
Also, in sch_bpf.c, why do you care about classid? Isn't the whole
purpose of the series to expand tc capabilities? Why are we still
dragging this classid along?
> d) Use the netdevice notifier to reset the packets inside skb map upon
> NETDEV_DOWN event. (TODO: need to recognize kptr type)
[..]
> 3. Integrate with existing TC infra. For example, if the user doesn't
> want to implement her own filters (e.g. a flow dissector), she should
> be able to re-use the existing TC filters. Another helper
> bpf_skb_tc_classify() is introduced for this purpose.
Supposedly, the idea is that bpf can implement all existing policies +
more, so why are we trying to re-use existing filters?
> Any high-level feedback is welcome. Please kindly ignore any coding
> details until the RFC tag is removed.
> TODO:
> 1. actually test it
> 2. write a document for this Qdisc
[..]
> 3. add test cases and sample code
I'd start with that; otherwise, it's not clear what the current
capabilities are.
> Cc: Toke Høiland-Jørgensen <toke@...hat.com>
> Cc: Jamal Hadi Salim <jhs@...atatu.com>
> Cc: Jiri Pirko <jiri@...nulli.us>
> Signed-off-by: Cong Wang <cong.wang@...edance.com>
> ---
> v6: switch to kptr based approach
> v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
> implement flow map as map-in-map
> rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
> clean up eBPF qdisc program context
> v4: get rid of PIFO, use rbtree directly
> v3: move priority queue from sch_bpf to skb map
> introduce skb map and its helpers
> introduce bpf_skb_classify()
> use netdevice notifier to reset skb's
> Rebase on latest bpf-next
> v2: Rebase on latest net-next
> Make the code more complete (but still incomplete)
> Cong Wang (5):
> bpf: Introduce rbtree map
> bpf: Add map in map support to rbtree
> net_sched: introduce eBPF based Qdisc
> net_sched: Add kfuncs for storing skb
> net_sched: Introduce helper bpf_skb_tc_classify()
> include/linux/bpf.h | 4 +
> include/linux/bpf_types.h | 4 +
> include/uapi/linux/bpf.h | 19 ++
> include/uapi/linux/pkt_sched.h | 17 +
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/rbtree.c | 603 +++++++++++++++++++++++++++++++++
> kernel/bpf/syscall.c | 7 +
> net/core/filter.c | 27 ++
> net/sched/Kconfig | 15 +
> net/sched/Makefile | 1 +
> net/sched/cls_api.c | 69 ++++
> net/sched/sch_bpf.c | 564 ++++++++++++++++++++++++++++++
> 12 files changed, 1331 insertions(+), 1 deletion(-)
> create mode 100644 kernel/bpf/rbtree.c
> create mode 100644 net/sched/sch_bpf.c
> --
> 2.34.1