Message-Id: <20211224200059.161979-1-xiyou.wangcong@gmail.com>
Date: Fri, 24 Dec 2021 12:00:56 -0800
From: Cong Wang <xiyou.wangcong@...il.com>
To: netdev@...r.kernel.org
Cc: bpf@...r.kernel.org, Cong Wang <cong.wang@...edance.com>,
Toke Høiland-Jørgensen <toke@...hat.com>,
Jamal Hadi Salim <jhs@...atatu.com>,
Jiri Pirko <jiri@...nulli.us>
Subject: [RFC Patch v3 0/3] net_sched: introduce eBPF based Qdisc
From: Cong Wang <cong.wang@...edance.com>
This *incomplete* patch introduces a programmable Qdisc with
eBPF. The goal is to make this Qdisc as programmable as possible,
that is, to replace as many existing Qdiscs as we can, whether
in tree or out of tree. We also want to make programmers' and researchers'
lives as easy as possible, so that they don't have to write a complete
Qdisc kernel module just to experiment with some queuing theory.
The design was discussed during last LPC:
https://linuxplumbersconf.org/event/7/contributions/679/attachments/520/1188/sch_bpf.pdf
Here is a summary of design decisions I made:
1. Avoid eBPF struct_ops, as it would be really hard to program
a Qdisc with this approach: literally all of struct Qdisc_ops
and struct Qdisc_class_ops would need to be implemented. This is almost
as hard as programming a Qdisc kernel module.
2. Introduce skb map, which will allow other eBPF programs to store skbs
too.
a) As eBPF maps are not directly visible to the kernel, we have to
dump the stats via eBPF map APIs instead of netlink.
b) User space is not allowed to read entire packets; only __sk_buff
itself is readable, because we don't have such a use case yet and it would
require a different API to read the data, as map values have fixed length.
c) Two eBPF helpers are introduced for skb map operations:
bpf_skb_map_enqueue() and bpf_skb_map_dequeue(). Normal map update is
not allowed.
d) Multi-queue support should be done via map-in-map. This is TBD.
e) Use the netdevice notifier to reset the packets inside skb map upon
NETDEV_DOWN event.
3. Integrate with existing TC infra. For example, if the user doesn't want
to implement her own filters (e.g. a flow dissector), she should be able
to re-use the existing TC filters. Another helper bpf_skb_classify() is
introduced for this purpose.
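To make the enqueue/dequeue semantics of the skb map a bit more concrete, here is a rough user-space model. It is only a sketch under assumptions: the v3 changelog says the priority queue moved into the skb map, so the model is a min-heap keyed by a 64-bit rank, but the names (struct pkt, struct skb_map_model, model_enqueue(), model_dequeue()), the fixed capacity and the "smallest key first" ordering are all hypothetical and are not the signatures of bpf_skb_map_enqueue() or bpf_skb_map_dequeue().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKB_MAP_MAX 64  /* hypothetical capacity, playing the role of max_entries */

struct pkt {            /* stand-in for an skb, NOT the kernel structure */
	uint32_t id;
};

struct slot {
	uint64_t key;   /* assumed rank: smaller key is dequeued first */
	struct pkt *p;
};

struct skb_map_model {
	struct slot heap[SKB_MAP_MAX];
	size_t len;
};

static void swap_slots(struct slot *a, struct slot *b)
{
	struct slot tmp = *a;

	*a = *b;
	*b = tmp;
}

/* Insert a packet with the given key; returns 0, or -1 if the map is full. */
static int model_enqueue(struct skb_map_model *m, struct pkt *p, uint64_t key)
{
	size_t i;

	assert(m && p);
	if (m->len == SKB_MAP_MAX)
		return -1;

	i = m->len++;
	m->heap[i].key = key;
	m->heap[i].p = p;

	/* Sift up until the parent's key is not larger than ours. */
	while (i > 0) {
		size_t parent = (i - 1) / 2;

		if (m->heap[parent].key <= m->heap[i].key)
			break;
		swap_slots(&m->heap[parent], &m->heap[i]);
		i = parent;
	}
	return 0;
}

/* Remove and return the packet with the smallest key, or NULL if empty. */
static struct pkt *model_dequeue(struct skb_map_model *m)
{
	struct pkt *out;
	size_t i = 0;

	assert(m);
	if (m->len == 0)
		return NULL;

	out = m->heap[0].p;
	m->heap[0] = m->heap[--m->len];

	/* Sift down: swap with the smaller child until heap order holds. */
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, min = i;

		if (l < m->len && m->heap[l].key < m->heap[min].key)
			min = l;
		if (r < m->len && m->heap[r].key < m->heap[min].key)
			min = r;
		if (min == i)
			break;
		swap_slots(&m->heap[i], &m->heap[min]);
		i = min;
	}
	return out;
}
```

In this model, enqueueing packets with keys 30, 10 and 20 and then dequeueing three times hands them back in key order 10, 20, 30, which is roughly what an enqueue program computing a rank and a dequeue program pulling the head would rely on.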
The biggest limitation is obviously that users cannot traverse
the packets or flows inside the Qdisc, but I think they could at least store
any global information of interest inside their own hashmap.
TBD: should we introduce an eBPF program for skb map which allows users to
sort the packets?
Any high-level feedback is welcome. Please kindly do not review any coding
details until the RFC tag is removed.
TODO:
1. actually test it
2. write a document for this Qdisc
3. add test cases and sample code
Cc: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Jamal Hadi Salim <jhs@...atatu.com>
Cc: Jiri Pirko <jiri@...nulli.us>
Signed-off-by: Cong Wang <cong.wang@...edance.com>
---
v3: move priority queue from sch_bpf to skb map
introduce skb map and its helpers
introduce bpf_skb_classify()
use netdevice notifier to reset skb's
Rebase on latest bpf-next
v2: Rebase on latest net-next
Make the code more complete (but still incomplete)
Cong Wang (3):
introduce priority queue
bpf: introduce skb map
net_sched: introduce eBPF based Qdisc
include/linux/bpf_types.h | 2 +
include/linux/priority_queue.h | 90 ++++++
include/linux/skbuff.h | 2 +
include/uapi/linux/bpf.h | 15 +
include/uapi/linux/pkt_sched.h | 17 ++
kernel/bpf/Makefile | 2 +-
kernel/bpf/skb_map.c | 244 +++++++++++++++
net/sched/Kconfig | 15 +
net/sched/Makefile | 1 +
net/sched/sch_bpf.c | 521 +++++++++++++++++++++++++++++++++
10 files changed, 908 insertions(+), 1 deletion(-)
create mode 100644 include/linux/priority_queue.h
create mode 100644 kernel/bpf/skb_map.c
create mode 100644 net/sched/sch_bpf.c
--
2.32.0