Message-ID: <Y0BhnccTKfXORS5E@google.com>
Date: Fri, 7 Oct 2022 10:27:57 -0700
From: sdf@...gle.com
To: Cong Wang <xiyou.wangcong@...il.com>
Cc: netdev@...r.kernel.org, yangpeihao@...u.edu.cn, toke@...hat.com,
jhs@...atatu.com, jiri@...nulli.us, bpf@...r.kernel.org,
Cong Wang <cong.wang@...edance.com>
Subject: Re: [RFC Patch v6 0/5] net_sched: introduce eBPF based Qdisc
On 10/05, Cong Wang wrote:
> From: Cong Wang <cong.wang@...edance.com>
> This *incomplete* patchset introduces a programmable Qdisc with eBPF.
> There are a few use cases:
> 1. Allow customizing Qdisc's in an easier way, so that people don't
> have to write a complete Qdisc kernel module just to experiment
> with some new queuing theory.
> 2. Solve EDT's problem. EDT calculates the "tokens" in clsact, which
> runs before enqueue, so it is impossible to adjust those "tokens" after
> packets get dropped in enqueue. With an eBPF Qdisc, this is easy to
> solve with a shared map between clsact and sch_bpf.
> 3. Replace qevents, as now the user gains much more control over the
> skb and queues.
> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
> chain and block to reuse the TC filters, but they are too complicated
> to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
> TC filters on _any_ Qdisc (even on a different netdev) to do the
> classification.
> 5. Potentially pave a way for ingress to queue packets, although
> current implementation is still only for egress.
> 6. Possibly pave a way for handling the TCP protocol in TC, as rbtree
> itself is already used by TCP to handle retransmission.
> 7. Potentially pave a way for implementing an eBPF based CPU scheduler,
> because we already have task kptr and CFS is clearly based on rbtree
> too.
> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, whether
> in tree or out of tree. This is why I gave up on PIFO, which has
> serious limitations on programmability. More importantly, with
> rbtree, it is easy to implement PIFO logic too.
> Here is a summary of design decisions I made:
[..]
> 1. Avoid eBPF struct_ops, as it would be really hard to program
> a Qdisc with this approach: literally all of struct Qdisc_ops
> and struct Qdisc_class_ops would need to be implemented. This is
> almost as hard as programming a Qdisc kernel module.
Can you expand more on what exactly is 'hard'? We mostly need to
override enqueue/dequeue, and that's it?
Does it make sense to at least do, say, a sch_struct_ops? Maybe we can
have some higher-level struct_ops with only enqueue/dequeue for bpf
to override?
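To make the suggestion concrete, here is a rough sketch of what the bpf
side might look like. Purely hypothetical: struct bpf_qdisc_ops, the
section names and the return conventions are all made up.

/* Hypothetical sketch of a minimal "sch_struct_ops": bpf overrides
 * only enqueue/dequeue, the kernel keeps the rest of Qdisc_ops.
 * Neither struct bpf_qdisc_ops nor these section names exist today.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* hypothetical ops table the kernel would expose */
struct bpf_qdisc_ops {
	int (*enqueue)(struct sk_buff *skb);
	struct sk_buff *(*dequeue)(void);
};

SEC("struct_ops/fifo_enqueue")
int BPF_PROG(fifo_enqueue, struct sk_buff *skb)
{
	/* stash the skb somewhere (e.g. an skb-kptr map), or drop it */
	return 0;
}

SEC("struct_ops/fifo_dequeue")
struct sk_buff *BPF_PROG(fifo_dequeue)
{
	/* return the next skb to transmit, NULL when the queue is empty */
	return NULL;
}

SEC(".struct_ops")
struct bpf_qdisc_ops minimal_fifo_ops = {
	.enqueue = (void *)fifo_enqueue,
	.dequeue = (void *)fifo_dequeue,
};

char LICENSE[] SEC("license") = "GPL";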
> 2. Introduce a generic rbtree map, which supports kptr, so that skb's
> can be stored inside.
There have been several rbtree proposals. Do we really need an rbtree
here (ignoring performance)? Can we demonstrate these new qdisc
capabilities with a hashtable?
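E.g., something along these lines (rough sketch; it assumes the
skb-kptr support from this series, and struct skb_slot plus the key
layout are made up for illustration):

/* Rough sketch: a plain hashtable keyed by rank/flow id, holding the
 * skb as a kptr in the value.  Relies on this series' skb-kptr
 * support; the exact kptr flavor is glossed over here.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct skb_slot {
	struct sk_buff __kptr *skb;	/* skb held by reference */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);		/* rank or flow id */
	__type(value, struct skb_slot);
} queue SEC(".maps");

If the rbtree only buys dequeue ordering, it would help to spell out in
the cover letter why a hashtable plus ordering logic in bpf isn't enough
for the initial series.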
> a) As eBPF maps are not directly visible to the kernel, we have to
> dump the stats via eBPF map APIs instead of netlink.
> b) The user-space is not allowed to read the entire packet, only
> __sk_buff itself is readable, because we don't have such a use case
> yet and it would require a different API to read the data, as map
> values have fixed length.
[..]
> c) Multi-queue support is implemented via map-in-map, in a similar
> push/pop fashion, based on rbtree too.
I took a brief look at sch_bpf.c and I don't see how that would work.
Can you expand more on that (+locking)?
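For example, is the intended layout roughly the following? (A guess on
my side; BPF_MAP_TYPE_RBTREE is the map type from patch 1 of this
series, and the key/value types here are placeholders.)

/* Guess: an outer array-of-maps with one inner rbtree map per
 * band/queue.  Key/value types are placeholders since I'm not sure
 * what the rbtree map expects.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct rbtree_band {
	__uint(type, BPF_MAP_TYPE_RBTREE);	/* introduced in this series */
	__uint(max_entries, 4096);
	__type(key, __u64);			/* rank? */
	__type(value, __u64);			/* presumably holds an skb kptr */
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, 16);		/* one rbtree per band */
	__type(key, __u32);
	__array(values, struct rbtree_band);
} bands SEC(".maps");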
Also, in sch_bpf.c, why do you care about classid? Isn't the whole
purpose of the series to expand tc capabilities? Why are we still
dragging this classid along?
> d) Use the netdevice notifier to reset the packets inside skb map upon
> NETDEV_DOWN event. (TODO: need to recognize kptr type)
[..]
> 3. Integrate with existing TC infra. For example, if the user doesn't
> want to implement her own filters (e.g. a flow dissector), she should
> be able to re-use the existing TC filters. Another helper
> bpf_skb_tc_classify() is introduced for this purpose.
Supposedly, the idea is that bpf can implement all existing policies +
more, so why are we trying to re-use existing filters?
> Any high-level feedback is welcome. Please kindly ignore any coding
> details until the RFC tag is removed.
> TODO:
> 1. actually test it
> 2. write a document for this Qdisc
[..]
> 3. add test cases and sample code
I'd start with that; otherwise, it's not clear what the current
capabilities are.
> Cc: Toke Høiland-Jørgensen <toke@...hat.com>
> Cc: Jamal Hadi Salim <jhs@...atatu.com>
> Cc: Jiri Pirko <jiri@...nulli.us>
> Signed-off-by: Cong Wang <cong.wang@...edance.com>
> ---
> v6: switch to kptr based approach
> v5: mv kernel/bpf/skb_map.c net/core/skb_map.c
> implement flow map as map-in-map
> rename bpf_skb_tc_classify() and move it to net/sched/cls_api.c
> clean up eBPF qdisc program context
> v4: get rid of PIFO, use rbtree directly
> v3: move priority queue from sch_bpf to skb map
> introduce skb map and its helpers
> introduce bpf_skb_classify()
> use netdevice notifier to reset skb's
> Rebase on latest bpf-next
> v2: Rebase on latest net-next
> Make the code more complete (but still incomplete)
> Cong Wang (5):
> bpf: Introduce rbtree map
> bpf: Add map in map support to rbtree
> net_sched: introduce eBPF based Qdisc
> net_sched: Add kfuncs for storing skb
> net_sched: Introduce helper bpf_skb_tc_classify()
> include/linux/bpf.h | 4 +
> include/linux/bpf_types.h | 4 +
> include/uapi/linux/bpf.h | 19 ++
> include/uapi/linux/pkt_sched.h | 17 +
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/rbtree.c | 603 +++++++++++++++++++++++++++++++++
> kernel/bpf/syscall.c | 7 +
> net/core/filter.c | 27 ++
> net/sched/Kconfig | 15 +
> net/sched/Makefile | 1 +
> net/sched/cls_api.c | 69 ++++
> net/sched/sch_bpf.c | 564 ++++++++++++++++++++++++++++++
> 12 files changed, 1331 insertions(+), 1 deletion(-)
> create mode 100644 kernel/bpf/rbtree.c
> create mode 100644 net/sched/sch_bpf.c
> --
> 2.34.1