netdev - Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-Id: <YrYj3LPaHV7thgJW@google.com>
Date:   Fri, 24 Jun 2022 13:51:40 -0700
From:   sdf@...gle.com
To:     Cong Wang <xiyou.wangcong@...il.com>
Cc:     netdev@...r.kernel.org, bpf@...r.kernel.org,
        Cong Wang <cong.wang@...edance.com>,
        "Toke Høiland-Jørgensen" <toke@...hat.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Jiri Pirko <jiri@...nulli.us>
Subject: Re: [RFC Patch v5 0/5] net_sched: introduce eBPF based Qdisc

On 06/01, Cong Wang wrote:
> From: Cong Wang <cong.wang@...edance.com>

> This *incomplete* patchset introduces a programmable Qdisc with eBPF.

> There are a few use cases:

> 1. Allow customizing Qdisc's in an easier way. So that people don't
>     have to write a complete Qdisc kernel module just to experiment
>     some new queuing theory.

> 2. Solve EDT's problem. EDT calcuates the "tokens" in clsact which
>     is before enqueue, it is impossible to adjust those "tokens" after
>     packets get dropped in enqueue. With eBPF Qdisc, it is easy to
>     be solved with a shared map between clsact and sch_bpf.

> 3. Replace qevents, as now the user gains much more control over the
>     skb and queues.

> 4. Provide a new way to reuse TC filters. Currently TC relies on filter
>     chain and block to reuse the TC filters, but they are too complicated
>     to understand. With eBPF helper bpf_skb_tc_classify(), we can invoke
>     TC filters on _any_ Qdisc (even on a different netdev) to do the
>     classification.

> 5. Potentially pave a way for ingress to queue packets, although
>     current implementation is still only for egress.

> 6. Possibly pave a way for handling TCP protocol in TC, as rbtree itself
>     is already used by TCP to handle TCP retransmission.

> The goal here is to make this Qdisc as programmable as possible,
> that is, to replace as many existing Qdisc's as we can, no matter
> in tree or out of tree. This is why I give up on PIFO which has
> serious limitations on the programmablity.

> Here is a summary of design decisions I made:

> 1. Avoid eBPF struct_ops, as it would be really hard to program
>     a Qdisc with this approach, literally all the struct Qdisc_ops
>     and struct Qdisc_class_ops are needed to implement. This is almost
>     as hard as programming a Qdisc kernel module.

> 2. Introduce skb map, which will allow other eBPF programs to store skb's
>     too.

>     a) As eBPF maps are not directly visible to the kernel, we have to
>     dump the stats via eBPF map API's instead of netlink.

>     b) The user-space is not allowed to read the entire packets, only  
> __sk_buff
>     itself is readable, because we don't have such a use case yet and it  
> would
>     require a different API to read the data, as map values have fixed  
> length.

>     c) Two eBPF helpers are introduced for skb map operations:
>     bpf_skb_map_push() and bpf_skb_map_pop(). Normal map update is
>     not allowed.

>     d) Multi-queue support is implemented via map-in-map, in a similar
>     push/pop fasion.

>     e) Use the netdevice notifier to reset the packets inside skb map upon
>     NETDEV_DOWN event.

> 3. Integrate with existing TC infra. For example, if the user doesn't want
>     to implement her own filters (e.g. a flow dissector), she should be  
> able
>     to re-use the existing TC filters. Another helper  
> bpf_skb_tc_classify() is
>     introduced for this purpose.

> Any high-level feedback is welcome. Please kindly do not review any coding
> details until RFC tag is removed.

> TODO:
> 1. actually test it

Can you try to implement some existing qdisc using your new mechanism?
For BPF-CC, Martin showcased how dctcp/cubic can be reimplemented;
I feel like this patch series (even rfc), should also have a good example
to show that bpf qdisc is on par and can be used to at least implement
existing policies. fq/fq_codel/cake are good candidates.