netdev - Re: [PATCH net-next 1/1] net/sched: Disambiguate verdict from return code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Mon, 2 Oct 2023 15:54:50 -0400
From: Jamal Hadi Salim <jhs@...atatu.com>
To: Daniel Borkmann <daniel@...earbox.net>
Cc: Victor Nogueira <victor@...atatu.com>, xiyou.wangcong@...il.com, jiri@...nulli.us, 
	davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com, 
	paulb@...dia.com, netdev@...r.kernel.org, kernel@...atatu.com, 
	martin.lau@...ux.dev, bpf@...r.kernel.org
Subject: Re: [PATCH net-next 1/1] net/sched: Disambiguate verdict from return code

Hi Daniel,
On Fri, Sep 29, 2023 at 11:48 AM Daniel Borkmann <daniel@...earbox.net> wrote:
>
> On 9/26/23 1:01 AM, Jamal Hadi Salim wrote:
> > On Fri, Sep 22, 2023 at 4:12 AM Daniel Borkmann <daniel@...earbox.net> wrote:
> >> On 9/20/23 1:20 AM, Jamal Hadi Salim wrote:
> >>> On Tue, Sep 19, 2023 at 6:15 PM Daniel Borkmann <daniel@...earbox.net> wrote:
> >>>> On 9/19/23 4:59 PM, Victor Nogueira wrote:
> [...]
> >>
> >> In the above case we don't have 'internal' errors which you want to trace, so I would
> >> also love to avoid the cost of zeroing struct tcf_result res which should be 3x 8b for
> >> every packet.
> >
> > We can move the zeroing inside tc_run() but we declare it in the same
> > spot as we do right now. You will still need to set res.verdict as
> > above.
> > Would that work for you?
>
> What I'm not following is that with the below you can avoid the unnecessary
> fast path cost (which is only for corner case which is almost never hit) and
> get even better visibility. Are you saying it doesn't work?

I am probably missing something:
-1/UNSPEC is a legit errno. And the main motivation here for this
patch is to disambiguate if it was -EPERM vs UNSPEC
Maybe that is what you are calling a "corner case"?

There are two options in my mind right now (since you are guaranteed
in tcx_run you will never return anything below UNSPEC):
1) we just have the switch statement invocation inside an inline
function and you can pass it sch_ret (for tcx case) and we'll pass it
res.verdit for tc_run() case.
2) is something is we leave tcx_run alone and we have something along
the lines of:

--------------
diff --git a/net/core/dev.c b/net/core/dev.c
index 1450f4741d9b..93613bce647c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3985,7 +3985,7 @@ sch_handle_ingress(struct sk_buff *skb, struct
packet_type **pt_prev, int *ret,
                   struct net_device *orig_dev, bool *another)
 {
        struct bpf_mprog_entry *entry =
rcu_dereference_bh(skb->dev->tcx_ingress);
-       struct tcf_result res = {0};
+       struct tcf_result res;
        int sch_ret;

        if (!entry)
@@ -4003,14 +4003,16 @@ sch_handle_ingress(struct sk_buff *skb, struct
packet_type **pt_prev, int *ret,
                if (sch_ret != TC_ACT_UNSPEC)
                        goto ingress_verdict;
        }
+
+       res.verdict = 0;
        sch_ret = tc_run(tcx_entry(entry), skb, &res);
        if (sch_ret < 0) {
                kfree_skb_reason(skb, SKB_DROP_REASON_TC_INGRESS_ERROR);
                *ret = NET_RX_DROP;
                return NULL;
        }
+       sch_ret = res.verdict;
 ingress_verdict:
-       switch (res.verdict) {
+       switch (sch_ret) {
        case TC_ACT_REDIRECT:
                /* skb_mac_header check was done by BPF, so we can
safely
                 * push the L2 header back before redirecting to another
-----------

on the drop reason - our thinking is to support drop_watch alongside
tracepoint given kfree_skb_reason exists already; if i am not mistaken
what you suggested would require us to create a new tracepoint?

cheers,
jamal

> >> I was more thinking like something below could be a better choice. I presume your main
> >> goal is to trace where these errors originated in the first place, so it might even be
> >> useful to capture the actual return code as well.
> >
> > The main motivation is a few syzkaller bugs which resulted in not
> > disambiguating between errors being returned and sometimes
> > TC_ACT_SHOT.
> >
> >> Then you can use perf script, bpf and whatnot to gather further insights into what
> >> happened while being less invasive and avoiding the need to extend struct tcf_result.
> >
> > We could use trace instead - the reason we have the skb reason is
> > being used in the other spots (does this trace require ebpf to be
> > usable?).
>
> No you can just use regular perf by attaching to the tracepoint, no need for using
> bpf at all here.
>
> >> This would be quite similar to trace_xdp_exception() as well, and I think you can guarantee
> >> that in fast path all errors are < TC_ACT_UNSPEC anyway.
> >
> > I am not sure i followed. 0 means success, result codes are returned in res now.
>
> What I was saying is that you don't need the struct change from the patch, but only
> the changes where you rework TC_ACT_SHOT into one of the -E<errors>, and then with
> the below you can pass this through an exception tracepoint.
>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 85df22f05c38..4089d195144d 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -3925,6 +3925,10 @@ static int tc_run(struct tcx_entry *entry, struct sk_buff *skb)
> >>
> >>          mini_qdisc_bstats_cpu_update(miniq, skb);
> >>          ret = tcf_classify(skb, miniq->block, miniq->filter_list, &res, false);
> >> +       if (unlikely(ret < TC_ACT_UNSPEC)) {
> >> +               trace_tc_exception(skb->dev, skb->tc_at_ingress, ret);
> >> +               ret = TC_ACT_SHOT;
> >> +       }
> >>          /* Only tcf related quirks below. */
> >>          switch (ret) {
> >>          case TC_ACT_SHOT:
>
> Thanks,
> Daniel