linux-kernel - Re: [PATCH net-next] net: introduce budget

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAL+tcoByPYTcm11cPMoVww_Ba4pv3ApD7dJRjNGsOiHCuNaGiA@mail.gmail.com>
Date:   Tue, 14 Mar 2023 17:21:49 +0800
From:   Jason Xing <kerneljasonxing@...il.com>
To:     Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     Kui-Feng Lee <sinquersw@...il.com>, brouer@...hat.com,
        davem@...emloft.net, edumazet@...gle.com, kuba@...nel.org,
        pabeni@...hat.com, ast@...nel.org, daniel@...earbox.net,
        hawk@...nel.org, john.fastabend@...il.com, kuniyu@...zon.com,
        liuhangbin@...il.com, xiangxia.m.yue@...il.com, jiri@...dia.com,
        andy.ren@...cruise.com, bpf@...r.kernel.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        Jason Xing <kernelxing@...cent.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Simon Sundberg <Simon.Sundberg@....se>
Subject: Re: [PATCH net-next] net: introduce budget_squeeze to help us tune rx behavior

On Tue, Mar 14, 2023 at 4:41 PM Jesper Dangaard Brouer
<jbrouer@...hat.com> wrote:
>
>
> On 14/03/2023 02.57, Jason Xing wrote:
> > On Tue, Mar 14, 2023 at 5:58 AM Kui-Feng Lee <sinquersw@...il.com> wrote:
> >>
> >> On 3/11/23 08:36, Jason Xing wrote:
> >>> From: Jason Xing <kernelxing@...cent.com>
> >>>
> >>> When we encounter some performance issue and then get lost on how
> >>> to tune the budget limit and time limit in net_rx_action() function,
> >>> we can separately counting both of them to avoid the confusion.
> >>>
> >>> Signed-off-by: Jason Xing <kernelxing@...cent.com>
> >>> ---
> >>> note: this commit is based on the link as below:
> >>> https://lore.kernel.org/lkml/20230311151756.83302-1-kerneljasonxing@gmail.com/
> >>> ---
> [...]
> >>> diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
> >>> index 97a304e1957a..4d1a499d7c43 100644
> >>> --- a/net/core/net-procfs.c
> >>> +++ b/net/core/net-procfs.c
> >>> @@ -174,14 +174,17 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
> >>>         */
> >>>        seq_printf(seq,
> >>>                   "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x "
> >>> -                "%08x %08x\n",
> >>> -                sd->processed, sd->dropped, sd->time_squeeze, 0,
> >>> +                "%08x %08x %08x %08x\n",
> >>> +                sd->processed, sd->dropped,
> >>> +                0, /* was old way to count time squeeze */
> >>
> >> Should we show a proximate number?  For example,
> >> sd->time_squeeze + sd->bud_squeeze.
> >
> > Yeah, It does make sense. Let the old way to display untouched.
> >
>
[...]
> Yes, I don't think we can/should remove this squeeze stat because
> several tools e.g. my own[1] captures these stats (and I know Willem
> also have his own tool).
> I like the sd->time_squeeze + sd->budget_squeeze suggestion.

So do I. Therefore I followed this suggestion in the next submission.

[1]
https://lore.kernel.org/lkml/20230314030532.9238-3-kerneljasonxing@gmail.com/

>
>   [1]
> https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl
>
>
> >>
> >>
> >>> +                0,
> >>>                   0, 0, 0, 0, /* was fastroute */
> >>>                   0,   /* was cpu_collision */
> >>>                   sd->received_rps, flow_limit_count,
> >>>                   0,   /* was len of two backlog queues */
> >>>                   (int)seq->index,
> >>> -                softnet_input_pkt_queue_len(sd), softnet_process_queue_len(sd));
> >>> +                softnet_input_pkt_queue_len(sd), softnet_process_queue_len(sd),
> >>> +                sd->time_squeeze, sd->budget_squeeze);
> >>>        return 0;
> >>>    }
> >>>
>
[...]
> We recently had a very long troubleshooting session around a latency
> issue (Cc Simon) where we used the tool[1].  The issue was NIC hardware
> RX queue was backlogged, but we didn't see any squeeze events, which
> confused us. (This happens because budget was 300 and two NICs using 64
> budget each doesn't exceed 300).

I recently found some users running on our production environment hit
the time_squeeze very often which aroused my interests.
Env:
1) budget is 300;
2) eth0 is virtio_net which only registers 32 input interrupts (32
queue pairs) with a larger number of cpus online.

>
> We were/are missing another counter to tell us net_rx_action() "repoll"
> is happening (as code !list_empty(&repoll)).  That were the case and it
> would have "told" us that hardware RX ring was full (larger than 64).
>
> We worked around this limitation by using the tracepoint for napi_poll,
> and manually deduced that 64 bulking must mean that "repoll" were happening.
>
> Oneliner bpftrace script:
>
>   bpftrace -e 'tracepoint:napi:napi_poll {
> @napi_rx_bulk[str(args->dev_name)] = lhist(args->work, 0, 64, 4); }'
>
> We used this script (that also measures softirq latency):
>
>
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/napi_monitor.bt
>
>
[...]
> I do wonder is it would be valuable to *also* add a tracepoint to
> net_rx_action, that expose sd->time_squeeze, sd->budget_squeeze and
> repoll-not-empty.

I believe it's useful that we can show more details in softnet_data,
but I'm confused about how to display them.
This morning I submitted one patch[1] and chose to do such things when
reading the softnet_stat file.

Could we add more data in the softnet_stat file while also tracing
those three important points? I'm not sure.

Thanks,
Jason

>
> --Jesper
>