linux-kernel - Re: [PATCH net-next] net: introduce budget

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3a103f0b-7c90-b9f5-0337-22ef46eba1a5@redhat.com>
Date:   Tue, 14 Mar 2023 09:39:35 +0100
From:   Jesper Dangaard Brouer <jbrouer@...hat.com>
To:     Jason Xing <kerneljasonxing@...il.com>,
        Kui-Feng Lee <sinquersw@...il.com>
Cc:     brouer@...hat.com, davem@...emloft.net, edumazet@...gle.com,
        kuba@...nel.org, pabeni@...hat.com, ast@...nel.org,
        daniel@...earbox.net, hawk@...nel.org, john.fastabend@...il.com,
        kuniyu@...zon.com, liuhangbin@...il.com, xiangxia.m.yue@...il.com,
        jiri@...dia.com, andy.ren@...cruise.com, bpf@...r.kernel.org,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        Jason Xing <kernelxing@...cent.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Simon Sundberg <Simon.Sundberg@....se>
Subject: Re: [PATCH net-next] net: introduce budget_squeeze to help us tune rx
 behavior


On 14/03/2023 02.57, Jason Xing wrote:
> On Tue, Mar 14, 2023 at 5:58 AM Kui-Feng Lee <sinquersw@...il.com> wrote:
>>
>> On 3/11/23 08:36, Jason Xing wrote:
>>> From: Jason Xing <kernelxing@...cent.com>
>>>
>>> When we encounter some performance issue and then get lost on how
>>> to tune the budget limit and time limit in net_rx_action() function,
>>> we can separately counting both of them to avoid the confusion.
>>>
>>> Signed-off-by: Jason Xing <kernelxing@...cent.com>
>>> ---
>>> note: this commit is based on the link as below:
>>> https://lore.kernel.org/lkml/20230311151756.83302-1-kerneljasonxing@gmail.com/
>>> ---
[...]
>>> diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
>>> index 97a304e1957a..4d1a499d7c43 100644
>>> --- a/net/core/net-procfs.c
>>> +++ b/net/core/net-procfs.c
>>> @@ -174,14 +174,17 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
>>>         */
>>>        seq_printf(seq,
>>>                   "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x "
>>> -                "%08x %08x\n",
>>> -                sd->processed, sd->dropped, sd->time_squeeze, 0,
>>> +                "%08x %08x %08x %08x\n",
>>> +                sd->processed, sd->dropped,
>>> +                0, /* was old way to count time squeeze */
>>
>> Should we show a proximate number?  For example,
>> sd->time_squeeze + sd->bud_squeeze.
> 
> Yeah, It does make sense. Let the old way to display untouched.
>

Yes, I don't think we can/should remove this squeeze stat because
several tools e.g. my own[1] captures these stats (and I know Willem
also have his own tool).
I like the sd->time_squeeze + sd->budget_squeeze suggestion.

  [1] 
https://github.com/netoptimizer/network-testing/blob/master/bin/softnet_stat.pl


>>
>>
>>> +                0,
>>>                   0, 0, 0, 0, /* was fastroute */
>>>                   0,   /* was cpu_collision */
>>>                   sd->received_rps, flow_limit_count,
>>>                   0,   /* was len of two backlog queues */
>>>                   (int)seq->index,
>>> -                softnet_input_pkt_queue_len(sd), softnet_process_queue_len(sd));
>>> +                softnet_input_pkt_queue_len(sd), softnet_process_queue_len(sd),
>>> +                sd->time_squeeze, sd->budget_squeeze);
>>>        return 0;
>>>    }
>>>

We recently had a very long troubleshooting session around a latency
issue (Cc Simon) where we used the tool[1].  The issue was NIC hardware
RX queue was backlogged, but we didn't see any squeeze events, which
confused us. (This happens because budget was 300 and two NICs using 64
budget each doesn't exceed 300).

We were/are missing another counter to tell us net_rx_action() "repoll"
is happening (as code !list_empty(&repoll)).  That were the case and it
would have "told" us that hardware RX ring was full (larger than 64).

We worked around this limitation by using the tracepoint for napi_poll,
and manually deduced that 64 bulking must mean that "repoll" were happening.

Oneliner bpftrace script:

  bpftrace -e 'tracepoint:napi:napi_poll { 
@napi_rx_bulk[str(args->dev_name)] = lhist(args->work, 0, 64, 4); }'

We used this script (that also measures softirq latency):

 
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/napi_monitor.bt 


I do wonder is it would be valuable to *also* add a tracepoint to
net_rx_action, that expose sd->time_squeeze, sd->budget_squeeze and
repoll-not-empty.

--Jesper