Message-ID: <CAL+tcoCoS1h_iA4-O=teR0zbNK3QyAFh_wH+g22635Gc1_mTGg@mail.gmail.com>
Date: Tue, 21 Mar 2023 10:08:39 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc: Jakub Kicinski <kuba@...nel.org>, brouer@...hat.com,
davem@...emloft.net, edumazet@...gle.com, pabeni@...hat.com,
ast@...nel.org, daniel@...earbox.net, hawk@...nel.org,
john.fastabend@...il.com, stephen@...workplumber.org,
simon.horman@...igine.com, sinquersw@...il.com,
bpf@...r.kernel.org, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [PATCH v4 net-next 2/2] net: introduce budget_squeeze to help us
tune rx behavior
On Mon, Mar 20, 2023 at 9:30 PM Jesper Dangaard Brouer
<jbrouer@...hat.com> wrote:
>
>
> On 17/03/2023 05.11, Jason Xing wrote:
> > On Fri, Mar 17, 2023 at 11:26 AM Jakub Kicinski <kuba@...nel.org> wrote:
> >>
> >> On Fri, 17 Mar 2023 10:27:11 +0800 Jason Xing wrote:
> >>>> That is the common case, and can be understood from the napi trace
> >>>
> >>> Thanks for your reply. It is commonly happening every day on many servers.
> >>
> >> Right but the common issue is the time squeeze, not budget squeeze,
> >
> > Most of them are about time, so yes.
> >
> >> and either way the budget squeeze doesn't really matter because
> >> the softirq loop will call us again soon, if softirq itself is
> >> not scheduled out.
> >>
>
[...]
> I agree, the budget squeeze count doesn't provide much value as it
> doesn't indicate something critical (softirq loop will call us again
> soon). The time squeeze event is more critical and something that is
> worth monitoring.
>
> I see value in this patch, because it makes it possible to monitor the
> time squeeze events. Currently the counter is "polluted" by the budget
> squeeze, making it impossible to get a proper time squeeze signal.
> Thus, I see this patch as a fix to an old problem.
>
> Acked-by: Jesper Dangaard Brouer <brouer@...hat.com>
Thanks for your acknowledgement. As you said, this patch adds no
functional or performance-related change; it only makes a small change
to the previous output so that it becomes more accurate. Even though I
would like to get it merged, I have to drop this patch and resend
another one. If the maintainers really think it matters, I hope it
will be picked up someday :)
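
For context, the current exit path in net_rx_action() folds both exit
reasons into a single counter. A simplified sketch (based on
net/core/dev.c; the exact code varies between kernel versions):

    budget -= napi_poll(n, &repoll);

    /* Whether we ran out of budget or ran out of time, the same
     * counter is bumped, so time_squeeze alone cannot tell the two
     * conditions apart.
     */
    if (unlikely(budget <= 0 ||
                 time_after_eq(jiffies, time_limit))) {
        sd->time_squeeze++;
        break;
    }

The idea of the patch is simply to count the two exits separately,
along these lines (a sketch of the intent, not the literal diff):

    if (unlikely(budget <= 0)) {
        sd->budget_squeeze++;       /* new counter */
        break;
    }
    if (unlikely(time_after_eq(jiffies, time_limit))) {
        sd->time_squeeze++;         /* now a pure time signal */
        break;
    }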
>
> That said (see below), besides monitoring time squeeze counter, I
> recommend adding some BPF monitoring to capture latency issues...
>
> >> So if you want to monitor a meaningful event in your fleet, I think
> >> a better event to monitor is the number of times ksoftirqd was woken
> >> up and latency of it getting onto the CPU.
> >
> > It's a good point. Thanks for your advice.
>
> I'm willing to help you out with writing a BPF-based tool that can
> help you identify the issue Jakub describes above: high latency from
> when a softirq is raised until softirq processing runs on the CPU.
>
> I have this bpftrace script[1] available that does just that:
>
> [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>
A few days ago, I did the same thing with the bcc tools to handle
those complicated issues. Your bpftrace script looks much more
concise. Thanks anyway.
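
In case it helps others, here is a minimal libbpf-style sketch of the
same measurement (names and file layout are illustrative; it assumes a
BTF-generated vmlinux.h and only tracks NET_RX per CPU):

    /* softirq_lat.bpf.c -- minimal sketch, not a polished tool.
     * Measures the delay from softirq_raise to softirq_entry for
     * NET_RX_SOFTIRQ.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    #define NET_RX_SOFTIRQ 3        /* vec number in the kernel */

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, u32);
            __type(value, u64);
    } raise_ts SEC(".maps");

    SEC("tracepoint/irq/softirq_raise")
    int on_raise(struct trace_event_raw_softirq *ctx)
    {
            u32 key = 0;
            u64 ts = bpf_ktime_get_ns();

            if (ctx->vec != NET_RX_SOFTIRQ)
                    return 0;
            bpf_map_update_elem(&raise_ts, &key, &ts, BPF_ANY);
            return 0;
    }

    SEC("tracepoint/irq/softirq_entry")
    int on_entry(struct trace_event_raw_softirq *ctx)
    {
            u32 key = 0;
            u64 *tsp;

            if (ctx->vec != NET_RX_SOFTIRQ)
                    return 0;
            tsp = bpf_map_lookup_elem(&raise_ts, &key);
            if (!tsp || !*tsp)
                    return 0;
            /* A real tool would aggregate into a histogram here. */
            bpf_printk("NET_RX latency: %llu ns",
                       bpf_ktime_get_ns() - *tsp);
            *tsp = 0;
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

It keeps one outstanding timestamp per CPU, which is enough because a
softirq is normally serviced on the same CPU that raised it.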
> Perhaps you can take the latency histograms and then plot a heatmap[2]
> in your monitoring platform.
>
> [2] https://www.brendangregg.com/heatmaps.html
>
> --Jesper
>