[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zzb_7hXRPgYMACb9@x130>
Date: Fri, 15 Nov 2024 00:01:50 -0800
From: Saeed Mahameed <saeedm@...dia.com>
To: Yafang Shao <laoar.shao@...il.com>
Cc: Jakub Kicinski <kuba@...nel.org>, ttoukan.linux@...il.com,
gal@...dia.com, tariqt@...dia.com, leon@...nel.org,
netdev@...r.kernel.org, linux-rdma@...r.kernel.org
Subject: Re: [PATCH v2 net-next] net/mlx5e: Report rx_discards_phy via
rx_fifo_errors
On 15 Nov 13:50, Yafang Shao wrote:
>On Fri, Nov 15, 2024 at 12:32 PM Jakub Kicinski <kuba@...nel.org> wrote:
>>
>> On Fri, 15 Nov 2024 11:56:38 +0800 Yafang Shao wrote:
>> > > On Thu, 14 Nov 2024 10:17:11 +0800 Yafang Shao wrote:
>> > > > - * Not recommended for use in drivers for high speed interfaces.
>> > >
>> > > I thought I suggested we provide clear guidance on this counter being
>> > > related to processing pipeline being to slow, vs host backpressure.
>> > > Just deleting the line that says "don't use" is not going to cut it :|
>> >
>> > Hello Jakub,
>> >
>> > After investigating other network drivers, I found that they all
>> > report this metric to rx_missed_errors:
>> >
>> > - i40e
>> > The corresponding ethtool metric is port.rx_discards, which was
>> > mapped to rx_missed_errors in commit 5337d2949733 ("i40e: Add
>> > rx_missed_errors for buffer exhaustion").
>> >
>> > - broadcom
>> > The equivalent metric is rx_total_discard_pkts, reported as
>> > rx_missed_errors in commit c0c050c58d84 ("bnxt_en: New Broadcom
>> > ethernet driver")
>> >
>> > Given this, it seems we should align with the standard practice and
>> > report this metric to rx_missed_errors.
>> >
>> > Tariq, what are your thoughts?
>>
>> mlx5 already reports rx_missed_errors and AFAIU rx_discards_phy are very
>> different kind of drops than the drops reported as 'missed'.
>> The distinction is useful in production in my experience working with
>> mlx5 devices.
Yes rx_missed_errors is lack of software descriptors, please don't mix it
with hardware pipeline FIFO discards.
FYI: mlx5 reports more discards related to pipeline see below,
especially for per PF/VF buffers. When these are advancing, usually they
indicate congestion control issues, for example pause frames is off.
rx_prio[p]_buf_discard
The number of packets discarded by device due to lack of per host receive buffers.
rx_prio[p]_cong_discard
The number of packets discarded by device due to per host congestion.
rx_prio[p]_discard (rx_discard_phy is the sum of all rx_prio[p]_discard)
The number of packets discarded by device due to lack of receive buffers.
That being said, these are not errors, so reporting them via rx_xyz_error
is very misleading, rx_missed_errors is a special case though, and let's
keep it that way.
>
>>From the manual [0], it says :
>
>The number of received packets dropped due to lack of buffers on a
>physical port. If this counter is increasing, it implies that the
>adapter is congested and cannot absorb the traffic coming from the
>network.
>
>Would it be possible to add this description to if_link.h?
>
>Frankly, it doesn’t make much difference to end users like me whether
>this is reported to rx_missed_errors or rx_fifo_errors; the main goal
>is simply to monitor this metric to flag any issues...
>
not rx_missed_errors please, it is exclusive for software lack of buffers.
Please have a look at thtool_eth_XXX_stats IEEE ethnl_stats, if you need to
extend, this is the place.
RFC2863[1] defines this type of discards as ifInDiscards. So let's add
it to ehttool std stats. mlx5 reports most of them already to driver custom
ethtool -S
[1] https://datatracker.ietf.org/doc/html/rfc2863
- Saeed
>[0]. https://enterprise-support.nvidia.com/s/article/understanding-mlx5-ethtool-counters
>
>
>--
>Regards
>Yafang
Powered by blists - more mailing lists