[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <871qmzoo2r.fsf@nvidia.com>
Date: Wed, 08 Feb 2023 13:16:12 -0800
From: Rahul Rameshbabu <rrameshbabu@...dia.com>
To: Saeed Mahameed <saeedm@...dia.com>
Cc: Vadim Fedorenko <vadfed@...a.com>,
Saeed Mahameed <saeed@...nel.org>,
Jakub Kicinski <kuba@...nel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Tariq Toukan <tariqt@...dia.com>
Subject: Re: [pull request][net 00/10] mlx5 fixes 2023-02-07
On Wed, 08 Feb, 2023 12:52:55 -0800 Saeed Mahameed <saeedm@...dia.com> wrote:
> Hi Vadim,
>
> We have some new findings internally and Rahul is testing your patches,
> he found some issues where the patches don't handle the case where only drops are happening, meanings no OOO.
>
> Rahul can share more details, he's still working on this and I believe we will have a fully detailed follow-up by the end of the week.
One thing I noticed was the conditional in mlx5e_ptp_ts_cqe_ooo in v5
does handle OOO but considers the monotomically increasing case of 1,3,4
for example to be OOO as well (a resync does not occur when I tested
this case).
A simple patch I made to verify this.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
index ae75e230170b..dfa5c53bd0d5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
@@ -125,6 +125,8 @@ static void mlx5e_ptp_handle_ts_cqe(struct mlx5e_ptpsq *ptpsq,
struct sk_buff *skb;
ktime_t hwtstamp;
+ pr_info("wqe_counter value: %u\n", skb_id);
+
if (unlikely(MLX5E_RX_ERR_CQE(cqe))) {
skb = mlx5e_skb_fifo_pop(&ptpsq->skb_fifo);
ptpsq->cq_stats->err_cqe++;
@@ -133,6 +135,7 @@ static void mlx5e_ptp_handle_ts_cqe(struct mlx5e_ptpsq *ptpsq,
if (mlx5e_ptp_ts_cqe_drop(ptpsq, skb_cc, skb_id)) {
if (mlx5e_ptp_ts_cqe_ooo(ptpsq, skb_id)) {
+ pr_info("Marked ooo wqe_counter: %u\n", skb_id);
/* already handled by a previous resync */
ptpsq->cq_stats->ooo_cqe_drop++;
return;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f7897ddb29c5..8582f0535e21 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -646,7 +646,7 @@ static void mlx5e_cqe_ts_id_eseg(struct mlx5e_ptpsq *ptpsq, struct sk_buff *skb,
struct mlx5_wqe_eth_seg *eseg)
{
if (ptpsq->ts_cqe_ctr_mask && unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP))
- eseg->flow_table_metadata = cpu_to_be32(ptpsq->skb_fifo_pc &
+ eseg->flow_table_metadata = cpu_to_be32((ptpsq->skb_fifo_pc * 2) &
ptpsq->ts_cqe_ctr_mask);
}
Basically, I multiply the wqe_counter written in the WQE by two. The
thing here is we have a situation where we have "lost" a CQE with
wqe_counter index of one, but the patch treats that as OOO, which
basically disables our normal resiliency path for resyncs on drops. At
that point, the patch could just remove the resync logic altogether when
a drop is detected.
What I noticed then was that the case of 0,2 was marked as OOO even
though out of order would be something like 0,2,1.
[Feb 8 02:40] wqe_counter value: 0
[ +24.199404] wqe_counter value: 2
[ +0.001041] Marked ooo wqe_counter: 2
I acknowledge the OOO issue but not sure the patch as is, correctly
solves the issue.
>
> Sorry for the late update but these new findings are only from yesterday.
>
> Thanks,
> Saeed.
>
>
> -------------------------------------------------------------------------------------------------------------------------
> From: Vadim Fedorenko <vadfed@...a.com>
> Sent: Wednesday, February 8, 2023 4:40 AM
> To: Saeed Mahameed <saeed@...nel.org>; Jakub Kicinski <kuba@...nel.org>
> Cc: Saeed Mahameed <saeedm@...dia.com>; netdev@...r.kernel.org <netdev@...r.kernel.org>; Tariq Toukan <tariqt@...dia.com>
> Subject: Re: [pull request][net 00/10] mlx5 fixes 2023-02-07
>
> On 08/02/2023 03:02, Saeed Mahameed wrote:
>> From: Saeed Mahameed <saeedm@...dia.com>
>>
>> This series provides bug fixes to mlx5 driver.
>> Please pull and let me know if there is any problem.
>>
> Still no patches for PTP queue? That's a bit wierd.
> Do you think that they are not ready to be in -net?
-- Rahul Rameshbabu
Powered by blists - more mailing lists