Message-ID: <20170123235658.3872770-1-tom@herbertland.com>
Date: Mon, 23 Jan 2017 15:56:58 -0800
From: Tom Herbert <tom@...bertland.com>
To: <netdev@...r.kernel.org>, <tariqt@...lanox.com>,
<saeedm@...lanox.com>
CC: <kernel-team@...com>
Subject: [RFC PATCH] mlx5: Fix page refcnt issue

This patch is an FYI about possible issues in mlx5.
There are two issues we discovered in the mlx5 backport from 4.9 to
4.6...

The bad behavior we were seeing was page refcnts going below zero,
eventually killing hosts. We have only seen this running a real
application workload, and it does take a while to trigger. The root
cause is unclear, but this patch seems to circumvent the issue. I have
not been able to reproduce the issue upstream because I haven't been
able to get the app running cleanly there; at this point I cannot
verify whether or not it is an upstream issue.

Specific issues that I have identified are:

1) In mlx5e_free_rx_mpwqe we are seeing cases where wi->skbs_frags[i] >
pg_strides, so that a negative value is passed to page_ref_sub. The
warning this patch adds at en_rx.c:409 does trigger (see the sketch
after this list).

2) Checksum faults are occurring. I have previously described this
problem.
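
To make issue 1 concrete, here is a hypothetical userspace sketch (the
names mirror the driver, but this is not driver code) of the arithmetic
involved: once skbs_frags exceeds pg_strides, the release path passes a
negative count to page_ref_sub, i.e. it silently adds references, which
is exactly the condition the new WARN_ON catches.

#include <stdio.h>

int main(void)
{
        int pg_strides = 8;   /* strides (and up-front page refs) per page */
        int skbs_frags = 10;  /* frags handed to skbs; should never exceed 8 */
        int delta = pg_strides - skbs_frags;

        if (delta < 0)  /* the condition the patch warns on */
                printf("page_ref_sub(page, %d) would add %d refs\n",
                       delta, -delta);
        return 0;
}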

As for the checksum faults, that seems to be an unrelated issue, and I
do believe it was an upstream issue in 4.9, though it may have been
fixed in 4.10. This is easier to reproduce than the page refcnt issue:
just a few concurrent netperf TCP_STREAM instances were enough to
trigger the faults.
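
For context, with CHECKSUM_COMPLETE the driver hands the stack the
device's ones'-complement sum of the packet (csum_unfold of
cqe->check_sum), and the stack logs "hw csum failure" when its own
software sum over the same bytes disagrees. A hypothetical userspace
sketch of that comparison (not kernel code, values made up):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 16-bit ones'-complement sum, as used by IP/TCP/UDP checksums */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
                sum += (uint32_t)buf[i] << 8 | buf[i + 1];
        if (len & 1)
                sum += (uint32_t)buf[len - 1] << 8;
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

int main(void)
{
        uint8_t pkt[] = { 0x45, 0x00, 0x00, 0x1c, 0xde, 0xad, 0xbe, 0xef };
        uint16_t sw = ones_sum(pkt, sizeof(pkt));
        uint16_t hw = sw ^ 0x0100;  /* simulate a bad sum from the device */

        if (hw != sw)
                printf("hw csum failure: device 0x%04x vs stack 0x%04x\n",
                       hw, sw);
        return 0;
}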

One other potential problem in the driver is the use of put_page in the
page release path (mlx5e_page_release). Comparing against other drivers
(ixgbe, for instance), some seem to use __free_pages instead. I don't
know which is correct to use, but somehow it doesn't seem like they can
both be right.
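
My understanding of the difference, sketched below, is that both drop a
page reference and free the page when the count hits zero, but
__free_pages() is told the allocation order explicitly, while put_page()
only knows the order for compound (__GFP_COMP) pages; for order-0 pages
the two should be equivalent. This is not a claim about which one mlx5
needs.

#include <linux/gfp.h>
#include <linux/mm.h>

/* order-0: the order is implied, put_page() suffices */
static void release_order0(struct page *page)
{
        put_page(page);
}

/* high-order, non-compound: the order must be passed explicitly so
 * that all 1 << order pages are freed when the refcount hits zero
 * (this is the ixgbe-style release)
 */
static void release_high_order(struct page *page, unsigned int order)
{
        __free_pages(page, order);
}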

This patch:

1) No longer does the page_ref_sub/page_ref_add accounting; instead it
takes the more traditional approach of just doing get_page when giving
the page to an skb (a sketch contrasting the two schemes follows this
list).

2) Adds warnings to catch cases where operations on page refcnts are
nonsensical.

3) Comments out checksum-complete processing in order to squelch the
checksum faults.
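
A simplified side-by-side sketch of the two accounting schemes (my
reading of the code, not actual driver functions):

#include <linux/mm.h>
#include <linux/page_ref.h>

/* Scheme A (original): batch one ref per stride up front, then return
 * whatever was not consumed by skbs at teardown.
 */
static void scheme_a_alloc(struct page *page, int pg_strides)
{
        page_ref_add(page, pg_strides);
}

static void scheme_a_free(struct page *page, int pg_strides, int skbs_frags)
{
        /* goes wrong if skbs_frags ever exceeds pg_strides */
        page_ref_sub(page, pg_strides - skbs_frags);
}

/* Scheme B (this patch): take exactly one ref at the moment a frag is
 * attached to an skb, paired with the put_page() the skb layer does
 * when the skb is freed.
 */
static void scheme_b_add_frag(struct page *page)
{
        get_page(page);
}
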
---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 20f116f..b24c2b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -209,6 +209,7 @@ static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
}
if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+ WARN_ON(page_ref_count(cache->page_cache[cache->head].page) <= 0);
rq->stats.cache_busy++;
return false;
}
@@ -292,6 +293,13 @@ static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
wi->umr.dma_info[page_idx].addr + frag_offset,
len, DMA_FROM_DEVICE);
wi->skbs_frags[page_idx]++;
+ WARN_ON(wi->skbs_frags[page_idx] > mlx5e_mpwqe_strides_per_page(rq));
+
+ /* Take a page reference every time we give page to skb (alternative
+ * to original mlx code).
+ */
+ get_page(wi->umr.dma_info[page_idx].page);
+
skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
wi->umr.dma_info[page_idx].page, frag_offset,
len, truesize);
@@ -372,7 +380,6 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
if (unlikely(err))
goto err_unmap;
wi->umr.mtt[i] = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
- page_ref_add(dma_info->page, pg_strides);
wi->skbs_frags[i] = 0;
}
@@ -385,7 +392,6 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
while (--i >= 0) {
struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
- page_ref_sub(dma_info->page, pg_strides);
mlx5e_page_release(rq, dma_info, true);
}
@@ -400,7 +406,7 @@ void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
- page_ref_sub(dma_info->page, pg_strides - wi->skbs_frags[i]);
+ WARN_ON(pg_strides - wi->skbs_frags[i] < 0);
mlx5e_page_release(rq, dma_info, true);
}
}
@@ -565,12 +571,17 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
return;
}
+#if 0
+ /* Disabled since we are seeing checksum faults occurring. This should
+ * not have any noticeable impact (in the short term).
+ */
if (is_first_ethertype_ip(skb)) {
skb->ip_summed = CHECKSUM_COMPLETE;
skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
rq->stats.csum_complete++;
return;
}
+#endif
if (likely((cqe->hds_ip_ext & CQE_L3_OK) &&
(cqe->hds_ip_ext & CQE_L4_OK))) {
@@ -929,6 +940,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
return;
+ WARN_ON(wi->consumed_strides > rq->mpwqe_num_strides);
mlx5e_free_rx_mpwqe(rq, wi);
mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
}
--
2.9.3