netdev - [PATCH net-next 1/9] nfp: flower: increase cmesg reply timeout

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20190116030659.27755-2-jakub.kicinski@netronome.com>
Date:   Tue, 15 Jan 2019 19:06:51 -0800
From:   Jakub Kicinski <jakub.kicinski@...ronome.com>
To:     davem@...emloft.net
Cc:     oss-drivers@...ronome.com, netdev@...r.kernel.org,
        Fred Lotter <frederik.lotter@...ronome.com>
Subject: [PATCH net-next 1/9] nfp: flower: increase cmesg reply timeout

From: Fred Lotter <frederik.lotter@...ronome.com>

QA tests report occasional timeouts on REIFY message replies. Profiling
of the two cmesg reply types under burst conditions, with a 12-core host
under heavy cpu and io load (stress --cpu 12 --io 12), show both PHY MTU
change and REIFY replies can exceed the 10ms timeout. The maximum MTU
reply wait under burst is 16ms, while the maximum REIFY wait under 40 VF
burst is 12ms. Using a 4 VF REIFY burst results in an 8ms maximum wait.
A larger VF burst does increase the delay, but not in a linear enough
way to justify a scaled REIFY delay. The worse case values between
MTU and REIFY appears close enough to justify a common timeout. Pick a
conservative 40ms to make a safer future proof common reply timeout. The
delay only effects the failure case.

Change the REIFY timeout mechanism to use wait_event_timeout() instead
of wait_event_interruptible_timeout(), to match the MTU code. In the
current implementation, theoretically, a signal could interrupt the
REIFY waiting period, with a return code of ERESTARTSYS. However, this is
caught under the general timeout error code EIO. I cannot see the benefit
of exposing the REIFY waiting period to signals with such a short delay
(40ms), while the MTU mechnism does not use the same logic. In the absence
of any reply (wakeup() call), both reply types will wake up the task after
the timeout period. The REIFY timeout applies to the entire representor
group being instantiated (e.g. VFs), while the MTU timeout apples to a
single PHY MTU change.

Signed-off-by: Fred Lotter <frederik.lotter@...ronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@...ronome.com>
---
 .../net/ethernet/netronome/nfp/flower/cmsg.c   |  2 +-
 .../net/ethernet/netronome/nfp/flower/cmsg.h   |  3 +++
 .../net/ethernet/netronome/nfp/flower/main.c   | 18 +++++++-----------
 3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
index 4c5eaf36d5bb..56b22ea32474 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.c
@@ -203,7 +203,7 @@ nfp_flower_cmsg_portreify_rx(struct nfp_app *app, struct sk_buff *skb)
 	}
 
 	atomic_inc(&priv->reify_replies);
-	wake_up_interruptible(&priv->reify_wait_queue);
+	wake_up(&priv->reify_wait_queue);
 }
 
 static void
diff --git a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
index 15f41cfef9f1..4fcaf11ed56e 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/cmsg.h
@@ -97,6 +97,9 @@
 
 #define NFP_FLOWER_WORKQ_MAX_SKBS	30000
 
+/* Cmesg reply (empirical) timeout*/
+#define NFP_FL_REPLY_TIMEOUT		msecs_to_jiffies(40)
+
 #define nfp_flower_cmsg_warn(app, fmt, args...)                         \
 	do {                                                            \
 		if (net_ratelimit())                                    \
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 5059110a1768..8ce20bd38965 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -107,16 +107,14 @@ static int
 nfp_flower_wait_repr_reify(struct nfp_app *app, atomic_t *replies, int tot_repl)
 {
 	struct nfp_flower_priv *priv = app->priv;
-	int err;
 
 	if (!tot_repl)
 		return 0;
 
 	lockdep_assert_held(&app->pf->lock);
-	err = wait_event_interruptible_timeout(priv->reify_wait_queue,
-					       atomic_read(replies) >= tot_repl,
-					       msecs_to_jiffies(10));
-	if (err <= 0) {
+	if (!wait_event_timeout(priv->reify_wait_queue,
+				atomic_read(replies) >= tot_repl,
+				NFP_FL_REPLY_TIMEOUT)) {
 		nfp_warn(app->cpp, "Not all reprs responded to reify\n");
 		return -EIO;
 	}
@@ -601,7 +599,7 @@ nfp_flower_repr_change_mtu(struct nfp_app *app, struct net_device *netdev,
 {
 	struct nfp_flower_priv *app_priv = app->priv;
 	struct nfp_repr *repr = netdev_priv(netdev);
-	int err, ack;
+	int err;
 
 	/* Only need to config FW for physical port MTU change. */
 	if (repr->port->type != NFP_PORT_PHYS_PORT)
@@ -628,11 +626,9 @@ nfp_flower_repr_change_mtu(struct nfp_app *app, struct net_device *netdev,
 	}
 
 	/* Wait for fw to ack the change. */
-	ack = wait_event_timeout(app_priv->mtu_conf.wait_q,
-				 nfp_flower_check_ack(app_priv),
-				 msecs_to_jiffies(10));
-
-	if (!ack) {
+	if (!wait_event_timeout(app_priv->mtu_conf.wait_q,
+				nfp_flower_check_ack(app_priv),
+				NFP_FL_REPLY_TIMEOUT)) {
 		spin_lock_bh(&app_priv->mtu_conf.lock);
 		app_priv->mtu_conf.requested_val = 0;
 		spin_unlock_bh(&app_priv->mtu_conf.lock);
-- 
2.19.2