lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20231107230822.371443-23-ankur.a.arora@oracle.com>
Date:   Tue,  7 Nov 2023 15:08:15 -0800
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     linux-kernel@...r.kernel.org
Cc:     tglx@...utronix.de, peterz@...radead.org,
        torvalds@...ux-foundation.org, paulmck@...nel.org,
        linux-mm@...ck.org, x86@...nel.org, akpm@...ux-foundation.org,
        luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, willy@...radead.org, mgorman@...e.de,
        jon.grimm@....com, bharata@....com, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        jgross@...e.com, andrew.cooper3@...rix.com, mingo@...nel.org,
        bristot@...nel.org, mathieu.desnoyers@...icios.com,
        geert@...ux-m68k.org, glaubitz@...sik.fu-berlin.de,
        anton.ivanov@...bridgegreys.com, mattst88@...il.com,
        krypton@...ich-teichert.org, rostedt@...dmis.org,
        David.Laight@...LAB.COM, richard@....at, mjguzik@...il.com,
        Ankur Arora <ankur.a.arora@...cle.com>,
        Marek Lindner <mareklindner@...mailbox.ch>,
        Simon Wunderlich <sw@...onwunderlich.de>,
        Antonio Quartulli <a@...table.cc>,
        Sven Eckelmann <sven@...fation.org>,
        "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Roopa Prabhu <roopa@...dia.com>,
        Nikolay Aleksandrov <razor@...ckwall.org>,
        David Ahern <dsahern@...nel.org>,
        Pablo Neira Ayuso <pablo@...filter.org>,
        Jozsef Kadlecsik <kadlec@...filter.org>,
        Florian Westphal <fw@...len.de>,
        Willem de Bruijn <willemdebruijn.kernel@...il.com>,
        Matthieu Baerts <matttbe@...nel.org>,
        Mat Martineau <martineau@...nel.org>,
        Marcelo Ricardo Leitner <marcelo.leitner@...il.com>,
        Xin Long <lucien.xin@...il.com>,
        Trond Myklebust <trond.myklebust@...merspace.com>,
        Anna Schumaker <anna@...nel.org>,
        Jon Maloy <jmaloy@...hat.com>,
        Ying Xue <ying.xue@...driver.com>,
        Martin Schiller <ms@....tdt.de>
Subject: [RFC PATCH 79/86] treewide: net: remove cond_resched()

There are broadly three sets of uses of cond_resched():

1.  Calls to cond_resched() out of the goodness of our heart,
    otherwise known as avoiding lockup splats.

2.  Open coded variants of cond_resched_lock() which call
    cond_resched().

3.  Retry or error handling loops, where cond_resched() is used as a
    quick alternative to spinning in a tight-loop.

When running under a full preemption model, the cond_resched() reduces
to a NOP (not even a barrier) so removing it obviously cannot matter.

But considering only voluntary preemption models (for say code that
has been mostly tested under those), for set-1 and set-2 the
scheduler can now preempt kernel tasks running beyond their time
quanta anywhere they are preemptible() [1]. Which removes any need
for these explicitly placed scheduling points.

The cond_resched() calls in set-3 are a little more difficult.
To start with, given it's NOP character under full preemption, it
never actually saved us from a tight loop.
With voluntary preemption, it's not a NOP, but it might as well be --
for most workloads the scheduler does not have an interminable supply
of runnable tasks on the runqueue.

So, cond_resched() is useful to not get softlockup splats, but not
terribly good for error handling. Ideally, these should be replaced
with some kind of timed or event wait.
For now we use cond_resched_stall(), which tries to schedule if
possible, and executes a cpu_relax() if not.

Most of the uses here are in set-1 (some right after we give up a
lock or enable bottom-halves, causing an explicit preemption check.)

We can remove all of them.

[1] https://lore.kernel.org/lkml/20231107215742.363031-1-ankur.a.arora@oracle.com/

Cc: Marek Lindner <mareklindner@...mailbox.ch> 
Cc: Simon Wunderlich <sw@...onwunderlich.de> 
Cc: Antonio Quartulli <a@...table.cc> 
Cc: Sven Eckelmann <sven@...fation.org> 
Cc: "David S. Miller" <davem@...emloft.net> 
Cc: Eric Dumazet <edumazet@...gle.com> 
Cc: Jakub Kicinski <kuba@...nel.org> 
Cc: Paolo Abeni <pabeni@...hat.com> 
Cc: Roopa Prabhu <roopa@...dia.com> 
Cc: Nikolay Aleksandrov <razor@...ckwall.org> 
Cc: David Ahern <dsahern@...nel.org> 
Cc: Pablo Neira Ayuso <pablo@...filter.org> 
Cc: Jozsef Kadlecsik <kadlec@...filter.org> 
Cc: Florian Westphal <fw@...len.de> 
Cc: Willem de Bruijn <willemdebruijn.kernel@...il.com> 
Cc: Matthieu Baerts <matttbe@...nel.org> 
Cc: Mat Martineau <martineau@...nel.org> 
Cc: Marcelo Ricardo Leitner <marcelo.leitner@...il.com> 
Cc: Xin Long <lucien.xin@...il.com> 
Cc: Trond Myklebust <trond.myklebust@...merspace.com> 
Cc: Anna Schumaker <anna@...nel.org> 
Cc: Jon Maloy <jmaloy@...hat.com> 
Cc: Ying Xue <ying.xue@...driver.com> 
Cc: Martin Schiller <ms@....tdt.de> 
Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
---
 net/batman-adv/tp_meter.c       |  2 --
 net/bpf/test_run.c              |  1 -
 net/bridge/br_netlink.c         |  1 -
 net/ipv6/fib6_rules.c           |  1 -
 net/ipv6/netfilter/ip6_tables.c |  2 --
 net/ipv6/udp.c                  |  2 --
 net/mptcp/mptcp_diag.c          |  2 --
 net/mptcp/pm_netlink.c          |  5 -----
 net/mptcp/protocol.c            |  1 -
 net/rds/ib_recv.c               |  2 --
 net/rds/tcp.c                   |  2 +-
 net/rds/threads.c               |  1 -
 net/rxrpc/call_object.c         |  2 +-
 net/sctp/socket.c               |  1 -
 net/sunrpc/cache.c              | 11 +++++++++--
 net/sunrpc/sched.c              |  2 +-
 net/sunrpc/svc_xprt.c           |  1 -
 net/sunrpc/xprtsock.c           |  2 --
 net/tipc/core.c                 |  2 +-
 net/tipc/topsrv.c               |  3 ---
 net/unix/af_unix.c              |  5 ++---
 net/x25/af_x25.c                |  1 -
 22 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/net/batman-adv/tp_meter.c b/net/batman-adv/tp_meter.c
index 7f3dd3c393e0..a0b160088c33 100644
--- a/net/batman-adv/tp_meter.c
+++ b/net/batman-adv/tp_meter.c
@@ -877,8 +877,6 @@ static int batadv_tp_send(void *arg)
 		/* right-shift the TWND */
 		if (!err)
 			tp_vars->last_sent += payload_len;
-
-		cond_resched();
 	}
 
 out:
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 0841f8d82419..f4558fdfdf74 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -81,7 +81,6 @@ static bool bpf_test_timer_continue(struct bpf_test_timer *t, int iterations,
 		/* During iteration: we need to reschedule between runs. */
 		t->time_spent += ktime_get_ns() - t->time_start;
 		bpf_test_timer_leave(t);
-		cond_resched();
 		bpf_test_timer_enter(t);
 	}
 
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 10f0d33d8ccf..f326b034245f 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -780,7 +780,6 @@ int br_process_vlan_info(struct net_bridge *br,
 					       v - 1, rtm_cmd);
 				v_change_start = 0;
 			}
-			cond_resched();
 		}
 		/* v_change_start is set only if the last/whole range changed */
 		if (v_change_start)
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 7c2003833010..528e6a582c21 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -500,7 +500,6 @@ static void __net_exit fib6_rules_net_exit_batch(struct list_head *net_list)
 	rtnl_lock();
 	list_for_each_entry(net, net_list, exit_list) {
 		fib_rules_unregister(net->ipv6.fib6_rules_ops);
-		cond_resched();
 	}
 	rtnl_unlock();
 }
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index fd9f049d6d41..704f14c4146f 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -778,7 +778,6 @@ get_counters(const struct xt_table_info *t,
 
 			ADD_COUNTER(counters[i], bcnt, pcnt);
 			++i;
-			cond_resched();
 		}
 	}
 }
@@ -798,7 +797,6 @@ static void get_old_counters(const struct xt_table_info *t,
 			ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
 			++i;
 		}
-		cond_resched();
 	}
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 86b5d509a468..032d4f7e6ed3 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -443,8 +443,6 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	}
 	kfree_skb(skb);
 
-	/* starting over for a new packet, but check if we need to yield */
-	cond_resched();
 	msg->msg_flags &= ~MSG_TRUNC;
 	goto try_again;
 }
diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c
index 8df1bdb647e2..82bf16511476 100644
--- a/net/mptcp/mptcp_diag.c
+++ b/net/mptcp/mptcp_diag.c
@@ -141,7 +141,6 @@ static void mptcp_diag_dump_listeners(struct sk_buff *skb, struct netlink_callba
 		spin_unlock(&ilb->lock);
 		rcu_read_unlock();
 
-		cond_resched();
 		diag_ctx->l_num = 0;
 	}
 
@@ -190,7 +189,6 @@ static void mptcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			diag_ctx->s_num--;
 			break;
 		}
-		cond_resched();
 	}
 
 	if ((r->idiag_states & TCPF_LISTEN) && r->id.idiag_dport == 0)
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index 9661f3812682..b48d2636ce8d 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -1297,7 +1297,6 @@ static int mptcp_nl_add_subflow_or_signal_addr(struct net *net)
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1443,7 +1442,6 @@ static int mptcp_nl_remove_subflow_and_signal_addr(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1478,7 +1476,6 @@ static int mptcp_nl_remove_id_zero_address(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return 0;
@@ -1594,7 +1591,6 @@ static void mptcp_nl_remove_addrs_list(struct net *net,
 		}
 
 		sock_put(sk);
-		cond_resched();
 	}
 }
 
@@ -1878,7 +1874,6 @@ static int mptcp_nl_set_flags(struct net *net,
 
 next:
 		sock_put(sk);
-		cond_resched();
 	}
 
 	return ret;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 886ab689a8ae..8c4a51903b23 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3383,7 +3383,6 @@ static void mptcp_release_cb(struct sock *sk)
 		if (flags & BIT(MPTCP_RETRANSMIT))
 			__mptcp_retrans(sk);
 
-		cond_resched();
 		spin_lock_bh(&sk->sk_lock.slock);
 	}
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index e53b7f266bd7..d2111e895a10 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -459,8 +459,6 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
 	    rds_ib_ring_empty(&ic->i_recv_ring))) {
 		queue_delayed_work(rds_wq, &conn->c_recv_w, 1);
 	}
-	if (can_wait)
-		cond_resched();
 }
 
 /*
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 2dba7505b414..9b4d07235904 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -530,7 +530,7 @@ static void rds_tcp_accept_worker(struct work_struct *work)
 					       rds_tcp_accept_w);
 
 	while (rds_tcp_accept_one(rtn->rds_tcp_listen_sock) == 0)
-		cond_resched();
+		cond_resched_stall();
 }
 
 void rds_tcp_accept_work(struct sock *sk)
diff --git a/net/rds/threads.c b/net/rds/threads.c
index 1f424cbfcbb4..2a75b48769e8 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -198,7 +198,6 @@ void rds_send_worker(struct work_struct *work)
 	if (rds_conn_path_state(cp) == RDS_CONN_UP) {
 		clear_bit(RDS_LL_SEND_FULL, &cp->cp_flags);
 		ret = rds_send_xmit(cp);
-		cond_resched();
 		rdsdebug("conn %p ret %d\n", cp->cp_conn, ret);
 		switch (ret) {
 		case -EAGAIN:
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c
index 773eecd1e979..d2704a492a3c 100644
--- a/net/rxrpc/call_object.c
+++ b/net/rxrpc/call_object.c
@@ -755,7 +755,7 @@ void rxrpc_destroy_all_calls(struct rxrpc_net *rxnet)
 			       call->flags, call->events);
 
 			spin_unlock(&rxnet->call_lock);
-			cond_resched();
+			cpu_relax();
 			spin_lock(&rxnet->call_lock);
 		}
 
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 7f89e43154c0..448112919848 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -8364,7 +8364,6 @@ static int sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
 			break;
 		next:
 			spin_unlock_bh(&head->lock);
-			cond_resched();
 		} while (--remaining > 0);
 
 		/* Exhausted local port range during search? */
diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index 95ff74706104..3bcacfbbf35f 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -521,10 +521,17 @@ static void do_cache_clean(struct work_struct *work)
  */
 void cache_flush(void)
 {
+	/*
+	 * We call cache_clean() in what is seemingly a tight loop. But,
+	 * the scheduler can always preempt us when we give up the spinlock
+	 * in cache_clean().
+	 */
+
 	while (cache_clean() != -1)
-		cond_resched();
+		;
+
 	while (cache_clean() != -1)
-		cond_resched();
+		;
 }
 EXPORT_SYMBOL_GPL(cache_flush);
 
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 6debf4fd42d4..5b7a3c8a271f 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -950,7 +950,7 @@ static void __rpc_execute(struct rpc_task *task)
 		 * Lockless check for whether task is sleeping or not.
 		 */
 		if (!RPC_IS_QUEUED(task)) {
-			cond_resched();
+			cond_resched_stall();
 			continue;
 		}
 
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 4cfe9640df48..d2486645d725 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -851,7 +851,6 @@ void svc_recv(struct svc_rqst *rqstp)
 		goto out;
 
 	try_to_freeze();
-	cond_resched();
 	if (kthread_should_stop())
 		goto out;
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index a15bf2ede89b..50c1f2556b3e 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -776,7 +776,6 @@ static void xs_stream_data_receive(struct sock_xprt *transport)
 		if (ret < 0)
 			break;
 		read += ret;
-		cond_resched();
 	}
 	if (ret == -ESHUTDOWN)
 		kernel_sock_shutdown(transport->sock, SHUT_RDWR);
@@ -1412,7 +1411,6 @@ static void xs_udp_data_receive(struct sock_xprt *transport)
 			break;
 		xs_udp_data_read_skb(&transport->xprt, sk, skb);
 		consume_skb(skb);
-		cond_resched();
 	}
 	xs_poll_check_readable(transport);
 out:
diff --git a/net/tipc/core.c b/net/tipc/core.c
index 434e70eabe08..ed4cd5faa387 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -119,7 +119,7 @@ static void __net_exit tipc_exit_net(struct net *net)
 	tipc_crypto_stop(&tipc_net(net)->crypto_tx);
 #endif
 	while (atomic_read(&tn->wq_count))
-		cond_resched();
+		cond_resched_stall();
 }
 
 static void __net_exit tipc_pernet_pre_exit(struct net *net)
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 8ee0c07d00e9..13cd3816fb52 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -277,7 +277,6 @@ static void tipc_conn_send_to_sock(struct tipc_conn *con)
 			ret = kernel_sendmsg(con->sock, &msg, &iov,
 					     1, sizeof(*evt));
 			if (ret == -EWOULDBLOCK || ret == 0) {
-				cond_resched();
 				return;
 			} else if (ret < 0) {
 				return tipc_conn_close(con);
@@ -288,7 +287,6 @@ static void tipc_conn_send_to_sock(struct tipc_conn *con)
 
 		/* Don't starve users filling buffers */
 		if (++count >= MAX_SEND_MSG_COUNT) {
-			cond_resched();
 			count = 0;
 		}
 		spin_lock_bh(&con->outqueue_lock);
@@ -426,7 +424,6 @@ static void tipc_conn_recv_work(struct work_struct *work)
 
 		/* Don't flood Rx machine */
 		if (++count >= MAX_RECV_MSG_COUNT) {
-			cond_resched();
 			count = 0;
 		}
 	}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..bb1367f93db2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1184,10 +1184,9 @@ static int unix_autobind(struct sock *sk)
 		unix_table_double_unlock(net, old_hash, new_hash);
 
 		/* __unix_find_socket_byname() may take long time if many names
-		 * are already in use.
+		 * are already in use. The unlock above would have allowed the
+		 * scheduler to preempt if preemption was needed.
 		 */
-		cond_resched();
-
 		if (ordernum == lastnum) {
 			/* Give up if all names seems to be in use. */
 			err = -ENOSPC;
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 0fb5143bec7a..2a6b05bcb53d 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -343,7 +343,6 @@ static unsigned int x25_new_lci(struct x25_neigh *nb)
 			lci = 0;
 			break;
 		}
-		cond_resched();
 	}
 
 	return lci;
-- 
2.31.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ