lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 16 Nov 2022 09:22:07 -0000
From:   "tip-bot2 for Tianchen Ding" <tip-bot2@...utronix.de>
To:     linux-tip-commits@...r.kernel.org
Cc:     Tianchen Ding <dtcccc@...ux.alibaba.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>, x86@...nel.org,
        linux-kernel@...r.kernel.org
Subject: [tip: sched/core] sched: Clear ttwu_pending after enqueue_task()

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Gitweb:        https://git.kernel.org/tip/d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991
Author:        Tianchen Ding <dtcccc@...ux.alibaba.com>
AuthorDate:    Fri, 04 Nov 2022 10:36:01 +08:00
Committer:     Peter Zijlstra <peterz@...radead.org>
CommitterDate: Wed, 16 Nov 2022 10:13:05 +01:00

sched: Clear ttwu_pending after enqueue_task()

We found a long tail latency in schbench whem m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on
wakelist if wakee cpu is idle") because it tends to use wakelist.

Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                base      base+revert_f3dd3f674555   base+this_patch
50.0000th:         9                            13                 9
75.0000th:        12                            19                12
90.0000th:        15                            22                15
95.0000th:        18                            24                17
*99.0000th:       27                            31                24
99.5000th:      3364                            33                27
99.9000th:     12560                            36                30

We also tested on unixbench and hackbench, and saw no performance
change.

Signed-off-by: Tianchen Ding <dtcccc@...ux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
Acked-by: Mel Gorman <mgorman@...e.de>
Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com
---
 kernel/sched/core.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 07ac08c..314c2c0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3739,13 +3739,6 @@ void sched_ttwu_pending(void *arg)
 	if (!llist)
 		return;
 
-	/*
-	 * rq::ttwu_pending racy indication of out-standing wakeups.
-	 * Races such that false-negatives are possible, since they
-	 * are shorter lived that false-positives would be.
-	 */
-	WRITE_ONCE(rq->ttwu_pending, 0);
-
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -3759,6 +3752,17 @@ void sched_ttwu_pending(void *arg)
 		ttwu_do_activate(rq, p, p->sched_remote_wakeup ? WF_MIGRATED : 0, &rf);
 	}
 
+	/*
+	 * Must be after enqueueing at least once task such that
+	 * idle_cpu() does not observe a false-negative -- if it does,
+	 * it is possible for select_idle_siblings() to stack a number
+	 * of tasks on this CPU during that window.
+	 *
+	 * It is ok to clear ttwu_pending when another task pending.
+	 * We will receive IPI after local irq enabled and then enqueue it.
+	 * Since now nr_running > 0, idle_cpu() will always get correct result.
+	 */
+	WRITE_ONCE(rq->ttwu_pending, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ