Date:   Thu, 28 Sep 2023 21:11:10 -0000
From:   "tip-bot2 for Joel Fernandes (Google)" <tip-bot2@...utronix.de>
To:     linux-tip-commits@...r.kernel.org
Cc:     "Joel Fernandes (Google)" <joel@...lfernandes.org>,
        Ingo Molnar <mingo@...nel.org>,
        "Paul E. McKenney" <paulmck@...nel.org>, stable@...r.kernel.org,
        x86@...nel.org, linux-kernel@...r.kernel.org
Subject: [tip: sched/urgent] sched/rt: Fix live lock between
 select_fallback_rq() and RT push

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     fc09027786c900368de98d03d40af058bcb01ad9
Gitweb:        https://git.kernel.org/tip/fc09027786c900368de98d03d40af058bcb01ad9
Author:        Joel Fernandes (Google) <joel@...lfernandes.org>
AuthorDate:    Sat, 23 Sep 2023 01:14:08 
Committer:     Ingo Molnar <mingo@...nel.org>
CommitterDate: Thu, 28 Sep 2023 22:58:13 +02:00

sched/rt: Fix live lock between select_fallback_rq() and RT push

During RCU-boost testing with the TREE03 rcutorture config, I found that
the machine locked up after a few hours.

Tracing showed a live lock between two CPUs. One CPU has an RT task
running, while another CPU, which also has an RT task running, is being
offlined. During the offlining, all threads are migrated away, and the
migration thread is repeatedly scheduled to migrate the tasks still
actively running on the CPU being offlined. This results in a live lock:
select_fallback_rq() keeps picking the CPU that the other RT task is
already running on, and the migrated task then just gets pushed back to
the CPU being offlined.
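
For illustration only, a tiny userspace C model of that ping-pong; it is
a sketch of the described behaviour, not kernel code, and the CPU layout,
priorities and helper names (fallback_cpu(), push_target()) are made-up
stand-ins for select_fallback_rq() and the cpupri/RT-push path:

/* Illustration only: NOT kernel code, just a model of the bounce. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 2

static bool cpu_active[NR_CPUS] = { true, false }; /* CPU 1 is going offline   */
static int  rt_prio[NR_CPUS]    = { 90,   0     }; /* CPU 0 runs a prio-90 task */

/* Stand-in for select_fallback_rq(): move work off the offlining CPU. */
static int fallback_cpu(int from)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != from && cpu_active[cpu])
			return cpu;
	return from;
}

/*
 * Stand-in for the cpupri/RT-push lookup: find a CPU running lower-prio
 * work than @prio. Without @filter_inactive it happily returns the
 * offlining CPU, which is the live lock.
 */
static int push_target(int prio, int self, bool filter_inactive)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self)
			continue;
		if (filter_inactive && !cpu_active[cpu])
			continue;                /* the fix, modeled */
		if (rt_prio[cpu] < prio)
			return cpu;
	}
	return -1;
}

int main(void)
{
	bool with_fix = true;   /* set to false to watch the task bounce forever */
	int task_cpu = 1;       /* the RT task (prio 50) sits on the offlining CPU */
	int task_prio = 50;

	for (int step = 0; step < 10; step++) {
		if (!cpu_active[task_cpu]) {
			/* migration thread kicks the task off the dying CPU */
			task_cpu = fallback_cpu(task_cpu);
			printf("step %d: migrated to CPU %d\n", step, task_cpu);
			continue;
		}
		/* CPU 0 already runs prio 90 here, so the push logic looks
		 * for a lower-priority CPU to take the task */
		int target = push_target(task_prio, task_cpu, with_fix);
		if (target < 0) {
			printf("step %d: no push target, task stays on CPU %d\n",
			       step, task_cpu);
			return 0;
		}
		task_cpu = target;
		printf("step %d: pushed to CPU %d\n", step, task_cpu);
	}
	printf("still bouncing after 10 steps: live lock\n");
	return 1;
}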

It is pointless anyway to pick CPUs to push tasks to if those CPUs are
being offlined, since the tasks will only get migrated away again
somewhere else. It can also add unwanted latency for the task.

Fix these issues by having the RT scheduler skip CPUs that are not
'active' for scheduling, using the cpu_active_mask. Other parts of
core.c already use cpu_active_mask to prevent tasks from being put on
CPUs that are going offline.

With this fix I ran the tests for days and could not reproduce the
hang. Without the patch, I hit it in a few hours.

Signed-off-by: Joel Fernandes (Google) <joel@...lfernandes.org>
Signed-off-by: Ingo Molnar <mingo@...nel.org>
Tested-by: Paul E. McKenney <paulmck@...nel.org>
Cc: stable@...r.kernel.org
Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
---
 kernel/sched/cpupri.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index a286e72..42c40cf 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 
 	if (lowest_mask) {
 		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit
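
As a userspace-only illustration of what the added cpumask_and() changes,
here plain unsigned long bitmasks stand in for struct cpumask, and the
example mask values are made up:

#include <stdio.h>

int main(void)
{
	unsigned long cpus_mask       = 0xf; /* task may run on CPUs 0-3    */
	unsigned long vec_mask        = 0xa; /* CPUs 1 and 3 run lower-prio */
	unsigned long cpu_active_mask = 0x7; /* CPU 3 is going offline      */

	unsigned long lowest_mask = cpus_mask & vec_mask;
	printf("candidates before the fix: 0x%lx (CPUs 1 and 3)\n", lowest_mask);

	lowest_mask &= cpu_active_mask;  /* the added line, in miniature */
	printf("candidates after the fix:  0x%lx (CPU 1 only)\n", lowest_mask);
	return 0;
}

With the offlining CPU dropped from lowest_mask, cpupri can no longer
nominate it as a push target, which breaks the cycle described above.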
