lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240403085245.335072-1-chenridong@huawei.com>
Date: Wed, 3 Apr 2024 08:52:45 +0000
From: Chen Ridong <chenridong@...wei.com>
To: <longman@...hat.com>, <lizefan.x@...edance.com>, <tj@...nel.org>,
	<hannes@...xchg.org>, <daniel.m.jordan@...cle.com>
CC: <peterz@...radead.org>, <cgroups@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <wangweiyang2@...wei.com>,
	<lujialin4@...wei.com>, <xiujianfeng@...wei.com>, <caixinchen1@...wei.com>,
	<chenridong@...wei.com>
Subject: [PATCH] cpuset: fix race between rebuild scheduler domains and hotplug work

When offlining cpus, it holds cpu_hotplug_lock and call
cpuset_hotplug_workfn asynchronously, which holds and releases
cpuset_mutex repeatly to update cpusets, and it will release
cpu_hotplug_lock before cpuset_hotplug_workfn finish. It means that some
interfaces like cpuset_write_resmask holding two locks may rebuild
scheduler domains when some cpusets are not refreshed, which may lead to
generate domains with offlining cpus and will panic.

As commit 406100f3da08 ("cpuset: fix race between hotplug work and later
 CPU offline")  mentioned. This problem happen in cgroup v2:

This problem can also happen in cgroup v1 pressure test, which onlines
and offlines cpus, and sets cpuset.cpus to rebuild domains with
sched_load_balance off.

smpboot: CPU 1 is now offline
BUG: unable to handle page fault for address: 0000342bad598668
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 54 PID: 9966 Comm: sh Kdump: loaded Tainted: G S         OE      6.8.0-rc7+ #5
Hardware name: Powerleader PR2715P3/T1DM-E2, BIOS OEM57.22 11/23/2021
RIP: 0010:sd_init+0x204/0x390
Code: 00 02 00 00 a8 80 0f 84 ad 00 00 00 41 c7 44 24 2c 6e 00 00 00 85 d2 74 2f 48 63 54 24 18 49 8b 45 20 48 8b 14 d5 60 0b 79 ab <48> 8b 04 02 49 89 84 24 10 01 00 00 f0 ff 00 49 8b 84 24 10 01 00
RSP: 0018:ffff9f90669afc18 EFLAGS: 00010206
RAX: 0000342c00e62668 RBX: 0000000000000000 RCX: 00000000ffffffa0
RDX: ffffffffac736000 RSI: 0000000000000060 RDI: ffff8b24b8143930
RBP: ffff8b24b8143920 R08: 0000000000000060 R09: ffff8b2635783670
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8b24b8143800
R13: ffff8ae687511600 R14: ffffffffabc6b9b0 R15: ffff8b25c0040ea0
FS:  00007fd93c734740(0000) GS:ffff8b24bca00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000342bad598668 CR3: 00000001e0934005 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die+0x24/0x70
 ? page_fault_oops+0x82/0x150
 ? local_clock_noinstr+0xf/0xb0
 ? exc_page_fault+0x69/0x1b0
 ? asm_exc_page_fault+0x26/0x30
 ? sd_init+0x204/0x390
 ? sd_init+0x11c/0x390
 build_sched_domains+0x171/0x640
 partition_sched_domains_locked+0x2a1/0x3c0
 rebuild_sched_domains_locked+0x14f/0x200
 update_cpumask+0x27c/0x5f0
 cpuset_write_resmask+0x423/0x530
 kernfs_fop_write_iter+0x160/0x220
 vfs_write+0x355/0x480
 ksys_write+0x69/0xf0
 do_syscall_64+0x66/0x180
 entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7fd93c4fc907
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff153c7f08 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fd93c4fc907
RDX: 0000000000000002 RSI: 000055d055679900 RDI: 0000000000000001
RBP: 000055d055679900 R08: 00007fd93c5b01e0 R09: 00007fd93c5b0260
R10: 00007fd93c5b0160 R11: 0000000000000246 R12: 0000000000000002
R13: 00007fd93c5f25a0 R14: 0000000000000002 R15: 00007fd93c5f27a0

It must guarantee that cpus in domains passing to
partition_and_rebuild_sched_domains must be active. So the domains should
be checked after generate_sched_domains.

Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Chen Ridong <chenridong@...wei.com>
---
 kernel/cgroup/cpuset.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 927bef3a598a..0e0d469c2591 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1210,6 +1210,7 @@ static void rebuild_sched_domains_locked(void)
 	cpumask_var_t *doms;
 	struct cpuset *cs;
 	int ndoms;
+	int i;
 
 	lockdep_assert_cpus_held();
 	lockdep_assert_held(&cpuset_mutex);
@@ -1251,6 +1252,12 @@ static void rebuild_sched_domains_locked(void)
 	/* Generate domain masks and attrs */
 	ndoms = generate_sched_domains(&doms, &attr);
 
+	/* guarantee no CPU offlining in doms */
+	for (i = 0; i < ndoms; ++i) {
+		if (!cpumask_subset(doms[i], cpu_active_mask))
+			return;
+	}
+
 	/* Have scheduler rebuild the domains */
 	partition_and_rebuild_sched_domains(ndoms, doms, attr);
 }
-- 
2.34.1


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ