Message-ID: <20240925000622.1972325-1-tj@kernel.org>
Date: Tue, 24 Sep 2024 14:06:02 -1000
From: Tejun Heo <tj@...nel.org>
To: void@...ifault.com
Cc: kernel-team@...a.com,
	linux-kernel@...r.kernel.org,
	sched-ext@...a.com
Subject: [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Split %SCX_DSQ_GLOBAL per-node

In bypass mode, the global DSQ is used to schedule all tasks in simple FIFO
order. All tasks are queued into the global DSQ and all CPUs try to execute
tasks from it. This creates a lot of cross-node cacheline accesses and
scheduling across node boundaries, and can lead to livelock conditions
where the system takes tens of minutes to disable the BPF scheduler while
executing in bypass mode.

This patchset splits the global DSQ per NUMA node. Each node has its own
global DSQ. When a task is dispatched to SCX_DSQ_GLOBAL, it's put into the
global DSQ of the node that the task's CPU belongs to, and each CPU
consumes only from the global DSQ of its own node.
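
To illustrate the idea only, here is a minimal user-space sketch of per-node
FIFO queues. It is not the code in kernel/sched/ext.c: the struct, the
function names (toy_cpu_to_node, toy_dispatch_global, toy_consume_global)
and the topology constants are all hypothetical, and the fixed CPU-to-node
mapping is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy topology: 2 nodes with 4 CPUs each. */
#define TOY_NR_NODES      2
#define TOY_CPUS_PER_NODE 4
#define TOY_NR_CPUS       (TOY_NR_NODES * TOY_CPUS_PER_NODE)
#define TOY_DSQ_CAP       64

/* Hypothetical stand-in for a DSQ: a fixed-size FIFO ring of task ids. */
struct toy_dsq {
	int tasks[TOY_DSQ_CAP];
	size_t head;	/* index of the next task to consume */
	size_t len;	/* number of queued tasks */
};

/* One "global" queue per node instead of a single shared queue. */
struct toy_dsq toy_global_dsqs[TOY_NR_NODES];

/* Assumed mapping: CPUs 0-3 are on node 0, CPUs 4-7 on node 1. */
int toy_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return cpu / TOY_CPUS_PER_NODE;
}

/* Queue @task_id on the global queue of the node owning @task_cpu. */
int toy_dispatch_global(int task_id, int task_cpu)
{
	struct toy_dsq *q = &toy_global_dsqs[toy_cpu_to_node(task_cpu)];

	if (q->len == TOY_DSQ_CAP)
		return -1;	/* queue full */
	q->tasks[(q->head + q->len) % TOY_DSQ_CAP] = task_id;
	q->len++;
	return 0;
}

/*
 * Pop the oldest task from the global queue of @cpu's own node.
 * Returns -1 if that queue is empty; other nodes are never inspected.
 */
int toy_consume_global(int cpu)
{
	struct toy_dsq *q = &toy_global_dsqs[toy_cpu_to_node(cpu)];
	int task_id;

	if (q->len == 0)
		return -1;
	task_id = q->tasks[q->head];
	q->head = (q->head + 1) % TOY_DSQ_CAP;
	q->len--;
	return task_id;
}
```

In this sketch a task dispatched from CPU 5 can only be picked up by CPUs
4-7, so queue accesses stay within one node. Locking is omitted entirely.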

This resolves a livelock condition which could be reliably triggered on a
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
`stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
and disabling an SCX scheduler.

This patchset contains the following patches:

 0001-scx_flatcg-Use-a-user-DSQ-for-fallback-instead-of-SC.patch
 0002-sched_ext-Allow-only-user-DSQs-for-scx_bpf_consume-s.patch
 0003-sched_ext-Relocate-find_user_dsq.patch
 0004-sched_ext-Split-the-global-DSQ-per-NUMA-node.patch
 0005-sched_ext-Use-shorter-slice-while-bypassing.patch

 0001-0003 are preparations.

 0004 splits %SCX_DSQ_GLOBAL per-node.

 0005 reduces the time slice used while bypassing. This can make e.g.
 unloading of the BPF scheduler complete faster under heavy contention.
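
As a rough back-of-the-envelope model of why a shorter slice helps, the
sketch below bounds how long the task at the tail of a FIFO waits when every
task ahead of it runs for a full slice. The function name and the model are
hypothetical and simplified, and the numbers in the usage note are
illustrative values, not the slice lengths used by the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: upper bound, in nanoseconds, on the wait of the
 * last of @nr_queued tasks in a FIFO drained by @nr_cpus CPUs, assuming
 * every task consumes its whole @slice_ns. @nr_cpus must be non-zero.
 */
uint64_t toy_fifo_wait_bound_ns(uint64_t nr_queued, uint64_t nr_cpus,
				uint64_t slice_ns)
{
	/* Number of full "rounds" needed to drain the queue, rounded up. */
	uint64_t rounds = (nr_queued + nr_cpus - 1) / nr_cpus;

	return rounds * slice_ns;
}
```

For example, with 1000 queued tasks and 100 CPUs, an assumed 20ms slice
gives a bound of 200ms per pass over the queue, while an assumed 5ms slice
gives 50ms.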

This patchset can also be found in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-split-global

diffstat follows. Thanks.

 kernel/sched/ext.c               |  109 ++++++++++++++++++++++++++++++++++++++++++-------------------
 tools/sched_ext/scx_flatcg.bpf.c |   17 +++++++--
 2 files changed, 89 insertions(+), 37 deletions(-)

--
tejun
