Message-Id: <20250904074505.1722678-1-zhouchuyi@bytedance.com>
Date: Thu, 4 Sep 2025 15:45:02 +0800
From: Chuyi Zhou <zhouchuyi@...edance.com>
To: tj@...nel.org,
mkoutny@...e.com,
hannes@...xchg.org,
longman@...hat.com
Cc: linux-kernel@...r.kernel.org,
Chuyi Zhou <zhouchuyi@...edance.com>
Subject: [PATCH 0/3] Defer flushing of the cpuset_migrate_mm_wq to task_work
Currently, cpuset_attach() has to wait synchronously for flush_workqueue()
to complete. The time needed to flush cpuset_migrate_mm_wq depends on how
much mm migration cpusets have initiated at that moment. When cpuset.mems
is modified for a cgroup that occupies a large amount of memory, it may
trigger extensive mm migration, causing cpuset_attach() to block in
flush_workqueue() for an extended period.

    cgroup attach operation    | someone changes cpuset.mems
                               |
-------------------------------+-------------------------------
 __cgroup_procs_write()        | cpuset_write_resmask()
  cgroup_kn_lock_live()        |
  cpuset_attach()              |  cpuset_migrate_mm()
  cpuset_post_attach()         |
   flush_workqueue(cpuset_migrate_mm_wq);

This is dangerous because cpuset_attach() runs inside the cgroup_mutex
critical section, so a long flush can end up blocking every cgroup-related
operation in the system. We hit this issue in production, and it can be
easily reproduced locally with the script at the end of this message. The
hung task report looks like this:
[Thu Sep 4 14:51:39 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Sep 4 14:51:39 2025] task:tee state:D stack:0 pid:13330 tgid:13330 ppid:13321 task_flags:0x400100 flags:0x00004000
[Thu Sep 4 14:51:39 2025] Call Trace:
[Thu Sep 4 14:51:39 2025] <TASK>
[Thu Sep 4 14:51:39 2025] __schedule+0xcc1/0x1c60
[Thu Sep 4 14:51:39 2025] ? find_held_lock+0x2d/0xa0
[Thu Sep 4 14:51:39 2025] schedule+0x3e/0xe0
[Thu Sep 4 14:51:39 2025] schedule_preempt_disabled+0x15/0x30
[Thu Sep 4 14:51:39 2025] __mutex_lock+0x928/0x1230
[Thu Sep 4 14:51:39 2025] ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep 4 14:51:39 2025] ? cgroup_kn_lock_live+0x4a/0x240
[Thu Sep 4 14:51:39 2025] cgroup_kn_lock_live+0x4a/0x240
[Thu Sep 4 14:51:39 2025] __cgroup_procs_write+0x38/0x210
[Thu Sep 4 14:51:39 2025] cgroup_procs_write+0x17/0x30
[Thu Sep 4 14:51:39 2025] cgroup_file_write+0xa5/0x260
[Thu Sep 4 14:51:39 2025] kernfs_fop_write_iter+0x13d/0x1e0
[Thu Sep 4 14:51:39 2025] vfs_write+0x310/0x530
[Thu Sep 4 14:51:39 2025] ksys_write+0x6e/0xf0
[Thu Sep 4 14:51:39 2025] do_syscall_64+0x77/0x390
[Thu Sep 4 14:51:39 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
This patchset defers the flush_workqueue() call until the task returns to
userspace by using task_work, as originally proposed by Tejun [1], so that
the flush happens after cgroup_mutex is dropped. That way the operation
stays synchronous for the writer without blocking anyone else.
[1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883
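
To illustrate the ordering this cover letter describes, here is a minimal
userspace sketch. It is a toy model and not kernel code: big_lock stands
in for cgroup_mutex, slow_flush() for flush_workqueue(cpuset_migrate_mm_wq),
and return_to_user() for the point where pending task_work runs. All names
in it are hypothetical and none of them come from the patches themselves.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for cgroup_mutex (assumption: one global lock is enough here). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static bool big_lock_held;	/* tracked only so the model can check itself */

/* Stand-in for a single pending task_work callback of the current task. */
typedef void (*deferred_fn)(void);
static deferred_fn pending_work;

static int flushes_done;

/* Stand-in for the potentially slow flush_workqueue() call. */
static void slow_flush(void)
{
	assert(!big_lock_held);	/* the point of the exercise */
	flushes_done++;
}

/* Attach path: under the lock, only record that a flush is needed. */
static void attach_under_lock(void)
{
	pthread_mutex_lock(&big_lock);
	big_lock_held = true;
	pending_work = slow_flush;	/* defer instead of flushing here */
	big_lock_held = false;
	pthread_mutex_unlock(&big_lock);
}

/* Models the return-to-userspace hook that runs the deferred callback. */
static void return_to_user(void)
{
	deferred_fn fn = pending_work;

	pending_work = NULL;
	if (fn)
		fn();
}

/*
 * Models one write to cgroup.procs: the flush still completes before the
 * writer gets control back, but after the lock has been released.
 * Returns the total number of flushes performed so far.
 */
int model_procs_write(void)
{
	attach_under_lock();
	return_to_user();
	return flushes_done;
}
```

Each call to model_procs_write() performs exactly one flush, and the
assertion in slow_flush() would trip if the flush ever ran with the lock
held. In the kernel, the deferral would go through the task_work facility
(task_work_add()) rather than a bare function pointer.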

#!/bin/bash

sudo mkdir -p /sys/fs/cgroup/test
sudo mkdir -p /sys/fs/cgroup/test1
sudo mkdir -p /sys/fs/cgroup/test2

echo 0 > /sys/fs/cgroup/test1/cpuset.mems
echo 1 > /sys/fs/cgroup/test2/cpuset.mems

for i in {1..10}; do
    (
        pid=$BASHPID
        while true; do
            echo "Add $pid to test1"
            echo "$pid" | sudo tee /sys/fs/cgroup/test1/cgroup.procs >/dev/null
            sleep 5
            echo "Add $pid to test2"
            echo "$pid" | sudo tee /sys/fs/cgroup/test2/cgroup.procs >/dev/null
        done
    ) &
done

echo 0 > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
stress --vm 100 --vm-bytes 2048M --vm-keep &
sleep 30
echo "begin change cpuset.mems"
echo 1 > /sys/fs/cgroup/test/cpuset.mems

Chuyi Zhou (3):
  cpuset: Don't always flush cpuset_migrate_mm_wq in
    cpuset_write_resmask
  cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_work
  cgroup: Remove unused cgroup_subsys::post_attach

 include/linux/cgroup-defs.h |  1 -
 kernel/cgroup/cgroup.c      |  4 ----
 kernel/cgroup/cpuset.c      | 30 +++++++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 10 deletions(-)

--
2.20.1