Message-ID: <20250926-vfs-writeback-dc8e63496609@brauner>
Date: Fri, 26 Sep 2025 16:19:05 +0200
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [GIT PULL 11/12 for v6.18] writeback
Hey Linus,
/* Summary */
This contains work addressing lockups reported by users when a systemd
unit that has read lots of files from a filesystem mounted with the
lazytime mount option exits.
With the lazytime mount option enabled we can end up switching many
dirty inodes to the parent cgroup on cgroup exit. The numbers observed
in practice when the systemd slice of a large cron job exits can easily
reach hundreds of thousands or even millions of inodes.
However, the logic in inode_do_switch_wbs() which sorts the inode into
the appropriate place in the b_dirty list of the target wb has linear
complexity in the number of dirty inodes, so the overall time complexity
of switching all the inodes is quadratic. This leaves workers pegged for
hours, consuming 100% of the CPU while switching inodes to the parent wb.
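
To illustrate the complexity problem, here is a simplified userspace
sketch with made-up names (not the code in fs/fs-writeback.c): keeping
the target list sorted by dirtied_when means every switched inode pays a
linear scan, so moving N inodes costs O(N^2) comparisons overall.

    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-in for an inode on a wb's b_dirty list. */
    struct fake_inode {
        unsigned long dirtied_when;
        struct fake_inode *next;
    };

    /*
     * Insert @inode into @list, keeping it sorted by dirtied_when.
     * The scan is O(list length), so switching N inodes one by one
     * costs 1 + 2 + ... + N steps, i.e. O(N^2) in total.
     */
    static void insert_sorted(struct fake_inode **list,
                              struct fake_inode *inode)
    {
        struct fake_inode **pos = list;

        while (*pos && (*pos)->dirtied_when <= inode->dirtied_when)
            pos = &(*pos)->next;

        inode->next = *pos;
        *pos = inode;
    }

    int main(void)
    {
        struct fake_inode *list = NULL;

        for (unsigned long t = 5; t > 0; t--) {
            struct fake_inode *in = malloc(sizeof(*in));

            in->dirtied_when = t;
            insert_sorted(&list, in);
        }
        for (struct fake_inode *in = list; in; in = in->next)
            printf("dirtied_when=%lu\n", in->dirtied_when);
        return 0;
    }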
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with the lazytime mount option
MNT=/mnt/

echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j++ )); do
    mkdir $MNT/dir$j
    for (( i = 0; i < $FILES; i++ )); do
        echo "foo" >$MNT/dir$j/file$i
    done
    touch -a -t 202501010000 $MNT/dir$j/file*
done
wait

echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches

echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j++ )); do
    cat $MNT/dir$j/file* >/dev/null &
done
wait

echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
* Avoiding contention on the wb->list_lock when switching inodes by
  running a single work item per wb and managing a queue of items
  switching to the wb (a rough userspace sketch of this scheme follows
  the list below).
* Allowing rescheduling when switching inodes over to a different cgroup
  to avoid softlockups.
* Maintaining b_dirty list ordering instead of sorting it.
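
As a rough illustration of the first point (a userspace sketch with
invented names, not the actual patch): switchers add items to a per-wb
queue under a small lock and only kick a worker if none is running; the
single worker then drains the whole queue in batches, so contention on
the queue lock stays short.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct switch_item {
        int inode_id;               /* stand-in for the inode to switch */
        struct switch_item *next;
    };

    static struct switch_item *queue;   /* per-wb pending-switch list */
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static int work_scheduled;          /* at most one drain worker runs */

    static void process_switch(struct switch_item *item)
    {
        printf("switching inode %d\n", item->inode_id);
        free(item);
    }

    /* The single "work item": repeatedly drain everything queued so far. */
    static void *switch_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            struct switch_item *batch;

            pthread_mutex_lock(&queue_lock);
            batch = queue;
            queue = NULL;
            if (!batch) {
                work_scheduled = 0;
                pthread_mutex_unlock(&queue_lock);
                return NULL;
            }
            pthread_mutex_unlock(&queue_lock);

            /*
             * The batch is processed outside the queue lock; a real
             * implementation could also reschedule between inodes here
             * to avoid softlockups on huge batches.
             */
            while (batch) {
                struct switch_item *next = batch->next;

                process_switch(batch);
                batch = next;
            }
        }
    }

    /* Queue one switch request; start a worker only if none is running. */
    static void queue_switch(int inode_id)
    {
        struct switch_item *item = malloc(sizeof(*item));
        int need_worker;
        pthread_t t;

        item->inode_id = inode_id;
        pthread_mutex_lock(&queue_lock);
        item->next = queue;
        queue = item;
        need_worker = !work_scheduled;
        work_scheduled = 1;
        pthread_mutex_unlock(&queue_lock);

        if (need_worker) {
            pthread_create(&t, NULL, switch_worker, NULL);
            pthread_detach(&t);
        }
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            queue_switch(i);
        sleep(1);   /* crude: let the detached worker drain the queue */
        return 0;
    }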
/* Testing */
gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3+b1)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
No known conflicts.
The following changes since commit 8f5ae30d69d7543eee0d70083daf4de8fe15d585:
Linux 6.17-rc1 (2025-08-10 19:41:16 +0300)
are available in the Git repository at:
git@...olite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.18-rc1.writeback
for you to fetch changes up to 9426414f0d42f824892ecd4dccfebf8987084a41:
Merge patch series "writeback: Avoid lockups when switching inodes" (2025-09-19 13:11:06 +0200)
Please consider pulling these changes from the signed vfs-6.18-rc1.writeback tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.18-rc1.writeback
----------------------------------------------------------------
Christian Brauner (1):
Merge patch series "writeback: Avoid lockups when switching inodes"
Jan Kara (4):
writeback: Avoid contention on wb->list_lock when switching inodes
writeback: Avoid softlockup when switching many inodes
writeback: Avoid excessively long inode switching times
writeback: Add tracepoint to track pending inode switches
fs/fs-writeback.c | 133 +++++++++++++++++++++++++--------------
include/linux/backing-dev-defs.h | 4 ++
include/linux/writeback.h | 2 +
include/trace/events/writeback.h | 29 +++++++++
mm/backing-dev.c | 5 ++
5 files changed, 126 insertions(+), 47 deletions(-)