Message-ID: <20250926-vfs-writeback-dc8e63496609@brauner>
Date: Fri, 26 Sep 2025 16:19:05 +0200
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: [GIT PULL 11/12 for v6.18] writeback
Hey Linus,
/* Summary */
This contains work addressing lockups reported by users when a systemd
unit that has read lots of files from a filesystem mounted with the
lazytime mount option exits.
With the lazytime mount option enabled we can end up switching many
dirty inodes to the parent cgroup on cgroup exit. The numbers observed
in practice when the systemd slice of a large cron job exits can easily
reach hundreds of thousands or even millions of inodes.
However, the logic in inode_do_switch_wbs() which sorts the inode into
the appropriate place in the b_dirty list of the target wb has linear
complexity in the number of dirty inodes, so the overall time complexity
of switching all the inodes is quadratic. This leaves workers pegged for
hours, consuming 100% of the CPU while switching inodes to the parent wb.
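
To illustrate the complexity problem, here is a simplified userspace
sketch with made-up names (not the code in fs/fs-writeback.c): keeping
the target list sorted by dirtied_when means every switched inode pays a
linear scan, so moving N inodes costs O(N^2) comparisons overall.

    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified stand-in for an inode on a wb's b_dirty list. */
    struct fake_inode {
        unsigned long dirtied_when;
        struct fake_inode *next;
    };

    /*
     * Insert @inode into @list, keeping it sorted by dirtied_when.
     * The scan is O(list length), so switching N inodes one by one
     * costs 1 + 2 + ... + N steps, i.e. O(N^2) in total.
     */
    static void insert_sorted(struct fake_inode **list,
                              struct fake_inode *inode)
    {
        struct fake_inode **pos = list;

        while (*pos && (*pos)->dirtied_when <= inode->dirtied_when)
            pos = &(*pos)->next;

        inode->next = *pos;
        *pos = inode;
    }

    int main(void)
    {
        struct fake_inode *list = NULL;

        for (unsigned long t = 5; t > 0; t--) {
            struct fake_inode *in = malloc(sizeof(*in));

            in->dirtied_when = t;
            insert_sorted(&list, in);
        }
        for (struct fake_inode *in = list; in; in = in->next)
            printf("dirtied_when=%lu\n", in->dirtied_when);
        return 0;
    }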
Simple reproducer of the issue:
FILES=10000
# Filesystem mounted with the lazytime mount option
MNT=/mnt/

echo "Creating files and switching timestamps"
for (( j = 0; j < 50; j++ )); do
    mkdir $MNT/dir$j
    for (( i = 0; i < $FILES; i++ )); do
        echo "foo" >$MNT/dir$j/file$i
    done
    touch -a -t 202501010000 $MNT/dir$j/file*
done
wait

echo "Syncing and flushing"
sync
echo 3 >/proc/sys/vm/drop_caches

echo "Reading all files from a cgroup"
mkdir /sys/fs/cgroup/unified/mycg1 || exit
echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
for (( j = 0; j < 50; j++ )); do
    cat $MNT/dir$j/file* >/dev/null &
done
wait

echo "Switching wbs"
# Now rmdir the cgroup after the script exits
This can be solved by:
* Avoiding contention on the wb->list_lock when switching inodes by
  running a single work item per wb and managing a queue of items
  switching to the wb (a rough userspace sketch of this scheme follows
  the list below).
* Allowing rescheduling when switching inodes over to a different cgroup
  to avoid softlockups.
* Maintaining b_dirty list ordering instead of sorting it.
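
As a rough illustration of the first point (a userspace sketch with
invented names, not the actual patch): switchers add items to a per-wb
queue under a small lock and only kick a worker if none is running; the
single worker then drains the whole queue in batches, so contention on
the queue lock stays short.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct switch_item {
        int inode_id;               /* stand-in for the inode to switch */
        struct switch_item *next;
    };

    static struct switch_item *queue;   /* per-wb pending-switch list */
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static int work_scheduled;          /* at most one drain worker runs */

    static void process_switch(struct switch_item *item)
    {
        printf("switching inode %d\n", item->inode_id);
        free(item);
    }

    /* The single "work item": repeatedly drain everything queued so far. */
    static void *switch_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            struct switch_item *batch;

            pthread_mutex_lock(&queue_lock);
            batch = queue;
            queue = NULL;
            if (!batch) {
                work_scheduled = 0;
                pthread_mutex_unlock(&queue_lock);
                return NULL;
            }
            pthread_mutex_unlock(&queue_lock);

            /*
             * The batch is processed outside the queue lock; a real
             * implementation could also reschedule between inodes here
             * to avoid softlockups on huge batches.
             */
            while (batch) {
                struct switch_item *next = batch->next;

                process_switch(batch);
                batch = next;
            }
        }
    }

    /* Queue one switch request; start a worker only if none is running. */
    static void queue_switch(int inode_id)
    {
        struct switch_item *item = malloc(sizeof(*item));
        int need_worker;
        pthread_t t;

        item->inode_id = inode_id;
        pthread_mutex_lock(&queue_lock);
        item->next = queue;
        queue = item;
        need_worker = !work_scheduled;
        work_scheduled = 1;
        pthread_mutex_unlock(&queue_lock);

        if (need_worker) {
            pthread_create(&t, NULL, switch_worker, NULL);
            pthread_detach(&t);
        }
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            queue_switch(i);
        sleep(1);   /* crude: let the detached worker drain the queue */
        return 0;
    }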
/* Testing */
gcc (Debian 14.2.0-19) 14.2.0
Debian clang version 19.1.7 (3+b1)
No build failures or warnings were observed.
/* Conflicts */
Merge conflicts with mainline
=============================
No known conflicts.
Merge conflicts with other trees
================================
No known conflicts.
The following changes since commit 8f5ae30d69d7543eee0d70083daf4de8fe15d585:
Linux 6.17-rc1 (2025-08-10 19:41:16 +0300)
are available in the Git repository at:
git@...olite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.18-rc1.writeback
for you to fetch changes up to 9426414f0d42f824892ecd4dccfebf8987084a41:
Merge patch series "writeback: Avoid lockups when switching inodes" (2025-09-19 13:11:06 +0200)
Please consider pulling these changes from the signed vfs-6.18-rc1.writeback tag.
Thanks!
Christian
----------------------------------------------------------------
vfs-6.18-rc1.writeback
----------------------------------------------------------------
Christian Brauner (1):
Merge patch series "writeback: Avoid lockups when switching inodes"
Jan Kara (4):
writeback: Avoid contention on wb->list_lock when switching inodes
writeback: Avoid softlockup when switching many inodes
writeback: Avoid excessively long inode switching times
writeback: Add tracepoint to track pending inode switches
fs/fs-writeback.c | 133 +++++++++++++++++++++++++--------------
include/linux/backing-dev-defs.h | 4 ++
include/linux/writeback.h | 2 +
include/trace/events/writeback.h | 29 +++++++++
mm/backing-dev.c | 5 ++
5 files changed, 126 insertions(+), 47 deletions(-)