Message-ID: <CABXGCsMWYaxZry+VDCgP=UM7c9do+JYSKdHAbCcx5=xEwXjE6Q@mail.gmail.com>
Date: Wed, 26 Jun 2024 19:16:52 +0500
From: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
To: Filipe Manana <fdmanana@...nel.org>
Cc: Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Linux regressions mailing list <regressions@...ts.linux.dev>, Btrfs BTRFS <linux-btrfs@...r.kernel.org>,
dsterba@...e.com, josef@...icpanda.com
Subject: Re: 6.10/regression/bisected - after f1d97e769152 I spotted increased
execution time of the kswapd0 process and symptoms as if there is not enough memory
On Wed, Jun 26, 2024 at 3:49 PM Filipe Manana <fdmanana@...nel.org> wrote:
>
> On Tue, Jun 25, 2024 at 10:04 PM Mikhail Gavrilov
> <mikhail.v.gavrilov@...il.com> wrote:
> >
> > Hi,
> > after f1d97e769152 I spotted increased execution time of the kswapd0
> > process and symptoms as if there is not enough memory.
> > Very often I see that kswapd0 consumes 100% CPU [1].
> > Before f1d97e769152 after an hour kswapd0 is working ~3:51 and after
> > three hours ~10:13 time.
> > After f1d97e769152 kswapd0 time increased to ~25:48 after the first
> > hour and three hours it hit 71:01 time.
> > So execution time has increased by 6-7 times.
> >
> > f1d97e76915285013037c487d9513ab763005286 is the first bad commit
> > commit f1d97e76915285013037c487d9513ab763005286 (HEAD)
> > Author: Filipe Manana <fdmanana@...e.com>
> > Date: Fri Mar 22 18:02:59 2024 +0000
> >
> > btrfs: add a global per cpu counter to track number of used extent maps
> >
> > Add a per cpu counter that tracks the total number of extent maps that are
> > in extent trees of inodes that belong to fs trees. This is going to be
> > used in an upcoming change that adds a shrinker for extent maps. Only
> > extent maps for fs trees are considered, because for special trees such as
> > the data relocation tree we don't want to evict their extent maps which
> > are critical for the relocation to work, and since those are limited, it's
> > not a concern to have them in memory during the relocation of a block
> > group. Another case are extent maps for free space cache inodes, which
> > must always remain in memory, but those are limited (there's only one per
> > free space cache inode, which means one per block group).
> >
> > Reviewed-by: Josef Bacik <josef@...icpanda.com>
> > Signed-off-by: Filipe Manana <fdmanana@...e.com>
> > Reviewed-by: David Sterba <dsterba@...e.com>
> > Signed-off-by: David Sterba <dsterba@...e.com>
> >
> > fs/btrfs/disk-io.c | 9 +++++++++
> > fs/btrfs/extent_map.c | 17 +++++++++++++++++
> > fs/btrfs/fs.h | 2 ++
> > 3 files changed, 28 insertions(+)
> >
> > Unfortunately I can't check the revert commit f1d97e769152 because of conflicts.
>
> Yes, because there are follow up commits that depend on it.
>
> I seriously doubt that this is correctly bisected, because that commit
> only adds a counter for tracking the number of extent maps.
> It's using a per cpu counter and I can't think of anything more
> efficient than that.
>
> The commit that adds the extent map shrinker, which is the next commit
> (956a17d9d050761e34ae6f2624e9c1ce456de204), that can
> explain what you are observing.
>
> Now the one you bisected doesn't make sense, not just because it's
> just a counter update but also because you are
> only seeing the kswapd0 slowdown, which is what triggers the shrinker.
git bisect start
# status: waiting for both good and bad commits
# good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
git bisect good a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
# bad: [1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0] Linux 6.10-rc1
git bisect bad 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0
# bad: [db5d28c0bfe566908719bec8e25443aabecbb802] Merge tag
'drm-next-2024-05-15' of https://gitlab.freedesktop.org/drm/kernel
git bisect bad db5d28c0bfe566908719bec8e25443aabecbb802
6.9.0-01-db5d28c0bfe566908719bec8e25443aabecbb802
up 1:01
root 269 17.4 0.0 0 0 ? R 16:00 10:36 [kswapd0]
up 2:00
root 269 34.5 0.0 0 0 ? S 16:00 41:36 [kswapd0]
up 3:00
root 269 40.2 0.0 0 0 ? R 16:00 72:47 [kswapd0]
BAD
# bad: [b850dc206a57ae272c639e31ac202ec0c2f46960] Merge tag
'firewire-updates-6.10' of
git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
git bisect bad b850dc206a57ae272c639e31ac202ec0c2f46960
6.9.0-02-b850dc206a57ae272c639e31ac202ec0c2f46960
up 1:00
root 269 25.4 0.0 0 0 ? R 19:09 15:28 [kswapd0]
up 1:18
OOM KILLER
up 2:00
root 269 40.2 0.0 0 0 ? R 19:09 48:18 [kswapd0]
up 3:00
root 269 43.0 0.0 0 0 ? S 19:09 77:38 [kswapd0]
up 3:59
root 269 46.4 0.0 0 0 ? S 19:09 111:09 [kswapd0]
BAD
# good: [59729c8a76544d9d7651287a5d28c5bf7fc9fccc] Merge tag
'tag-chrome-platform-for-v6.10' of
git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
git bisect good 59729c8a76544d9d7651287a5d28c5bf7fc9fccc
6.9.0-03-59729c8a76544d9d7651287a5d28c5bf7fc9fccc+
up 1:00
root 269 9.3 0.0 0 0 ? S 10:08 5:38 [kswapd0]
up 2:02
root 269 8.8 0.0 0 0 ? S 10:08 10:49 [kswapd0]
up 3:00
root 269 8.7 0.0 0 0 ? S 10:08 15:42 [kswapd0]
up 3:56
root 269 8.1 0.0 0 0 ? S 10:08 19:22 [kswapd0]
up 5:00
root 269 7.7 0.0 0 0 ? S 10:08 23:16 [kswapd0]
up 6:00
root 269 7.5 0.0 0 0 ? S 10:08 27:12 [kswapd0]
GOOD
# good: [101b7a97143a018b38b1f7516920a7d7d23d1745] Merge tag
'acpi-6.10-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 101b7a97143a018b38b1f7516920a7d7d23d1745
6.9.0-04-101b7a97143a018b38b1f7516920a7d7d23d1745
up 1:00
root 269 8.1 0.0 0 0 ? S 17:17 4:53 [kswapd0]
up 2:00
root 269 6.9 0.0 0 0 ? S 17:17 8:19 [kswapd0]
up 3:19
root 269 6.9 0.0 0 0 ? S 17:17 13:57 [kswapd0]
up 4:01
root 269 7.9 0.0 0 0 ? S 17:17 19:08 [kswapd0]
up 5:02
root 269 8.6 0.0 0 0 ? R 17:17 26:16 [kswapd0]
up 6:00
root 269 8.3 0.0 0 0 ? S 17:17 29:59 [kswapd0]
GOOD
# good: [47e9bff7fc042b28eb4cf375f0cf249ab708fdfa] Merge tag
'erofs-for-6.10-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
git bisect good 47e9bff7fc042b28eb4cf375f0cf249ab708fdfa
6.9.0-05-47e9bff7fc042b28eb4cf375f0cf249ab708fdfa
up 1:00
root 269 8.0 0.0 0 0 ? S 14:00 4:49 [kswapd0]
up 3:00
root 269 7.2 0.0 0 0 ? S 14:00 13:00 [kswapd0]
up 4:00
root 269 7.3 0.0 0 0 ? S 14:00 17:36 [kswapd0]
up 5:08
root 269 6.5 0.0 0 0 ? R 14:00 20:12 [kswapd0]
up 6:00
root 269 6.1 0.0 0 0 ? S 14:00 22:14 [kswapd0]
GOOD
# bad: [b2665fe61d8a51ef70b27e1a830635a72dcc6ad8] Merge tag
'ata-6.10-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
git bisect bad b2665fe61d8a51ef70b27e1a830635a72dcc6ad8
6.9.0-06-b2665fe61d8a51ef70b27e1a830635a72dcc6ad8+
up 1:00
root 269 23.4 0.0 0 0 ? R 20:31 14:06 [kswapd0]
up 2:00
root 269 22.1 0.0 0 0 ? S 20:31 26:36 [kswapd0]
up 3:00
root 269 24.6 0.0 0 0 ? R 20:31 44:21 [kswapd0]
up 4:00
root 269 26.6 0.0 0 0 ? S Jun22 63:57 [kswapd0]
up 5:07
root 269 27.8 0.0 0 0 ? S Jun22 85:35 [kswapd0]
BAD
# bad: [aa5ccf29173acfaa8aa2fdd1421aa6aca1a50cf2] btrfs: handle errors
in btrfs_reloc_clone_csums properly
git bisect bad aa5ccf29173acfaa8aa2fdd1421aa6aca1a50cf2
6.9.0-rc7-07-aa5ccf29173acfaa8aa2fdd1421aa6aca1a50cf2
up 1:00
root 268 24.7 0.0 0 0 ? S Jun23 14:57 [kswapd0]
up 2:00
root 268 45.1 0.0 0 0 ? S Jun23 54:13 [kswapd0]
BAD
# good: [d3fbb00f5e21c6dfaa6e820a21df0c9a3455a028] btrfs: embed
data_ref and tree_ref in btrfs_delayed_ref_node
git bisect good d3fbb00f5e21c6dfaa6e820a21df0c9a3455a028
6.9.0-rc7-08-d3fbb00f5e21c6dfaa6e820a21df0c9a3455a028
up 1:00
root 268 6.3 0.0 0 0 ? S 01:42 3:51 [kswapd0]
up 1:00
root 268 8.1 0.0 0 0 ? S 10:10 4:53 [kswapd0]
up 2:02
root 268 8.3 0.0 0 0 ? S 10:10 10:13 [kswapd0]
up 3:00
root 268 7.6 0.0 0 0 ? S 10:10 13:46 [kswapd0]
up 4:00
root 268 9.1 0.0 0 0 ? S 10:10 21:56 [kswapd0]
GOOD
# good: [5fa8a6baff817c1b427aa7a8bfc1482043be6d58] btrfs: pass the
extent map tree's inode to try_merge_map()
git bisect good 5fa8a6baff817c1b427aa7a8bfc1482043be6d58
6.9.0-rc7-09-5fa8a6baff817c1b427aa7a8bfc1482043be6d58
up 1:10
root 268 5.8 0.0 0 0 ? S 14:15 4:09 [kswapd0]
up 2:09
root 268 5.3 0.0 0 0 ? S 14:15 6:52 [kswapd0]
up 3:09
root 268 4.6 0.0 0 0 ? S 14:15 8:47 [kswapd0]
up 4:04
root 268 4.2 0.0 0 0 ? S 14:15 10:24 [kswapd0]
up 5:00
root 268 3.8 0.0 0 0 ? R 14:15 11:35 [kswapd0]
up 6:06
root 268 3.9 0.0 0 0 ? S 14:15 14:24 [kswapd0]
up 7:03
root 268 3.8 0.0 0 0 ? S 14:15 16:26 [kswapd0]
GOOD
# bad: [9a7b68d32afc4e92909c21e166ad993801236be3] btrfs: report
filemap_fdata<write|wait>_range() error
git bisect bad 9a7b68d32afc4e92909c21e166ad993801236be3
6.9.0-rc7-10-9a7b68d32afc4e92909c21e166ad993801236be3
up 1:00
root 268 32.5 0.0 0 0 ? R 21:35 19:34 [kswapd0]
up 2:00
root 268 46.1 0.0 0 0 ? R 21:35 55:24 [kswapd0]
BAD
# bad: [85d288309ab5463140a2d00b3827262fb14e7db4] btrfs: use
btrfs_get_fs_generation() at try_release_extent_mapping()
git bisect bad 85d288309ab5463140a2d00b3827262fb14e7db4
6.9.0-rc7-11-85d288309ab5463140a2d00b3827262fb14e7db4
up 1:00
root 268 38.0 0.0 0 0 ? S 00:36 22:50 [kswapd0]
up 2:01
root 268 32.7 0.0 0 0 ? R 00:36 39:38 [kswapd0]
up 3:00
root 268 32.1 0.0 0 0 ? S 00:36 58:01 [kswapd0]
BAD
# bad: [65bb9fb00b7012a78b2f5d1cd042bf098900c5d3] btrfs: update
comment for btrfs_set_inode_full_sync() about locking
git bisect bad 65bb9fb00b7012a78b2f5d1cd042bf098900c5d3
6.9.0-rc7-12-65bb9fb00b7012a78b2f5d1cd042bf098900c5d3
up 1:06
root 268 17.3 0.0 0 0 ? S 10:14 11:34 [kswapd0]
up 1:22
OOM KILLER
up 1:32
OOM KILLER
up 2:01
root 268 37.2 0.0 0 0 ? R 10:14 45:07 [kswapd0]
up 3:01
root 268 33.1 0.0 0 0 ? S 10:14 60:12 [kswapd0]
BAD
# bad: [956a17d9d050761e34ae6f2624e9c1ce456de204] btrfs: add a
shrinker for extent maps
git bisect bad 956a17d9d050761e34ae6f2624e9c1ce456de204
6.9.0-rc7-13-956a17d9d050761e34ae6f2624e9c1ce456de204
up 1:01
root 268 42.1 0.0 0 0 ? R 13:20 25:48 [kswapd0]
up 1:30
OOM KILLER
up 2:01
root 268 40.7 0.0 0 0 ? R 13:20 49:27 [kswapd0]
up 2:34
root 268 46.0 0.0 0 0 ? S 13:20 71:01 [kswapd0]
BAD
# bad: [f1d97e76915285013037c487d9513ab763005286] btrfs: add a global
per cpu counter to track number of used extent maps
git bisect bad f1d97e76915285013037c487d9513ab763005286
6.9.0-rc7-14-f1d97e76915285013037c487d9513ab763005286
up 1:06
root 268 15.6 0.0 0 0 ? S 16:15 10:27 [kswapd0]
up 2:00
root 268 12.0 0.0 0 0 ? S 16:15 14:26 [kswapd0]
up 3:00
root 268 9.8 0.0 0 0 ? S 16:15 17:48 [kswapd0]
GOOD!!! But I marked it as bad.
Yes, my mistake: I gave the wrong answer on the last bisect step.
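One way to make the good/bad call less error-prone is to normalize the kswapd0 CPU time by uptime instead of comparing raw numbers by eye. Below is a minimal sketch in Python, fed with the last sample of four steps from the log above. The helper names are made up for illustration, and the 10 min/h cut-off is an assumption chosen only because it separates the samples in this log.

```python
def uptime_hours(up: str) -> float:
    """Convert an uptime stamp 'H:MM' (hours:minutes) to hours."""
    hours, minutes = up.split(":")
    return int(hours) + int(minutes) / 60


def cpu_minutes(time_field: str) -> float:
    """Convert the ps TIME column 'MMM:SS' (minutes:seconds) to minutes."""
    minutes, seconds = time_field.split(":")
    return int(minutes) + int(seconds) / 60


def kswapd_rate(up: str, time_field: str) -> float:
    """kswapd0 CPU minutes consumed per hour of uptime."""
    return cpu_minutes(time_field) / uptime_hours(up)


# Assumed cut-off in CPU minutes per hour of uptime.
# This is not a kernel or ps constant, just a value that fits this log.
THRESHOLD = 10.0


def classify(up: str, time_field: str) -> str:
    """Return 'bad' when kswapd0 burns more CPU per hour than THRESHOLD."""
    return "bad" if kswapd_rate(up, time_field) > THRESHOLD else "good"


# Last (uptime, kswapd0 TIME) sample of four bisect steps, copied from the log.
SAMPLES = {
    "5fa8a6baff81": ("7:03", "16:26"),
    "65bb9fb00b70": ("3:01", "60:12"),
    "956a17d9d050": ("2:34", "71:01"),
    "f1d97e769152": ("3:00", "17:48"),
}

if __name__ == "__main__":
    for commit, (up, time_field) in SAMPLES.items():
        rate = kswapd_rate(up, time_field)
        print(f"{commit}: {rate:5.1f} min/h -> {classify(up, time_field)}")
```

With these inputs f1d97e769152 lands on the good side and 956a17d9d050 on the bad side, which matches the correction in this message.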
The correct first bad commit is 956a17d9d050761e34ae6f2624e9c1ce456de204
Author: Filipe Manana <fdmanana@...e.com>
Date: Mon Apr 15 17:09:26 2024 +0100
btrfs: add a shrinker for extent maps
Extent maps are used either to represent existing file extent items, or to
represent new extents that are going to be written and the respective file
extent items are created when the ordered extent completes.
We currently don't have any limit for how many extent maps we can have,
neither per inode nor globally. Most of the time this is not too noticeable
because extent maps are removed in the following situations:
1) When evicting an inode;
2) When releasing folios (pages) through the btrfs_release_folio() address
space operation callback.
However we won't release extent maps in the folio range if the folio is
either dirty or under writeback or if the inode's i_size is less than
or equal to 16M (see try_release_extent_mapping()). This 16M i_size
constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs:
extent_io and extent_state optimizations"), but there's no explanation
about why we have it or why the 16M value.
This means that for buffered IO we can reach an OOM situation due to too
many extent maps if either of the following happens:
1) There's a set of tasks constantly doing IO on many files with a size
not larger than 16M, especially if they keep the files open for very
long periods, therefore preventing inode eviction.
This requires a really high number of such files, and having many non
mergeable extent maps (due to random 4K writes for example) and a
machine with very little memory;
2) There's a set of tasks constantly doing random write IO (therefore
creating many non mergeable extent maps) on files and keeping them
open for long periods of time, so inode eviction doesn't happen and
there's always a lot of dirty pages or pages under writeback,
preventing btrfs_release_folio() from releasing the respective extent
maps.
This second case was actually reported in the thread pointed by the Link
tag below, and it requires a very large file under heavy IO and a machine
with very little amount of RAM, which is probably hard to happen in
practice in a real world use case.
However when using direct IO this is not so hard to happen, because the
page cache is not used, and therefore btrfs_release_folio() is never
called. Which means extent maps are dropped only when evicting the inode,
and that means that if we have tasks that keep a file descriptor open and
keep doing IO on a very large file (or files), we can exhaust memory due
to an unbounded amount of extent maps. This is especially easy to happen
if we have a huge file with millions of small extents and their extent
maps are not mergeable (non contiguous offsets and disk locations).
This was reported in that thread with the following fio test:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS=""
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bs=4K
direct=1
numjobs=16
fallocate=none
time_based
runtime=90000
[file1]
size=300G
ioengine=libaio
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
Monitoring the btrfs_extent_map slab while running the test with:
$ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
/sys/kernel/slab/btrfs_extent_map/total_objects'
Shows the number of active and total extent maps skyrocketing to tens of
millions, and on systems with a short amount of memory it's easy and quick
to get into an OOM situation, as reported in that thread.
So to avoid this issue add a shrinker that will remove extent maps, as
long as they are not pinned, and takes proper care with any concurrent
fsync to avoid missing extents (setting the full sync flag while in the
middle of a fast fsync). This shrinker is triggered through the callbacks
nr_cached_objects and free_cached_objects of struct super_operations.
The shrinker will iterate over all roots and over all inodes of each
root, and keeps track of the last scanned root and inode, so that the
next time it runs, it starts from that root and from the next inode.
This is similar to what xfs does for its inode reclaim (implements those
callbacks, and cycles through inodes by starting from where it ended
last time).
Reviewed-by: Josef Bacik <josef@...icpanda.com>
Signed-off-by: Filipe Manana <fdmanana@...e.com>
Reviewed-by: David Sterba <dsterba@...e.com>
Signed-off-by: David Sterba <dsterba@...e.com>
fs/btrfs/extent_map.c | 160 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/extent_map.h | 1 +
fs/btrfs/fs.h | 2 ++
fs/btrfs/super.c | 17 +++++++++++++++++
4 files changed, 180 insertions(+)
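The scan order described in the commit message (walk roots and their inodes, remember the last scanned root and inode, resume after that point on the next run) can be illustrated with a small user-space toy. This is only a sketch in Python, not the kernel code: the class name, the flat sorted list of (root id, inode number) pairs, the per-call budget, and the wrap-around at the end of the list are all assumptions made for the illustration.

```python
from bisect import bisect_right


class ScanCursor:
    """Toy model of a shrinker that resumes where the previous run stopped."""

    def __init__(self):
        # (root id, inode number) of the last scanned inode.
        # (0, 0) sorts before every real entry, so the first run starts at the top.
        self.last = (0, 0)

    def scan(self, inodes, budget):
        """Visit up to `budget` entries of `inodes`, a sorted list of
        (root_id, ino) tuples, starting right after the saved position."""
        if not inodes or budget <= 0:
            return []
        # First entry strictly greater than the saved position.
        start = bisect_right(inodes, self.last)
        count = min(budget, len(inodes))
        # Wrap around to the beginning when the end of the list is reached
        # (an assumption of this sketch).
        visited = [inodes[(start + i) % len(inodes)] for i in range(count)]
        self.last = visited[-1]
        return visited


if __name__ == "__main__":
    # Hypothetical roots 5 and 256 with a few inode numbers each.
    inodes = [(5, 257), (5, 258), (256, 257), (256, 300)]
    cursor = ScanCursor()
    print(cursor.scan(inodes, 3))  # first run starts from the beginning
    print(cursor.scan(inodes, 3))  # second run resumes after the saved position
```

The point of keeping a cursor is that repeated short runs eventually cover every inode instead of rescanning the same first few each time.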
> The shrinker itself can be improved, there's one place where I know it
> might loop too much, and I'll improve that.
Oh, great!
Can I test this patch when it is ready?
--
Best Regards,
Mike Gavrilov.