Message-Id: <20220131105233.776255625@linuxfoundation.org>
Date: Mon, 31 Jan 2022 11:54:29 +0100
From: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
To: linux-kernel@...r.kernel.org
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
stable@...r.kernel.org, Qu Wenruo <wqu@...e.com>,
Filipe Manana <fdmanana@...e.com>,
David Sterba <dsterba@...e.com>
Subject: [PATCH 5.16 006/200] btrfs: fix deadlock when reserving space during defrag
From: Filipe Manana <fdmanana@...e.com>
commit 0cb5950f3f3b51a4e8657d106f897f2b913e0586 upstream.
When defragging we can end up collecting a range for defrag that already
has pages under delalloc (dirty), as long as the respective extent map
for their range is not mapped to a hole or a prealloc extent, and the
extent map is not from an old generation.
Most of the time that is harmless, at least from a functional
perspective; however, it can result in a deadlock:
1) At defrag_collect_targets() we find an extent map that meets all
requirements but there's delalloc for the range it covers, and we add
its range to the list of ranges to defrag;
2) The defrag_collect_targets() function is called at defrag_one_range(),
after it locked a range that overlaps the range of the extent map;
3) At defrag_one_range(), while the range is still locked, we call
defrag_one_locked_target() for the range associated to the extent
map we collected at step 1);
4) Then finally at defrag_one_locked_target() we do a call to
btrfs_delalloc_reserve_space(), which will reserve data and metadata
space. If the space reservations cannot be satisfied right away, the
flusher might kick in, start flushing delalloc and wait for the
respective ordered extents to complete. If this happens we deadlock,
because both flushing delalloc and finishing an ordered extent require
locking the range in the inode's io tree, which was already locked at
defrag_collect_targets(). (A simplified user-space model of this lock
pattern is sketched below.)
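To make the lock inversion concrete, here is a small user-space model.
This is not btrfs code: the function names flush_delalloc_for_range()
and reserve_space() are invented stand-ins for the kernel paths named
in steps 3) and 4), and a pthread mutex stands in for the locked range
in the inode's io tree.

/*
 * Illustrative model only: one path takes the "range lock" and then,
 * while still holding it, calls into a path that needs the same lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t range_lock;

/* Stand-in for flushing delalloc / finishing an ordered extent: it must
 * lock the same range before it can make progress. */
static void flush_delalloc_for_range(void)
{
        int ret = pthread_mutex_lock(&range_lock);

        if (ret) {
                /* With an error-checking mutex this reports EDEADLK;
                 * with the real extent range lock we would just hang. */
                printf("would deadlock here: %s\n", strerror(ret));
                return;
        }
        pthread_mutex_unlock(&range_lock);
}

/* Stand-in for btrfs_delalloc_reserve_space() under space pressure: it
 * ends up waiting on writeback, which needs the lock we already hold. */
static void reserve_space(void)
{
        flush_delalloc_for_range();
}

int main(void)
{
        pthread_mutexattr_t attr;

        /* Error-checking mutex so the self-deadlock is reported, not hung. */
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        pthread_mutex_init(&range_lock, &attr);

        pthread_mutex_lock(&range_lock);   /* step 2: range locked          */
        reserve_space();                   /* steps 3-4: needs the same lock */
        pthread_mutex_unlock(&range_lock);
        return 0;
}

Built with "cc -pthread", the second lock attempt is exactly the
re-entry the fix below avoids by never collecting ranges that already
have delalloc.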
So fix this by skipping extent maps for which there's already delalloc.
Fixes: eb793cf857828d ("btrfs: defrag: introduce helper to collect target file extents")
CC: stable@...r.kernel.org # 5.16
Reviewed-by: Qu Wenruo <wqu@...e.com>
Signed-off-by: Filipe Manana <fdmanana@...e.com>
Signed-off-by: David Sterba <dsterba@...e.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
---
fs/btrfs/ioctl.c | 31 ++++++++++++++++++++++++++++++-
1 file changed, 30 insertions(+), 1 deletion(-)
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1188,6 +1188,35 @@ static int defrag_collect_targets(struct
 			goto next;
 
 		/*
+		 * Our start offset might be in the middle of an existing extent
+		 * map, so take that into account.
+		 */
+		range_len = em->len - (cur - em->start);
+		/*
+		 * If this range of the extent map is already flagged for delalloc,
+		 * skip it, because:
+		 *
+		 * 1) We could deadlock later, when trying to reserve space for
+		 *    delalloc, because in case we can't immediately reserve space
+		 *    the flusher can start delalloc and wait for the respective
+		 *    ordered extents to complete. The deadlock would happen
+		 *    because we do the space reservation while holding the range
+		 *    locked, and starting writeback, or finishing an ordered
+		 *    extent, requires locking the range;
+		 *
+		 * 2) If there's delalloc there, it means there's dirty pages for
+		 *    which writeback has not started yet (we clean the delalloc
+		 *    flag when starting writeback and after creating an ordered
+		 *    extent). If we mark pages in an adjacent range for defrag,
+		 *    then we will have a larger contiguous range for delalloc,
+		 *    very likely resulting in a larger extent after writeback is
+		 *    triggered (except in a case of free space fragmentation).
+		 */
+		if (test_range_bit(&inode->io_tree, cur, cur + range_len - 1,
+				   EXTENT_DELALLOC, 0, NULL))
+			goto next;
+
+		/*
 		 * For do_compress case, we want to compress all valid file
 		 * extents, thus no @extent_thresh or mergeable check.
 		 */
@@ -1195,7 +1224,7 @@ static int defrag_collect_targets(struct
 			goto add;
 
 		/* Skip too large extent */
-		if (em->len >= extent_thresh)
+		if (range_len >= extent_thresh)
 			goto next;
 
 		next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
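
For reference, a minimal user-space sketch of exercising the defrag
path that defrag_collect_targets() serves, via the
BTRFS_IOC_DEFRAG_RANGE ioctl from <linux/btrfs.h>. This is not part of
the patch; the file path and the 256K extent_thresh are arbitrary
example values.

/* Trigger a range defrag on a file that lives on btrfs. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
        struct btrfs_ioctl_defrag_range_args args;
        int fd, ret;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file on btrfs>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&args, 0, sizeof(args));
        args.start = 0;
        args.len = (__u64)-1;              /* whole file                     */
        args.extent_thresh = 256 * 1024;   /* skip extents >= 256K (example) */

        ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args);
        if (ret < 0)
                perror("BTRFS_IOC_DEFRAG_RANGE");

        close(fd);
        return ret < 0 ? 1 : 0;
}

Extents at or above extent_thresh are skipped by the ">= extent_thresh"
check changed in the second hunk above.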