[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <94178c8f-c9d6-4c3e-9d1d-6d465d379e0f@gmx.com>
Date: Mon, 2 Sep 2024 18:47:01 +0930
From: Qu Wenruo <quwenruo.btrfs@....com>
To: Luca Stefani <luca.stefani.ge1@...il.com>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
David Sterba <dsterba@...e.com>, linux-btrfs@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: Don't block system suspend during fstrim
在 2024/9/2 18:42, Qu Wenruo 写道:
>
>
> 在 2024/9/2 18:30, Luca Stefani 写道:
>>
>>
>> On Mon, 2 Sept 2024 at 10:49, Qu Wenruo <quwenruo.btrfs@....com
>> <mailto:quwenruo.btrfs@....com>> wrote:
>>
>>
>>
>> 在 2024/9/2 18:02, Luca Stefani 写道:
>> > Any update on this? It's not critical but I'd like to know if
>> it's in
>> > some part proper.
>> > Thanks, Luca.
>>
>> Sorry I didn't see your patch in the list, thus sent a different
>> fix for
>> it later:
>>
>>
>> https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t <https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t>
>>
>> >> Sometimes the system isn't able to suspend because
>> >> the task responsible for trimming the device isn't
>> >> able to finish in time.
>> >>
>> >> Since discard isn't a critical call it can be interrupted
>> >> at any time, we can simply report the amount of discarded
>> >> bytes in such cases and stop the trim.
>> >>
>> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>> <https://bugzilla.kernel.org/show_bug.cgi?id=219180>
>> >> Signed-off-by: Luca Stefani <luca.stefani.ge1@...il.com
>> <mailto:luca.stefani.ge1@...il.com>>
>> >> ---
>> >> I have no idea if that's correct, just something I implemented
>> >> looking at the same solution made in ext4 by 5229a658f645.
>> >>
>> >> The patch in itself seems to solve the issue.
>> >>
>> >> repro is as follows:
>> >> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>> >> --verbose --quiet-unsupported &
>> >> sudo ./sleepgraph.py -m mem -rtcwake 5
>> >>
>> >> [836563.289069] PM: suspend exit
>> >> [836563.909298] PM: suspend entry (s2idle)
>> >> [836563.935447] Filesystems sync: 0.026 seconds
>> >> [836563.951391] Freezing user space processes
>> >> [836583.958957] Freezing user space processes failed after 20.007
>> >> seconds (1 tasks refusing to freeze, wq_busy=0):
>> >> [836583.959582] task:fstrim state:D stack:0
>> pid:241865
>> >> tgid:241865 ppid:241864 flags:0x00004006
>> >> [836583.959592] Call Trace:
>> >> [836583.959595] <TASK>
>> >> [836583.959600] __schedule+0x400/0x1720
>> >> [836583.959612] ? mod_delayed_work_on+0xa4/0xb0
>> >> [836583.959622] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959628] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959631] ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>> >> [836583.959640] schedule+0x27/0xf0
>> >> [836583.959644] schedule_timeout+0x12f/0x160
>> >> [836583.959652] io_schedule_timeout+0x51/0x70
>> >> [836583.959657] wait_for_completion_io+0x8a/0x160
>> >> [836583.959663] submit_bio_wait+0x60/0x90
>> >> [836583.959671] blkdev_issue_discard+0x91/0x100
>> >> [836583.959680] btrfs_issue_discard+0xc4/0x140
>> >> [836583.959689] btrfs_discard_extent+0x241/0x2a0
>> >> [836583.959695] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959702] do_trimming+0xd2/0x240
>> >> [836583.959712] trim_bitmaps+0x350/0x4c0
>> >> [836583.959723] btrfs_trim_block_group+0xb8/0x110
>> >> [836583.959729] btrfs_trim_fs+0x118/0x440
>> >> [836583.959734] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959738] ? security_capable+0x41/0x70
>> >> [836583.959746] btrfs_ioctl_fitrim+0x113/0x180
>> >> [836583.959752] btrfs_ioctl+0xdaf/0x2670
>> >> [836583.959759] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959763] ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>> >> [836583.959774] __x64_sys_ioctl+0x94/0xd0
>> >> [836583.959782] do_syscall_64+0x82/0x160
>> >> [836583.959790] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959793] ? syscall_exit_to_user_mode+0x72/0x220
>> >> [836583.959799] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959802] ? do_syscall_64+0x8e/0x160
>> >> [836583.959807] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959811] ? do_sys_openat2+0x9c/0xe0
>> >> [836583.959821] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959825] ? syscall_exit_to_user_mode+0x72/0x220
>> >> [836583.959828] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959832] ? do_syscall_64+0x8e/0x160
>> >> [836583.959835] ? syscall_exit_to_user_mode+0x72/0x220
>> >> [836583.959838] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959842] ? do_syscall_64+0x8e/0x160
>> >> [836583.959845] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959849] ? do_syscall_64+0x8e/0x160
>> >> [836583.959851] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959855] ? do_syscall_64+0x8e/0x160
>> >> [836583.959858] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959861] ? do_syscall_64+0x8e/0x160
>> >> [836583.959864] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959868] ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [836583.959873] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> >> [836583.959878] RIP: 0033:0x7f3e4261af2d
>> >> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246
>> ORIG_RAX:
>> >> 0000000000000010
>> >> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>> >> 00007f3e4261af2d
>> >> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>> >> 0000000000000003
>> >> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>> >> 00007ffec002e7f2
>> >> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>> >> 0000562d74daafc0
>> >> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>> >> 0000562d74daad40
>> >> [836583.959967] </TASK>
>> >> ---
>> >> fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>> >> 1 file changed, 16 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> >> index feec49e6f9c8..7e4c1d4f2f7c 100644
>> >> --- a/fs/btrfs/extent-tree.c
>> >> +++ b/fs/btrfs/extent-tree.c
>> >> @@ -16,6 +16,7 @@
>> >> #include <linux/percpu_counter.h>
>> >> #include <linux/lockdep.h>
>> >> #include <linux/crc32c.h>
>> >> +#include <linux/freezer.h>
>> >> #include "ctree.h"
>> >> #include "extent-tree.h"
>> >> #include "transaction.h"
>> >> @@ -6361,6 +6362,11 @@ void btrfs_error_unpin_extent_range(struct
>> >> btrfs_fs_info *fs_info, u64 start, u6
>> >> unpin_extent_range(fs_info, start, end, false);
>> >> }
>> >> +static bool btrfs_trim_interrupted(void)
>> >> +{
>> >> + return fatal_signal_pending(current) || freezing(current);
>> >> +}
>> >> +
>> >> /*
>> >> * It used to be that old block groups would be left around
>> forever.
>> >> * Iterating over them would be enough to trim unused space.
>> Since we
>> >> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>> >> btrfs_device *device, u64 *trimmed)
>> >> start += len;
>> >> *trimmed += bytes;
>> >> - if (fatal_signal_pending(current)) {
>> >> - ret = -ERESTARTSYS;
>> >> + if (btrfs_trim_interrupted()) {
>> >> + ret = 0;
>> >> break;
>>
>> Here we should still return the same error number other than 0, to
>> let
>> the caller know the operation is interrupted, other than finished
>> normally.
>>
>> Here I was following how ext4 did it, my explanation for that was that
>> the kernel
>> may have still discarded some data before the thread was interrupted
>> thus it made
>> sense to report success.
>
> In that case, progress is reported through fstrim_range structure, not
> through the return value.
> Even if we returned some error code, the fstrim_range::len is still
> updated to indicate the progress.
>
> So we need to keep the error code.
>
>>
>>
>> >> }
>> >> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info
>> *fs_info,
>> >> struct fstrim_range *range)
>> >> cache = btrfs_lookup_first_block_group(fs_info,
>> range->start);
>> >> for (; cache; cache = btrfs_next_block_group(cache)) {
>> >> + if (btrfs_trim_interrupted())
>> >> + break;
>> >> +
>>
>> The same here.
>>
>> >> if (cache->start >= range_end) {
>> >> btrfs_put_block_group(cache);
>> >> break;
>> >> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>> >> *fs_info, struct fstrim_range *range)
>> >> mutex_lock(&fs_devices->device_list_mutex);
>> >> list_for_each_entry(device, &fs_devices->devices,
>> dev_list) {
>> >> + if (btrfs_trim_interrupted())
>> >> + break;
>> >> +
>>
>> The same here.
>>
>> Furthermore, I think we may not need the extra checks.
>>
>> The fstrim is based on block groups, and a block group is normally
>> 1GiB,
>> at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block
>> group
>> boundary should be enough to meet the hibernation/suspension timeout.
>>
>> That's probably true, but 10seconds here wasn't enough and forcing the
>> early return in the other cases
>> was also required.
>> I tried the current patch you linked earlier in my testing and that was
>> the conclusion that led to me adding more checks.
>
> My bad, I forgot that we have free extents trimming, which is not
> limited to block group boundary, and that may be the root cause.
>
> So in that case your extra checks are indeed needed.
>
> Just need to change the return value.
>>
>>
>>
>>
>> >> if (test_bit(BTRFS_DEV_STATE_MISSING,
>> &device->dev_state))
>> >> continue;
>> >> ret = btrfs_trim_free_extents(device, &group_trimmed);
>> >> +
>> >> + trimmed += group_trimmed;
>> >> if (ret) {
>> >> dev_failed++;
>> >> dev_ret = ret;
>> >> break;
>> >> }
>> >> -
>> >> - trimmed += group_trimmed;
>>
>> Any special reason moving the code here?
>>
>> Same as not returning errno before in case of interrupt, I checked the
>> code paths and it's still
>> possible to trim some data (group_trimmed != 0) even in case of failure.
>
> Oh, then it's fine.
>
> Except the return code, everything looks fine to me now.
Forgot to mention that, even for error case, we should copy the
fstrim_range structure to the ioctl parameter to indicate any progress
we made.
Thanks,
Qu
>
> Just please update the commit message to explicitly mention that, we
> have a free extent discarding phase, which can trim a lot of unallocated
> space, and there is no limits on the trim size (unlike the block group
> part).
>
> Thanks,
> Qu
>>
>> Thanks,
>> Qu
>>
>> >> }
>> >> mutex_unlock(&fs_devices->device_list_mutex);
>> >
>>
>
Powered by blists - more mailing lists