linux-kernel - Re: [PATCH] btrfs: Don't block system suspend during fstrim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <072aba7d-6571-4b22-875d-3409c0eefe40@gmail.com>
Date: Mon, 2 Sep 2024 11:31:09 +0200
From: Luca Stefani <luca.stefani.ge1@...il.com>
To: Qu Wenruo <quwenruo.btrfs@....com>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
 David Sterba <dsterba@...e.com>, linux-btrfs@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] btrfs: Don't block system suspend during fstrim



On 02/09/24 11:17, Qu Wenruo wrote:
> 
> 
> 在 2024/9/2 18:42, Qu Wenruo 写道:
>>
>>
>> 在 2024/9/2 18:30, Luca Stefani 写道:
>>>
>>>
>>> On Mon, 2 Sept 2024 at 10:49, Qu Wenruo <quwenruo.btrfs@....com
>>> <mailto:quwenruo.btrfs@....com>> wrote:
>>>
>>>
>>>
>>>     在 2024/9/2 18:02, Luca Stefani 写道:
>>>      > Any update on this? It's not critical but I'd like to know if 
>>> it's in
>>>      > some part proper.
>>>      > Thanks, Luca.
>>>
>>>     Sorry I didn't see your patch in the list, thus sent a different 
>>> fix for
>>>     it later:
>>>
>>> https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t <https://lore.kernel.org/linux-btrfs/20240830185113.GW25962@twin.jikos.cz/T/#t>
>>>
>>>      >> Sometimes the system isn't able to suspend because
>>>      >> the task responsible for trimming the device isn't
>>>      >> able to finish in time.
>>>      >>
>>>      >> Since discard isn't a critical call it can be interrupted
>>>      >> at any time, we can simply report the amount of discarded
>>>      >> bytes in such cases and stop the trim.
>>>      >>
>>>      >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219180
>>>     <https://bugzilla.kernel.org/show_bug.cgi?id=219180>
>>>      >> Signed-off-by: Luca Stefani <luca.stefani.ge1@...il.com
>>>     <mailto:luca.stefani.ge1@...il.com>>
>>>      >> ---
>>>      >> I have no idea if that's correct, just something I implemented
>>>      >> looking at the same solution made in ext4 by 5229a658f645.
>>>      >>
>>>      >> The patch in itself seems to solve the issue.
>>>      >>
>>>      >> repro is as follows:
>>>      >> sudo /sbin/fstrim --listed-in /etc/fstab:/proc/self/mountinfo
>>>      >> --verbose --quiet-unsupported &
>>>      >> sudo ./sleepgraph.py -m mem -rtcwake 5
>>>      >>
>>>      >> [836563.289069] PM: suspend exit
>>>      >> [836563.909298] PM: suspend entry (s2idle)
>>>      >> [836563.935447] Filesystems sync: 0.026 seconds
>>>      >> [836563.951391] Freezing user space processes
>>>      >> [836583.958957] Freezing user space processes failed after 
>>> 20.007
>>>      >> seconds (1 tasks refusing to freeze, wq_busy=0):
>>>      >> [836583.959582] task:fstrim          state:D stack:0 pid:241865
>>>      >> tgid:241865 ppid:241864 flags:0x00004006
>>>      >> [836583.959592] Call Trace:
>>>      >> [836583.959595]  <TASK>
>>>      >> [836583.959600]  __schedule+0x400/0x1720
>>>      >> [836583.959612]  ? mod_delayed_work_on+0xa4/0xb0
>>>      >> [836583.959622]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959628]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959631]  ? blk_mq_flush_plug_list.part.0+0x1e3/0x610
>>>      >> [836583.959640]  schedule+0x27/0xf0
>>>      >> [836583.959644]  schedule_timeout+0x12f/0x160
>>>      >> [836583.959652]  io_schedule_timeout+0x51/0x70
>>>      >> [836583.959657]  wait_for_completion_io+0x8a/0x160
>>>      >> [836583.959663]  submit_bio_wait+0x60/0x90
>>>      >> [836583.959671]  blkdev_issue_discard+0x91/0x100
>>>      >> [836583.959680]  btrfs_issue_discard+0xc4/0x140
>>>      >> [836583.959689]  btrfs_discard_extent+0x241/0x2a0
>>>      >> [836583.959695]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959702]  do_trimming+0xd2/0x240
>>>      >> [836583.959712]  trim_bitmaps+0x350/0x4c0
>>>      >> [836583.959723]  btrfs_trim_block_group+0xb8/0x110
>>>      >> [836583.959729]  btrfs_trim_fs+0x118/0x440
>>>      >> [836583.959734]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959738]  ? security_capable+0x41/0x70
>>>      >> [836583.959746]  btrfs_ioctl_fitrim+0x113/0x180
>>>      >> [836583.959752]  btrfs_ioctl+0xdaf/0x2670
>>>      >> [836583.959759]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959763]  ? ioctl_has_perm.constprop.0.isra.0+0xd8/0x130
>>>      >> [836583.959774]  __x64_sys_ioctl+0x94/0xd0
>>>      >> [836583.959782]  do_syscall_64+0x82/0x160
>>>      >> [836583.959790]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959793]  ? syscall_exit_to_user_mode+0x72/0x220
>>>      >> [836583.959799]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959802]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959807]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959811]  ? do_sys_openat2+0x9c/0xe0
>>>      >> [836583.959821]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959825]  ? syscall_exit_to_user_mode+0x72/0x220
>>>      >> [836583.959828]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959832]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959835]  ? syscall_exit_to_user_mode+0x72/0x220
>>>      >> [836583.959838]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959842]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959845]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959849]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959851]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959855]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959858]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959861]  ? do_syscall_64+0x8e/0x160
>>>      >> [836583.959864]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959868]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>      >> [836583.959873]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>      >> [836583.959878] RIP: 0033:0x7f3e4261af2d
>>>      >> [836583.959944] RSP: 002b:00007ffec002f400 EFLAGS: 00000246
>>>     ORIG_RAX:
>>>      >> 0000000000000010
>>>      >> [836583.959950] RAX: ffffffffffffffda RBX: 00007ffec002f570 RCX:
>>>      >> 00007f3e4261af2d
>>>      >> [836583.959952] RDX: 00007ffec002f470 RSI: 00000000c0185879 RDI:
>>>      >> 0000000000000003
>>>      >> [836583.959955] RBP: 00007ffec002f450 R08: 0000562d74da7010 R09:
>>>      >> 00007ffec002e7f2
>>>      >> [836583.959957] R10: 0000000000000000 R11: 0000000000000246 R12:
>>>      >> 0000562d74daafc0
>>>      >> [836583.959960] R13: 0000000000000003 R14: 0000562d74daa970 R15:
>>>      >> 0000562d74daad40
>>>      >> [836583.959967]  </TASK>
>>>      >> ---
>>>      >>   fs/btrfs/extent-tree.c | 20 ++++++++++++++++----
>>>      >>   1 file changed, 16 insertions(+), 4 deletions(-)
>>>      >>
>>>      >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>      >> index feec49e6f9c8..7e4c1d4f2f7c 100644
>>>      >> --- a/fs/btrfs/extent-tree.c
>>>      >> +++ b/fs/btrfs/extent-tree.c
>>>      >> @@ -16,6 +16,7 @@
>>>      >>   #include <linux/percpu_counter.h>
>>>      >>   #include <linux/lockdep.h>
>>>      >>   #include <linux/crc32c.h>
>>>      >> +#include <linux/freezer.h>
>>>      >>   #include "ctree.h"
>>>      >>   #include "extent-tree.h"
>>>      >>   #include "transaction.h"
>>>      >> @@ -6361,6 +6362,11 @@ void 
>>> btrfs_error_unpin_extent_range(struct
>>>      >> btrfs_fs_info *fs_info, u64 start, u6
>>>      >>       unpin_extent_range(fs_info, start, end, false);
>>>      >>   }
>>>      >> +static bool btrfs_trim_interrupted(void)
>>>      >> +{
>>>      >> +    return fatal_signal_pending(current) || freezing(current);
>>>      >> +}
>>>      >> +
>>>      >>   /*
>>>      >>    * It used to be that old block groups would be left around
>>>     forever.
>>>      >>    * Iterating over them would be enough to trim unused space.
>>>     Since we
>>>      >> @@ -6459,8 +6465,8 @@ static int btrfs_trim_free_extents(struct
>>>      >> btrfs_device *device, u64 *trimmed)
>>>      >>           start += len;
>>>      >>           *trimmed += bytes;
>>>      >> -        if (fatal_signal_pending(current)) {
>>>      >> -            ret = -ERESTARTSYS;
>>>      >> +        if (btrfs_trim_interrupted()) {
>>>      >> +            ret = 0;
>>>      >>               break;
>>>
>>>     Here we should still return the same error number other than 0, 
>>> to let
>>>     the caller know the operation is interrupted, other than finished
>>>     normally.
>>>
>>> Here I was following how ext4 did it, my explanation for that was that
>>> the kernel
>>> may have still discarded some data before the thread was interrupted
>>> thus it made
>>> sense to report success.
>>
>> In that case, progress is reported through fstrim_range structure, not
>> through the return value.
>> Even if we returned some error code, the fstrim_range::len is still
>> updated to indicate the progress.
>>
>> So we need to keep the error code.
>>
>>>
>>>
>>>      >>           }
>>>      >> @@ -6508,6 +6514,9 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>>     *fs_info,
>>>      >> struct fstrim_range *range)
>>>      >>       cache = btrfs_lookup_first_block_group(fs_info, 
>>> range->start);
>>>      >>       for (; cache; cache = btrfs_next_block_group(cache)) {
>>>      >> +        if (btrfs_trim_interrupted())
>>>      >> +            break;
>>>      >> +
>>>
>>>     The same here.
>>>
>>>      >>           if (cache->start >= range_end) {
>>>      >>               btrfs_put_block_group(cache);
>>>      >>               break;
>>>      >> @@ -6547,17 +6556,20 @@ int btrfs_trim_fs(struct btrfs_fs_info
>>>      >> *fs_info, struct fstrim_range *range)
>>>      >>       mutex_lock(&fs_devices->device_list_mutex);
>>>      >>       list_for_each_entry(device, &fs_devices->devices, 
>>> dev_list) {
>>>      >> +        if (btrfs_trim_interrupted())
>>>      >> +            break;
>>>      >> +
>>>
>>>     The same here.
>>>
>>>     Furthermore, I think we may not need the extra checks.
>>>
>>>     The fstrim is based on block groups, and a block group is 
>>> normally 1GiB,
>>>     at most 10GiB (for RAID0/5/6/10 only), thus exiting at each block 
>>> group
>>>     boundary should be enough to meet the hibernation/suspension 
>>> timeout.
>>>
>>> That's probably true, but 10seconds here wasn't enough and forcing the
>>> early return in the other cases
>>> was also required.
>>> I tried the current patch you linked earlier in my testing and that was
>>> the conclusion that led to me adding more checks.
>>
>> My bad, I forgot that we have free extents trimming, which is not
>> limited to block group boundary, and that may be the root cause.
>>
>> So in that case your extra checks are indeed needed.
>>
>> Just need to change the return value.
>>>
>>>
>>>
>>>
>>>      >>           if (test_bit(BTRFS_DEV_STATE_MISSING, 
>>> &device->dev_state))
>>>      >>               continue;
>>>      >>           ret = btrfs_trim_free_extents(device, &group_trimmed);
>>>      >> +
>>>      >> +        trimmed += group_trimmed;
>>>      >>           if (ret) {
>>>      >>               dev_failed++;
>>>      >>               dev_ret = ret;
>>>      >>               break;
>>>      >>           }
>>>      >> -
>>>      >> -        trimmed += group_trimmed;
>>>
>>>     Any special reason moving the code here?
>>>
>>> Same as not returning errno before in case of interrupt, I checked the
>>> code paths and it's still
>>> possible to trim some data (group_trimmed != 0) even in case of failure.
>>
>> Oh, then it's fine.
>>
>> Except the return code, everything looks fine to me now.
> 
> Forgot to mention that, even for error case, we should copy the 
> fstrim_range structure to the ioctl parameter to indicate any progress 
> we made.
This seems to be already the case.
range->len = trimmed; is always executed regardless of previous failures
and there doesn't seem to be any early return.

Will try adding back the errno and try the repro.

Thanks.
> 
> Thanks,
> Qu
>>
>> Just please update the commit message to explicitly mention that, we
>> have a free extent discarding phase, which can trim a lot of unallocated
>> space, and there is no limits on the trim size (unlike the block group
>> part).
>>
>> Thanks,
>> Qu
>>>
>>>     Thanks,
>>>     Qu
>>>
>>>      >>       }
>>>      >>       mutex_unlock(&fs_devices->device_list_mutex);
>>>      >
>>>
>>