[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9119934f-fb61-3b55-655c-9a7552e0b30b@gmail.com>
Date: Sun, 25 Jul 2021 14:19:52 +0800
From: Desmond Cheong Zhi Xi <desmondcheongzx@...il.com>
To: dsterba@...e.cz, clm@...com, josef@...icpanda.com,
dsterba@...e.com, anand.jain@...cle.com,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
skhan@...uxfoundation.org, gregkh@...uxfoundation.org,
linux-kernel-mentees@...ts.linuxfoundation.org,
syzbot+a70e2ad0879f160b9217@...kaller.appspotmail.com
Subject: Re: [PATCH] btrfs: fix rw device counting in
__btrfs_free_extra_devids
On 22/7/21 1:59 am, David Sterba wrote:
> On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
>> Syzbot reports a warning in close_fs_devices that happens because
>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>> on each device.
>>
>> This happens when a writeable device is removed in
>> __btrfs_free_extra_devids, but the rw device count is not decremented
>> accordingly. So when close_fs_devices is called, the removed device is
>> still counted and we get an off by 1 error.
>>
>> Here is one call trace that was observed:
>> btrfs_mount_root():
>> btrfs_scan_one_device():
>> device_list_add(); <---------------- device added
>> btrfs_open_devices():
>> open_fs_devices():
>> btrfs_open_one_device(); <-------- rw device count ++
>> btrfs_fill_super():
>> open_ctree():
>> btrfs_free_extra_devids():
>> __btrfs_free_extra_devids(); <--- device removed
>> fail_tree_roots:
>> btrfs_close_devices():
>> close_fs_devices(); <------- rw device count off by 1
>>
>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
>
> What this patch did in the last hunk was the rw_devices decrement, but
> conditional:
>
> @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
> if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
> list_del_init(&device->dev_alloc_list);
> clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> - if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> - &device->dev_state))
> - fs_devices->rw_devices--;
> }
> list_del_init(&device->dev_list);
> fs_devices->num_devices--;
> ---
>
>
>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>> if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>> list_del_init(&device->dev_alloc_list);
>> clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>> + fs_devices->rw_devices--;
>> }
>> list_del_init(&device->dev_list);
>> fs_devices->num_devices--;
>
> So should it be reinstated in the original form? The rest of
> cf89af146b7e handles unexpected device replace item during mount.
>
> Adding the decrement is correct, but right now I'm not sure about the
> corner case when teh devcie has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
> The state machine of the device bits and counters is not trivial so
> fixing it one way or the other could lead to further syzbot reports if
> we don't understand the issue.
>
Hi David,
Thanks for raising this issue. I took a closer look and I think we don't
have to reinstate the original form because it's a historical artifact.
The short version of the story is that going by the intention of
__btrfs_free_extra_devids, we skip removing the replace target device.
Hence, by the time we've reached the decrement in question, the device
is not the replace target device and the BTRFS_DEV_STATE_REPLACE_TGT bit
should not be set.
But we should also try to understand the original intention of the code.
The check in question was first introduced in commit 8dabb7420f01
("Btrfs: change core code of btrfs to support the device replace
operations"):
> @@ -536,7 +553,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
> if (device->writeable) {
> list_del_init(&device->dev_alloc_list);
> device->writeable = 0;
> - fs_devices->rw_devices--;
> + if (!device->is_tgtdev_for_dev_replace)
> + fs_devices->rw_devices--;
> }
> list_del_init(&device->dev_list);
> fs_devices->num_devices--;
If we take a trip back in time to this commit we see that
btrfs_dev_replace_finishing added the target device to the alloc list
without incrementing the rw_devices count. So this check was likely
originally meant to prevent under-counting of rw_devices.
However, the situation has changed, following various fixes to
rw_devices counting. Commit 63dd86fa79db ("btrfs: fix rw_devices miss
match after seed replace") added an increment to rw_devices when
replacing a seed device with a writable one in btrfs_dev_replace_finishing:
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1b2fda..fb0a7fa2f70c 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -562,6 +562,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> if (fs_info->fs_devices->latest_bdev == src_device->bdev)
> fs_info->fs_devices->latest_bdev = tgt_device->bdev;
> list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> + if (src_device->fs_devices->seeding)
> + fs_info->fs_devices->rw_devices++;
>
> /* replace the sysfs entry */
> btrfs_kobj_rm_device(fs_info, src_device);
This was later simplified in commit 82372bc816d7 ("Btrfs: make the logic
of source device removing more clear") that simply decremented
rw_devices in btrfs_rm_dev_replace_srcdev if the replaced device was
writable. This meant that the rw_devices count could be incremented in
btrfs_dev_replace_finishing without any checks:
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index e9cbbdb72978..6f662b34ba0e 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
> if (fs_info->fs_devices->latest_bdev == src_device->bdev)
> fs_info->fs_devices->latest_bdev = tgt_device->bdev;
> list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> - if (src_device->fs_devices->seeding)
> - fs_info->fs_devices->rw_devices++;
> + fs_info->fs_devices->rw_devices++;
>
> /* replace the sysfs entry */
> btrfs_kobj_rm_device(fs_info, src_device);
Thus, given the current state of the code base, the original check is
now incorrect, because we want to decrement rw_devices as long as the
device is being removed from the alloc list.
To further convince ourselves of this, we can take a closer look at the
relation between the device with devid BTRFS_DEV_REPLACE_DEVID and the
BTRFS_DEV_STATE_REPLACE_TGT bit for devices.
BTRFS_DEV_STATE_REPLACE_TGT is set in two places:
- btrfs_init_dev_replace_tgtdev
- btrfs_init_dev_replace
In btrfs_init_dev_replace_tgtdev, the BTRFS_DEV_STATE_REPLACE_TGT bit is
set for a device allocated with devid BTRFS_DEV_REPLACE_DEVID.
In btrfs_init_dev_replace, the BTRFS_DEV_STATE_REPLACE_TGT bit is set
for the target device found with devid BTRFS_DEV_REPLACE_DEVID.
From both cases, we see that the BTRFS_DEV_STATE_REPLACE_TGT bit is set
only for the device with devid BTRFS_DEV_REPLACE_DEVID.
It follows that if a device does not have devid BTRFS_DEV_REPLACE_DEVID,
then the BTRFS_DEV_STATE_REPLACE_TGT bit will not be set.
With commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't
have replace item with target device"), we skip removing the device in
__btrfs_free_extra_devids as long as the devid is BTRFS_DEV_REPLACE_DEVID:
> - if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
> - /*
> - * In the first step, keep the device which has
> - * the correct fsid and the devid that is used
> - * for the dev_replace procedure.
> - * In the second step, the dev_replace state is
> - * read from the device tree and it is known
> - * whether the procedure is really active or
> - * not, which means whether this device is
> - * used or whether it should be removed.
> - */
> - if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> - &device->dev_state)) {
> - continue;
> - }
> - }
> + /*
> + * We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
> + * in btrfs_init_dev_replace() so just continue.
> + */
> + if (device->devid == BTRFS_DEV_REPLACE_DEVID)
> + continue;
Given the discussion above, after we fail the check for device->devid ==
BTRFS_DEV_REPLACE_DEVID, all devices from that point are not the replace
target device, and do not have the BTRFS_DEV_STATE_REPLACE_TGT bit set.
So the original check for the BTRFS_DEV_STATE_REPLACE_TGT bit before
incrementing rw_devices is not just incorrect at this point, it's also
redundant.
Of course, I would hate to introduce a hard-to-find bug with a bad
analysis, so any thoughts on this would be appreciated.
Best wishes,
Desmond
Powered by blists - more mailing lists