Message-ID: <39398f63-9220-c5ab-04a9-6e5186e1c0da@gmail.com>
Date: Tue, 27 Jul 2021 07:07:29 +0800
From: Desmond Cheong Zhi Xi <desmondcheongzx@...il.com>
To: dsterba@...e.cz, clm@...com, josef@...icpanda.com,
dsterba@...e.com, anand.jain@...cle.com,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
skhan@...uxfoundation.org, gregkh@...uxfoundation.org,
linux-kernel-mentees@...ts.linuxfoundation.org,
syzbot+a70e2ad0879f160b9217@...kaller.appspotmail.com
Subject: Re: [PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids
On 27/7/21 1:52 am, David Sterba wrote:
> On Sun, Jul 25, 2021 at 02:19:52PM +0800, Desmond Cheong Zhi Xi wrote:
>> On 22/7/21 1:59 am, David Sterba wrote:
>>> On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
>>>> Syzbot reports a warning in close_fs_devices that happens because
>>>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>>>> on each device.
>>>>
>>>> This happens when a writeable device is removed in
>>>> __btrfs_free_extra_devids, but the rw device count is not decremented
>>>> accordingly. So when close_fs_devices is called, the removed device is
>>>> still counted and we get an off by 1 error.
>>>>
>>>> Here is one call trace that was observed:
>>>>   btrfs_mount_root():
>>>>     btrfs_scan_one_device():
>>>>       device_list_add();   <---------------- device added
>>>>     btrfs_open_devices():
>>>>       open_fs_devices():
>>>>         btrfs_open_one_device();   <-------- rw device count ++
>>>>     btrfs_fill_super():
>>>>       open_ctree():
>>>>         btrfs_free_extra_devids():
>>>>           __btrfs_free_extra_devids();  <--- device removed
>>>>         fail_tree_roots:
>>>>           btrfs_close_devices():
>>>>             close_fs_devices();   <------- rw device count off by 1
>>>>
>>>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
>>>
>>> What this patch did in the last hunk was the rw_devices decrement, but
>>> conditional:
>>>
>>> @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>>  		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>>  			list_del_init(&device->dev_alloc_list);
>>>  			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>>> -			if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
>>> -				      &device->dev_state))
>>> -				fs_devices->rw_devices--;
>>>  		}
>>>  		list_del_init(&device->dev_list);
>>>  		fs_devices->num_devices--;
>>> ---
>>>
>>>
>>>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>>>  		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>>>  			list_del_init(&device->dev_alloc_list);
>>>>  			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>>>> +			fs_devices->rw_devices--;
>>>>  		}
>>>>  		list_del_init(&device->dev_list);
>>>>  		fs_devices->num_devices--;
>>>
>>> So should it be reinstated in the original form? The rest of
>>> cf89af146b7e handles unexpected device replace item during mount.
>>>
>>> Adding the decrement is correct, but right now I'm not sure about the
>>> corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
>>> The state machine of the device bits and counters is not trivial so
>>> fixing it one way or the other could lead to further syzbot reports if
>>> we don't understand the issue.
>>>
>>
>> Hi David,
>>
>> Thanks for raising this issue. I took a closer look and I think we don't
>> have to reinstate the original form because it's a historical artifact.
>>
>> The short version of the story is that going by the intention of
>> __btrfs_free_extra_devids, we skip removing the replace target device.
>> Hence, by the time we've reached the decrement in question, the device
>> is not the replace target device and the BTRFS_DEV_STATE_REPLACE_TGT bit
>> should not be set.
>>
>> But we should also try to understand the original intention of the code.
>> The check in question was first introduced in commit 8dabb7420f01
>> ("Btrfs: change core code of btrfs to support the device replace
>> operations"):
>>> @@ -536,7 +553,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
>>>  		if (device->writeable) {
>>>  			list_del_init(&device->dev_alloc_list);
>>>  			device->writeable = 0;
>>> -			fs_devices->rw_devices--;
>>> +			if (!device->is_tgtdev_for_dev_replace)
>>> +				fs_devices->rw_devices--;
>>>  		}
>>>  		list_del_init(&device->dev_list);
>>>  		fs_devices->num_devices--;
>>
>> If we take a trip back in time to this commit we see that
>> btrfs_dev_replace_finishing added the target device to the alloc list
>> without incrementing the rw_devices count. So this check was likely
>> originally meant to prevent under-counting of rw_devices.
>>
>> However, the situation has changed, following various fixes to
>> rw_devices counting. Commit 63dd86fa79db ("btrfs: fix rw_devices miss
>> match after seed replace") added an increment to rw_devices when
>> replacing a seed device with a writable one in btrfs_dev_replace_finishing:
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index eea26e1b2fda..fb0a7fa2f70c 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -562,6 +562,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>  	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>>>  		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>>>  	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
>>> +	if (src_device->fs_devices->seeding)
>>> +		fs_info->fs_devices->rw_devices++;
>>>
>>>  	/* replace the sysfs entry */
>>>  	btrfs_kobj_rm_device(fs_info, src_device);
>>
>> This was later simplified in commit 82372bc816d7 ("Btrfs: make the logic
>> of source device removing more clear") that simply decremented
>> rw_devices in btrfs_rm_dev_replace_srcdev if the replaced device was
>> writable. This meant that the rw_devices count could be incremented in
>> btrfs_dev_replace_finishing without any checks:
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index e9cbbdb72978..6f662b34ba0e 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>>>  	if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>>>  		fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>>>  	list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
>>> -	if (src_device->fs_devices->seeding)
>>> -		fs_info->fs_devices->rw_devices++;
>>> +	fs_info->fs_devices->rw_devices++;
>>>
>>>  	/* replace the sysfs entry */
>>>  	btrfs_kobj_rm_device(fs_info, src_device);
>>
>> Thus, given the current state of the code base, the original check is
>> now incorrect, because we want to decrement rw_devices as long as the
>> device is being removed from the alloc list.
>>
>> To further convince ourselves of this, we can take a closer look at the
>> relation between the device with devid BTRFS_DEV_REPLACE_DEVID and the
>> BTRFS_DEV_STATE_REPLACE_TGT bit for devices.
>>
>> BTRFS_DEV_STATE_REPLACE_TGT is set in two places:
>> - btrfs_init_dev_replace_tgtdev
>> - btrfs_init_dev_replace
>>
>> In btrfs_init_dev_replace_tgtdev, the BTRFS_DEV_STATE_REPLACE_TGT bit is
>> set for a device allocated with devid BTRFS_DEV_REPLACE_DEVID.
>>
>> In btrfs_init_dev_replace, the BTRFS_DEV_STATE_REPLACE_TGT bit is set
>> for the target device found with devid BTRFS_DEV_REPLACE_DEVID.
>>
>> From both cases, we see that the BTRFS_DEV_STATE_REPLACE_TGT bit is set
>> only for the device with devid BTRFS_DEV_REPLACE_DEVID.
>>
>> It follows that if a device does not have devid BTRFS_DEV_REPLACE_DEVID,
>> then the BTRFS_DEV_STATE_REPLACE_TGT bit will not be set.
>>
>> With commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't
>> have replace item with target device"), we skip removing the device in
>> __btrfs_free_extra_devids as long as the devid is BTRFS_DEV_REPLACE_DEVID:
>>> -		if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
>>> -			/*
>>> -			 * In the first step, keep the device which has
>>> -			 * the correct fsid and the devid that is used
>>> -			 * for the dev_replace procedure.
>>> -			 * In the second step, the dev_replace state is
>>> -			 * read from the device tree and it is known
>>> -			 * whether the procedure is really active or
>>> -			 * not, which means whether this device is
>>> -			 * used or whether it should be removed.
>>> -			 */
>>> -			if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
>>> -						  &device->dev_state)) {
>>> -				continue;
>>> -			}
>>> -		}
>>> +		/*
>>> +		 * We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
>>> +		 * in btrfs_init_dev_replace() so just continue.
>>> +		 */
>>> +		if (device->devid == BTRFS_DEV_REPLACE_DEVID)
>>> +			continue;
>>
>> Given the discussion above, after we fail the check for device->devid ==
>> BTRFS_DEV_REPLACE_DEVID, all devices from that point are not the replace
>> target device, and do not have the BTRFS_DEV_STATE_REPLACE_TGT bit set.
>>
>> So the original check for the BTRFS_DEV_STATE_REPLACE_TGT bit before
>> decrementing rw_devices is not just incorrect at this point, it's also
>> redundant.
>
> Could you please write some condensed version of the above and resend?
> The original changelog says what happens and how; the analysis here
> is the actual explanation and I'd like to have that recorded. Thanks.
>
Sure thing, I'll prepare a v2 with an updated commit message. Thanks for
the feedback, David.
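
For readers following the thread, the counting problem can be illustrated
with a small user-space model. This is only a sketch under stated
assumptions: every name prefixed with toy_ is hypothetical and is not a
btrfs API, the structures are simplified stand-ins for struct btrfs_device
and struct btrfs_fs_devices, and the fix_applied flag simply toggles the
one-line decrement proposed in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures (hypothetical, not btrfs types). */
struct toy_device {
	bool writeable;	/* models BTRFS_DEV_STATE_WRITEABLE */
	bool present;	/* device is still on the fs_devices list */
};

struct toy_fs_devices {
	struct toy_device *devs;
	size_t ndevs;
	int num_devices;
	int rw_devices;
};

/* Models open_fs_devices(): each writeable device bumps rw_devices. */
static void toy_open_devices(struct toy_fs_devices *fs)
{
	fs->num_devices = 0;
	fs->rw_devices = 0;
	for (size_t i = 0; i < fs->ndevs; i++) {
		fs->devs[i].present = true;
		fs->num_devices++;
		if (fs->devs[i].writeable)
			fs->rw_devices++;
	}
}

/*
 * Models the removal path in __btrfs_free_extra_devids().
 * fix_applied selects whether rw_devices is decremented when a
 * writeable device is dropped (the change proposed in the patch).
 */
static void toy_free_extra_device(struct toy_fs_devices *fs, size_t idx,
				  bool fix_applied)
{
	assert(idx < fs->ndevs);
	struct toy_device *dev = &fs->devs[idx];

	if (!dev->present)
		return;
	if (dev->writeable) {
		dev->writeable = false;
		if (fix_applied)
			fs->rw_devices--;
	}
	dev->present = false;
	fs->num_devices--;
}

/*
 * Models close_fs_devices(): closes every remaining device and returns
 * the leftover rw_devices count, which the kernel expects to be 0.
 */
static int toy_close_devices(struct toy_fs_devices *fs)
{
	for (size_t i = 0; i < fs->ndevs; i++) {
		struct toy_device *dev = &fs->devs[i];

		if (!dev->present)
			continue;
		if (dev->writeable) {
			dev->writeable = false;
			fs->rw_devices--;
		}
		dev->present = false;
		fs->num_devices--;
	}
	return fs->rw_devices;
}

/* Scenario: two writeable devices, one removed as "extra" before close. */
static int toy_leftover_rw(bool fix_applied)
{
	struct toy_device devs[2] = { { true, false }, { true, false } };
	struct toy_fs_devices fs = { devs, 2, 0, 0 };

	toy_open_devices(&fs);
	toy_free_extra_device(&fs, 1, fix_applied);
	return toy_close_devices(&fs);
}
```

Calling toy_leftover_rw(false) leaves a stale count behind, which
corresponds to the off-by-one that triggers the warning in
close_fs_devices; toy_leftover_rw(true) returns the count to zero.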