[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <50878e07-3b16-18e5-b4d9-da1cb7a05139@huawei.com>
Date: Sat, 8 Apr 2023 11:19:25 +0800
From: Baokun Li <libaokun1@...wei.com>
To: <linux-fsdevel@...r.kernel.org>
CC: <viro@...iv.linux.org.uk>, <brauner@...nel.org>, <tj@...nel.org>,
<tytso@....edu>, <adilger.kernel@...ger.ca>, <jack@...e.cz>,
<ritesh.list@...il.com>, <linux-ext4@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
<yangerkun@...wei.com>, <yukuai3@...wei.com>,
<stable@...r.kernel.org>, Baokun Li <libaokun1@...wei.com>
Subject: Re: [RFC PATCH] writeback, cgroup: fix null-ptr-deref write in
bdi_split_work_to_wbs
Sorry for the noise, my understanding of this issue is wrong.
Please ignore this patch.
I will send a v2 patch.
On 2023/4/6 22:02, Baokun Li wrote:
> KASAN report null-ptr-deref:
> ==================================================================
> BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x6ce/0x6e0
> Write of size 8 at addr 0000000000000000 by task syz-executor.3/3514
>
> CPU: 3 PID: 3514 Comm: syz-executor.3 Not tainted 5.10.0-dirty #1
> Call Trace:
> dump_stack+0xbe/0xfd
> __kasan_report.cold+0x34/0x84
> kasan_report+0x3a/0x50
> check_memory_region+0xfd/0x1f0
> bdi_split_work_to_wbs+0x6ce/0x6e0
> __writeback_inodes_sb_nr+0x184/0x1f0
> try_to_writeback_inodes_sb+0x7f/0xa0
> ext4_nonda_switch+0x125/0x130
> ext4_da_write_begin+0x126/0x6e0
> generic_perform_write+0x199/0x3a0
> ext4_buffered_write_iter+0x16d/0x2b0
> ext4_file_write_iter+0xea/0x140
> new_sync_write+0x2fa/0x430
> vfs_write+0x4a1/0x570
> ksys_write+0xf6/0x1f0
> do_syscall_64+0x30/0x40
> entry_SYSCALL_64_after_hwframe+0x61/0xc6
> RIP: 0033:0x45513d
> [...]
> ==================================================================
>
> Above issue may happen as follows:
>
> cpu1 cpu2
> ----------------------------|----------------------------
> ext4_nonda_switch
> try_to_writeback_inodes_sb
> __writeback_inodes_sb_nr
> bdi_split_work_to_wbs
> kmalloc(sizeof(*work), GFP_ATOMIC) ---> alloc mem failed
> inode_switch_wbs
> inode_switch_wbs_work_fn
> wb_put_many
> percpu_ref_put_many
> ref->data->release(ref)
> cgwb_release
> &wb->release_work
> cgwb_release_workfn
> percpu_ref_exit
> ref->data = NULL
> kfree(data)
> wb_get(wb)
> percpu_ref_get(&wb->refcnt)
> percpu_ref_get_many(ref, 1)
> atomic_long_add(nr, &ref->data->count) ---> ref->data = NULL
> atomic64_add(i, v) ---> trigger null-ptr-deref
>
> bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs.
> If the allocation of new work fails, the on-stack fallback will be used and
> the reference count of the current wb is increased afterwards. If cgroup
> writeback membership switches occur before getting the reference count and
> the current wb is released as old_wd, then calling wb_get() or wb_put()
> will trigger the null pointer dereference above.
>
> A similar problem is fixed in commit 7fc5854f8c6e ("writeback: synchronize
> sync(2) against cgroup writeback membership switches"), but the patch only
> adds locks to sync_inodes_sb() and not to the __writeback_inodes_sb_nr()
> function that also calls bdi_split_work_to_wbs() function. So avoid the
> above race by adding the same lock to __writeback_inodes_sb_nr() and
> expanding the range of wb_switch_rwsem held in inode_switch_wbs_work_fn().
>
> Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones")
> Cc: stable@...r.kernel.org
> Signed-off-by: Baokun Li <libaokun1@...wei.com>
> ---
> fs/fs-writeback.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 195dc23e0d83..52825aaf549b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -506,13 +506,13 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
> spin_unlock(&new_wb->list_lock);
> spin_unlock(&old_wb->list_lock);
>
> - up_read(&bdi->wb_switch_rwsem);
> -
> if (nr_switched) {
> wb_wakeup(new_wb);
> wb_put_many(old_wb, nr_switched);
> }
>
> + up_read(&bdi->wb_switch_rwsem);
> +
> for (inodep = isw->inodes; *inodep; inodep++)
> iput(*inodep);
> wb_put(new_wb);
> @@ -936,6 +936,11 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
> * have dirty inodes. If @base_work->nr_page isn't %LONG_MAX, it's
> * distributed to the busy wbs according to each wb's proportion in the
> * total active write bandwidth of @bdi.
> + *
> + * Called under &bdi->wb_switch_rwsem, otherwise bdi_split_work_to_wbs()
> + * may race against cgwb (cgroup writeback) membership switches, which may
> + * cause some inodes to fail to write back, or even trigger a null pointer
> + * dereference using a freed wb.
> */
> static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
> struct wb_writeback_work *base_work,
> @@ -2637,8 +2642,11 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
> return;
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> + /* protect against inode wb switch, see inode_switch_wbs_work_fn() */
> + bdi_down_write_wb_switch_rwsem(bdi);
> bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
> wb_wait_for_completion(&done);
> + bdi_up_write_wb_switch_rwsem(bdi);
> }
>
> /**
--
With Best Regards,
Baokun Li
.
Powered by blists - more mailing lists