linux-kernel - Re: [PATCH v2] writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <500b0d76-ad73-e671-f6ba-eaa2c85b8b86@huawei.com>
Date:   Thu, 4 May 2023 10:29:25 +0800
From:   Baokun Li <libaokun1@...wei.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
CC:     <linux-fsdevel@...r.kernel.org>, <viro@...iv.linux.org.uk>,
        <brauner@...nel.org>, <jack@...e.cz>, <tj@...nel.org>,
        <dennis@...nel.org>, <adilger.kernel@...ger.ca>,
        <linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
        <yangerkun@...wei.com>, <houtao1@...wei.com>,
        <stable@...r.kernel.org>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Baokun Li <libaokun1@...wei.com>
Subject: Re: [PATCH v2] writeback, cgroup: fix null-ptr-deref write in
 bdi_split_work_to_wbs

On 2023/5/3 8:17, Andrew Morton wrote:
> On Mon, 10 Apr 2023 21:08:26 +0800 Baokun Li <libaokun1@...wei.com> wrote:
>
>> KASAN report null-ptr-deref:
>> ==================================================================
>> BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
>> Write of size 8 at addr 0000000000000000 by task sync/943
>> CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
>> Call Trace:
>>   <TASK>
>>   dump_stack_lvl+0x7f/0xc0
>>   print_report+0x2ba/0x340
>>   kasan_report+0xc4/0x120
>>   kasan_check_range+0x1b7/0x2e0
>>   __kasan_check_write+0x24/0x40
>>   bdi_split_work_to_wbs+0x5c5/0x7b0
>>   sync_inodes_sb+0x195/0x630
>>   sync_inodes_one_sb+0x3a/0x50
>>   iterate_supers+0x106/0x1b0
>>   ksys_sync+0x98/0x160
>> [...]
>> ==================================================================
>>
>> The race that causes the above issue is as follows:
>>
>>             cpu1                     cpu2
>> -------------------------|-------------------------
>> inode_switch_wbs
>>   INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
>>   queue_rcu_work(isw_wq, &isw->work)
>>   // queue_work async
>>    inode_switch_wbs_work_fn
>>     wb_put_many(old_wb, nr_switched)
>>      percpu_ref_put_many
>>       ref->data->release(ref)
>>       cgwb_release
>>        queue_work(cgwb_release_wq, &wb->release_work)
>>        // queue_work async
>>         &wb->release_work
>>         cgwb_release_workfn
>>                              ksys_sync
>>                               iterate_supers
>>                                sync_inodes_one_sb
>>                                 sync_inodes_sb
>>                                  bdi_split_work_to_wbs
>>                                   kmalloc(sizeof(*work), GFP_ATOMIC)
>>                                   // alloc memory failed
>>          percpu_ref_exit
>>           ref->data = NULL
>>           kfree(data)
>>                                   wb_get(wb)
>>                                    percpu_ref_get(&wb->refcnt)
>>                                     percpu_ref_get_many(ref, 1)
>>                                      atomic_long_add(nr, &ref->data->count)
>>                                       atomic64_add(i, v)
>>                                       // trigger null-ptr-deref
>>
>> bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all wbs.
>> If the allocation of new work fails, the on-stack fallback will be used and
>> the reference count of the current wb is increased afterwards. If cgroup
>> writeback membership switches occur before getting the reference count and
>> the current wb is released as old_wd, then calling wb_get() or wb_put()
>> will trigger the null pointer dereference above.
>>
>> This issue was introduced in v4.3-rc7 (see fix tag1). Both sync_inodes_sb()
>> and __writeback_inodes_sb_nr() calls to bdi_split_work_to_wbs() can trigger
>> this issue. For scenarios called via sync_inodes_sb(), originally commit
>> 7fc5854f8c6e ("writeback: synchronize sync(2) against cgroup writeback
>> membership switches") reduced the possibility of the issue by adding
>> wb_switch_rwsem, but in v5.14-rc1 (see fix tag2) removed the
>> "inode_io_list_del_locked(inode, old_wb)" from inode_switch_wbs_work_fn()
>> so that wb->state contains WB_has_dirty_io, thus old_wb is not skipped
>> when traversing wbs in bdi_split_work_to_wbs(), and the issue becomes
>> easily reproducible again.
>>
>> To solve this problem, percpu_ref_exit() is called under RCU protection
>> to avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
>> Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
>> and skip the current wb if wb_tryget() fails because the wb has already
>> been shutdown.
>>
>> Fixes: b817525a4a80 ("writeback: bdi_writeback iteration must not skip dying ones")
>> Fixes: f3b6a6df38aa ("writeback, cgroup: keep list of inodes attached to bdi_writeback")
> Cc Roman for this second commit.
Thanks! I forgot to cc to Roman!
>
>> Cc: stable@...r.kernel.org
> Having two Fixes: is awkward.  These serve as a guide to tell -stable
> maintainers which kernels need the fix.  Can we be more precise?
Yes! I'm very sorry about that. As I replied earlier, this issue was 
introduced in v4.3-rc7.
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -978,6 +978,16 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>>   			continue;
>>   		}
>>   
>> +		/*
>> +		 * If wb_tryget fails, the wb has been shutdown, skip it.
>> +		 *
>> +		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
>> +		 * continuing iteration from @wb after dropping and
>> +		 * regrabbing rcu read lock.
>> +		 */
>> +		if (!wb_tryget(wb))
>> +			continue;
>> +
>>   		/* alloc failed, execute synchronously using on-stack fallback */
>>   		work = &fallback_work;
>>   		*work = *base_work;
>> @@ -986,13 +996,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>>   		work->done = &fallback_work_done;
>>   
>>   		wb_queue_work(wb, work);
>> -
>> -		/*
>> -		 * Pin @wb so that it stays on @bdi->wb_list.  This allows
>> -		 * continuing iteration from @wb after dropping and
>> -		 * regrabbing rcu read lock.
>> -		 */
>> -		wb_get(wb);
>>   		last_wb = wb;
>>   
>>   		rcu_read_unlock();
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index ad011308cebe..43b48750b491 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -507,6 +507,15 @@ static LIST_HEAD(offline_cgwbs);
>>   static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
>>   static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
>>   
>> +static void cgwb_free_rcu(struct rcu_head *rcu_head)
>> +{
>> +	struct bdi_writeback *wb = container_of(rcu_head,
>> +			struct bdi_writeback, rcu);
> nit:
>
> 	struct bdi_writeback *wb;
>
> 	wb = container_of(rcu_head, struct bdi_writeback, rcu);
>
> looks nicer, no?

Yes, it looks neater.

>
>> +	percpu_ref_exit(&wb->refcnt);
>> +	kfree(wb);
>> +}
>> +
>>   static void cgwb_release_workfn(struct work_struct *work)
>>   {
>>   	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
>> @@ -529,11 +538,10 @@ static void cgwb_release_workfn(struct work_struct *work)
>>   	list_del(&wb->offline_node);
>>   	spin_unlock_irq(&cgwb_lock);
>>   
>> -	percpu_ref_exit(&wb->refcnt);
>>   	wb_exit(wb);
>>   	bdi_put(bdi);
>>   	WARN_ON_ONCE(!list_empty(&wb->b_attached));
>> -	kfree_rcu(wb, rcu);
>> +	call_rcu(&wb->rcu, cgwb_free_rcu);
>>   }
>>   
>>   static void cgwb_release(struct percpu_ref *refcnt)


Thanks!
-- 
With Best Regards,
Baokun Li
.