Message-ID: <710840cb-4e94-5914-3b88-f3057f3b88d8@huaweicloud.com>
Date: Wed, 17 Sep 2025 15:38:15 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yi Zhang <yi.zhang@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: tj@...nel.org, josef@...icpanda.com, axboe@...nel.dk,
liangjie@...iang.com, hanguangjiang@...iang.com, cgroups@...r.kernel.org,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, johnny.chenyi@...wei.com,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] blk-throttle: fix throtl_data leak during disk release
Hi,
On 2025/09/16 22:29, Yi Zhang wrote:
> On Tue, Sep 16, 2025 at 5:05 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>
>> Hi,
>>
>> On 2025/09/16 16:15, Yi Zhang wrote:
>>> Hi Yu
>>> A new issue was triggered with the change. Please check it. Thanks.
>>>
>>> [ 285.804104] run blktests throtl/001 at 2025-09-16 04:11:12
>>> [ 286.161894] null_blk: disk dev_nullb created
>>>
>>> [ 293.388583] ======================================================
>>> [ 293.394762] WARNING: possible circular locking dependency detected
>>> [ 293.400940] 6.17.0-rc6.v1+ #2 Not tainted
>>> [ 293.404952] ------------------------------------------------------
>>> [ 293.411131] find/1609 is trying to acquire lock:
>>> [ 293.415751] ffff8882911b50b0 ((&sq->pending_timer)){+.-.}-{0:0},
>>> at: __timer_delete_sync+0x23/0x120
>>> [ 293.424817]
>>> but task is already holding lock:
>>> [ 293.430648] ffff8882b7794948 (&blkcg->lock){....}-{3:3}, at:
>>> blkcg_deactivate_policy+0x1e7/0x4e0
>>> [ 293.439445]
>>> which lock already depends on the new lock.
>>>
>>> [ 293.447619]
>>> the existing dependency chain (in reverse order) is:
>>> [ 293.455096]
>>> -> #2 (&blkcg->lock){....}-{3:3}:
>>> [ 293.460948] __lock_acquire+0x57c/0xbd0
>>> [ 293.465315] lock_acquire.part.0+0xbd/0x260
>>> [ 293.470020] _raw_spin_lock+0x37/0x80
>>> [ 293.474214] blkg_create+0x3e2/0x1060
>>> [ 293.478401] blkcg_init_disk+0x8f/0x460
>>> [ 293.482769] __alloc_disk_node+0x27f/0x600
>>> [ 293.487397] __blk_mq_alloc_disk+0x5f/0xd0
>>> [ 293.492025] nvme_alloc_ns+0x202/0x17a0 [nvme_core]
>>> [ 293.497458] nvme_scan_ns+0x30b/0x380 [nvme_core]
>>> [ 293.502709] async_run_entry_fn+0x9a/0x4f0
>>> [ 293.507330] process_one_work+0xd8b/0x1320
>>> [ 293.511956] worker_thread+0x5f3/0xfe0
>>> [ 293.516231] kthread+0x3b4/0x770
>>> [ 293.519992] ret_from_fork+0x393/0x480
>>> [ 293.524272] ret_from_fork_asm+0x1a/0x30
>>> [ 293.528728]
>>> -> #1 (&q->queue_lock){..-.}-{3:3}:
>>> [ 293.534749] __lock_acquire+0x57c/0xbd0
>>> [ 293.539108] lock_acquire.part.0+0xbd/0x260
>>> [ 293.543814] _raw_spin_lock_irq+0x3f/0x90
>>> [ 293.548348] throtl_pending_timer_fn+0x11c/0x15b0
>>> [ 293.553581] call_timer_fn+0x19c/0x3e0
>>> [ 293.557853] __run_timers+0x627/0x9f0
>>> [ 293.562041] run_timer_base+0xe6/0x140
>>> [ 293.566312] run_timer_softirq+0x1a/0x30
>>> [ 293.570758] handle_softirqs+0x1fd/0x890
>>> [ 293.575205] __irq_exit_rcu+0xfd/0x250
>>> [ 293.579477] irq_exit_rcu+0xe/0x30
>>> [ 293.583402] sysvec_apic_timer_interrupt+0xa1/0xd0
>>> [ 293.588717] asm_sysvec_apic_timer_interrupt+0x1a/0x20
>>> [ 293.594383] cpuidle_enter_state+0xf5/0x2f0
>>> [ 293.599090] cpuidle_enter+0x4e/0xa0
>>> [ 293.603197] cpuidle_idle_call+0x213/0x370
>>> [ 293.607816] do_idle+0x131/0x200
>>> [ 293.611568] cpu_startup_entry+0x54/0x60
>>> [ 293.616017] start_secondary+0x20d/0x290
>>> [ 293.620471] common_startup_64+0x13e/0x141
>>> [ 293.625096]
>>> -> #0 ((&sq->pending_timer)){+.-.}-{0:0}:
>>> [ 293.631642] check_prev_add+0xf1/0xcd0
>>> [ 293.635921] validate_chain+0x487/0x570
>>> [ 293.640281] __lock_acquire+0x57c/0xbd0
>>> [ 293.644641] lock_acquire.part.0+0xbd/0x260
>>> [ 293.649345] __timer_delete_sync+0x40/0x120
>>> [ 293.654052] throtl_pd_free+0x19/0x40
>>> [ 293.658238] blkcg_deactivate_policy+0x2c9/0x4e0
>>> [ 293.663378] blk_throtl_exit+0xa5/0x100
>>> [ 293.667743] blkcg_exit_disk+0x1f/0x270
>>> [ 293.672104] disk_release+0x11b/0x3a0
>>> [ 293.676299] device_release+0x9f/0x210
>>> [ 293.680579] kobject_cleanup+0x105/0x360
>>> [ 293.685027] null_del_dev.part.0+0x1e5/0x480 [null_blk]
>>> [ 293.690788] nullb_group_drop_item+0xa5/0xd0 [null_blk]
>>> [ 293.696544] configfs_rmdir+0x69f/0xac0
>>> [ 293.700910] vfs_rmdir+0x1a5/0x5b0
>>> [ 293.704836] do_rmdir+0x276/0x330
>>> [ 293.708677] __x64_sys_unlinkat+0x16b/0x1e0
>>> [ 293.713393] do_syscall_64+0x94/0x8d0
>>> [ 293.717584] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> [ 293.723160]
>>> other info that might help us debug this:
>>>
>>> [ 293.731159] Chain exists of:
>>> (&sq->pending_timer) --> &q->queue_lock --> &blkcg->lock
>>>
>>> [ 293.742045] Possible unsafe locking scenario:
>>>
>>> [ 293.747972] CPU0 CPU1
>>> [ 293.752511] ---- ----
>>> [ 293.757043] lock(&blkcg->lock);
>>> [ 293.760371] lock(&q->queue_lock);
>>> [ 293.766387] lock(&blkcg->lock);
>>> [ 293.772226] lock((&sq->pending_timer));
>>> [ 293.776248]
>>
>> This is a true deadlock. However, I think the real problem here is that
>> pd_free_fn() can be called with queue_lock held, while blk-throttle is
>> also protected by that same queue_lock. This problem should already exist
>> even before this patch; for example, in the blkcg_activate_policy() error
>> path, pd_free_fn() is called with queue_lock held as well.
>>
>> We could fix this by moving all pd_free_fn() calls outside of queue_lock;
>> however, I think we should fix this in blk-throttle instead, by using an
>> internal lock rather than reusing queue_lock.
>>
>> Yi, although I have already tested it, can you give the following patch a
>> test on top of this patch as well?
>
> The issue cannot be reproduced now.
>
> Tested-by: Yi Zhang <yi.zhang@...hat.com>
>
It turns out that not holding queue_lock from the timer is not enough:
queue_lock can also be grabbed from hardirq context, and the timer can be
interrupted, so a deadlock is still possible.
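
Roughly, the remaining window I have in mind looks like this (only a
sketch, assuming queue_lock can indeed be taken from a hardirq handler on
the CPU that is running the pending timer):

CPU0 (softirq: throtl_pending_timer_fn)    CPU1
----                                       ----
                                           blkcg_deactivate_policy()
spin_lock_irq(&td->lock)                     <takes q->queue_lock>
...                                          throtl_pd_free()
spin_unlock_irq(&td->lock)                     timer_delete_sync()
<hardirq before the callback returns>            <waits for CPU0's timer>
  <tries to take q->queue_lock>
  <spins; held by CPU1>

CPU1 can't drop queue_lock until the timer callback on CPU0 returns, and
that callback can't return because the hardirq that interrupted it is
spinning on queue_lock held by CPU1.
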
AFAIK, there is no easy long-term fix for this deadlock for now, so I'll
just send a simple fix for the memory leak problem and leave the deadlock
problem for later.
Thanks,
Kuai
>>
>> Thanks,
>> Kuai
>>
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index f510ae072868..aaf0aa7756bf 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -33,6 +33,8 @@ static struct workqueue_struct *kthrotld_workqueue;
>>
>> struct throtl_data
>> {
>> + spinlock_t lock;
>> +
>> /* service tree for active throtl groups */
>> struct throtl_service_queue service_queue;
>>
>> @@ -1140,7 +1142,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
>> else
>> q = td->queue;
>>
>> - spin_lock_irq(&q->queue_lock);
>> + spin_lock_irq(&td->lock);
>>
>> if (!q->root_blkg)
>> goto out_unlock;
>> @@ -1166,9 +1168,9 @@ static void throtl_pending_timer_fn(struct timer_list *t)
>> break;
>>
>> /* this dispatch windows is still open, relax and repeat */
>> - spin_unlock_irq(&q->queue_lock);
>> + spin_unlock_irq(&td->lock);
>> cpu_relax();
>> - spin_lock_irq(&q->queue_lock);
>> + spin_lock_irq(&td->lock);
>> }
>>
>> if (!dispatched)
>> @@ -1191,7 +1193,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
>> queue_work(kthrotld_workqueue, &td->dispatch_work);
>> }
>> out_unlock:
>> - spin_unlock_irq(&q->queue_lock);
>> + spin_unlock_irq(&td->lock);
>> }
>>
>> /**
>> @@ -1207,7 +1209,6 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>> struct throtl_data *td = container_of(work, struct throtl_data,
>> dispatch_work);
>> struct throtl_service_queue *td_sq = &td->service_queue;
>> - struct request_queue *q = td->queue;
>> struct bio_list bio_list_on_stack;
>> struct bio *bio;
>> struct blk_plug plug;
>> @@ -1215,11 +1216,11 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>>
>> bio_list_init(&bio_list_on_stack);
>>
>> - spin_lock_irq(&q->queue_lock);
>> + spin_lock_irq(&td->lock);
>> for (rw = READ; rw <= WRITE; rw++)
>> while ((bio = throtl_pop_queued(td_sq, NULL, rw)))
>> bio_list_add(&bio_list_on_stack, bio);
>> - spin_unlock_irq(&q->queue_lock);
>> + spin_unlock_irq(&td->lock);
>>
>> if (!bio_list_empty(&bio_list_on_stack)) {
>> blk_start_plug(&plug);
>> @@ -1324,6 +1325,7 @@ static int blk_throtl_init(struct gendisk *disk)
>> if (!td)
>> return -ENOMEM;
>>
>> + spin_lock_init(&td->lock);
>> INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
>> throtl_service_queue_init(&td->service_queue);
>>
>> @@ -1694,7 +1696,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
>> if (!blk_throtl_activated(q))
>> return;
>>
>> - spin_lock_irq(&q->queue_lock);
>> + spin_lock_irq(&q->td->lock);
>> /*
>> * queue_lock is held, rcu lock is not needed here technically.
>> * However, rcu lock is still held to emphasize that following
>> @@ -1713,7 +1715,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
>> tg_flush_bios(blkg_to_tg(blkg));
>> }
>> rcu_read_unlock();
>> - spin_unlock_irq(&q->queue_lock);
>> + spin_unlock_irq(&q->td->lock);
>> }
>>
>> static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
>> @@ -1746,7 +1748,6 @@ static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
>>
>> bool __blk_throtl_bio(struct bio *bio)
>> {
>> - struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> struct blkcg_gq *blkg = bio->bi_blkg;
>> struct throtl_qnode *qn = NULL;
>> struct throtl_grp *tg = blkg_to_tg(blkg);
>> @@ -1756,7 +1757,7 @@ bool __blk_throtl_bio(struct bio *bio)
>> struct throtl_data *td = tg->td;
>>
>> rcu_read_lock();
>> - spin_lock_irq(&q->queue_lock);
>> + spin_lock_irq(&td->lock);
>> sq = &tg->service_queue;
>>
>> while (true) {
>> @@ -1832,7 +1833,7 @@ bool __blk_throtl_bio(struct bio *bio)
>> }
>>
>> out_unlock:
>> - spin_unlock_irq(&q->queue_lock);
>> + spin_unlock_irq(&td->lock);
>>
>> rcu_read_unlock();
>> return throttled;
>>
>>> *** DEADLOCK ***
>>>
>>> [ 293.782166] 8 locks held by find/1609:
>>> [ 293.785921] #0: ffff88813ddf6448 (sb_writers#16){.+.+}-{0:0}, at:
>>> do_rmdir+0x1a8/0x330
>>> [ 293.793945] #1: ffff88829e48a108 (&default_group_class[depth -
>>> 1]/1){+.+.}-{4:4}, at: do_rmdir+0x1ec/0x330
>>> [ 293.803704] #2: ffff8881f918cb48
>>> (&sb->s_type->i_mutex_key#22){+.+.}-{4:4}, at: vfs_rmdir+0xc0/0x5b0
>>> [ 293.812943] #3: ffffffffc1cc4698
>>> (&nullb_subsys.su_mutex){+.+.}-{4:4}, at: configfs_rmdir+0x57b/0xac0
>>> [ 293.822267] #4: ffffffffc1ccc130 (&lock){+.+.}-{4:4}, at:
>>> nullb_group_drop_item+0x50/0xd0 [null_blk]
>>> [ 293.831516] #5: ffff88829ddb9980 (&q->blkcg_mutex){+.+.}-{4:4},
>>> at: blkcg_deactivate_policy+0xf6/0x4e0
>>> [ 293.840926] #6: ffff88829ddb9650 (&q->queue_lock){..-.}-{3:3}, at:
>>> blkcg_deactivate_policy+0x10a/0x4e0
>>> [ 293.850339] #7: ffff8882b7794948 (&blkcg->lock){....}-{3:3}, at:
>>> blkcg_deactivate_policy+0x1e7/0x4e0
>>> [ 293.859577]
>>> stack backtrace:
>>> [ 293.863939] CPU: 11 UID: 0 PID: 1609 Comm: find Kdump: loaded Not
>>> tainted 6.17.0-rc6.v1+ #2 PREEMPT(voluntary)
>>> [ 293.863946] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS
>>> 2.17.0 12/04/2024
>>> [ 293.863949] Call Trace:
>>> [ 293.863953] <TASK>
>>> [ 293.863959] dump_stack_lvl+0x6f/0xb0
>>> [ 293.863970] print_circular_bug.cold+0x38/0x45
>>> [ 293.863981] check_noncircular+0x148/0x160
>>> [ 293.863997] check_prev_add+0xf1/0xcd0
>>> [ 293.864001] ? alloc_chain_hlocks+0x13e/0x1d0
>>> [ 293.864007] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864013] ? add_chain_cache+0x12c/0x310
>>> [ 293.864022] validate_chain+0x487/0x570
>>> [ 293.864027] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864037] __lock_acquire+0x57c/0xbd0
>>> [ 293.864043] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864052] lock_acquire.part.0+0xbd/0x260
>>> [ 293.864057] ? __timer_delete_sync+0x23/0x120
>>> [ 293.864066] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864071] ? rcu_is_watching+0x15/0xb0
>>> [ 293.864076] ? blkcg_deactivate_policy+0x1e7/0x4e0
>>> [ 293.864080] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864085] ? lock_acquire+0x10b/0x150
>>> [ 293.864092] ? __timer_delete_sync+0x23/0x120
>>> [ 293.864098] __timer_delete_sync+0x40/0x120
>>> [ 293.864103] ? __timer_delete_sync+0x23/0x120
>>> [ 293.864111] throtl_pd_free+0x19/0x40
>>> [ 293.864117] blkcg_deactivate_policy+0x2c9/0x4e0
>>> [ 293.864132] blk_throtl_exit+0xa5/0x100
>>> [ 293.864140] blkcg_exit_disk+0x1f/0x270
>>> [ 293.864150] disk_release+0x11b/0x3a0
>>> [ 293.864157] device_release+0x9f/0x210
>>> [ 293.864164] kobject_cleanup+0x105/0x360
>>> [ 293.864172] null_del_dev.part.0+0x1e5/0x480 [null_blk]
>>> [ 293.864189] nullb_group_drop_item+0xa5/0xd0 [null_blk]
>>> [ 293.864202] configfs_rmdir+0x69f/0xac0
>>> [ 293.864210] ? __pfx_may_link+0x10/0x10
>>> [ 293.864221] ? __pfx_configfs_rmdir+0x10/0x10
>>> [ 293.864235] vfs_rmdir+0x1a5/0x5b0
>>> [ 293.864244] do_rmdir+0x276/0x330
>>> [ 293.864252] ? __pfx_do_rmdir+0x10/0x10
>>> [ 293.864262] ? ktime_get_coarse_real_ts64+0x121/0x180
>>> [ 293.864271] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864276] ? getname_flags.part.0+0xf8/0x480
>>> [ 293.864285] __x64_sys_unlinkat+0x16b/0x1e0
>>> [ 293.864293] do_syscall_64+0x94/0x8d0
>>> [ 293.864298] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864304] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864309] ? fput_close_sync+0x156/0x1c0
>>> [ 293.864316] ? __pfx_fput_close_sync+0x10/0x10
>>> [ 293.864326] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864333] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> [ 293.864337] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864342] ? lockdep_hardirqs_on+0x78/0x100
>>> [ 293.864347] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864351] ? do_syscall_64+0x139/0x8d0
>>> [ 293.864355] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864362] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> [ 293.864366] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864371] ? lockdep_hardirqs_on+0x78/0x100
>>> [ 293.864376] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864381] ? do_syscall_64+0x139/0x8d0
>>> [ 293.864385] ? srso_return_thunk+0x5/0x5f
>>> [ 293.864393] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>> [ 293.864398] RIP: 0033:0x7f13bc09bb9b
>>> [ 293.864404] Code: 77 05 c3 0f 1f 40 00 48 8b 15 71 52 0d 00 f7 d8
>>> 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 07 01 00
>>> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 52 0d 00 f7 d8 64 89
>>> 01 48
>>> [ 293.864408] RSP: 002b:00007ffee8a880e8 EFLAGS: 00000216 ORIG_RAX:
>>> 0000000000000107
>>> [ 293.864414] RAX: ffffffffffffffda RBX: 00005593f2ad7fb0 RCX: 00007f13bc09bb9b
>>> [ 293.864417] RDX: 0000000000000200 RSI: 00005593f2ad7fb0 RDI: 0000000000000005
>>> [ 293.864421] RBP: 0000000000000200 R08: 00000000ffffffff R09: 00005593f2acfd50
>>> [ 293.864424] R10: 00005593f2ac8010 R11: 0000000000000216 R12: 00005593f2ace330
>>> [ 293.864427] R13: 0000000000000001 R14: 0000000000000005 R15: 00007ffee8a890bb
>>> [ 293.864445] </TASK>
>>>
>>> On Tue, Sep 16, 2025 at 9:20 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@...wei.com>
>>>>
>>>> Tightening the throttle activation check in blk_throtl_activated() to
>>>> require both q->td presence and policy bit set introduced a memory leak
>>>> during disk release:
>>>>
>>>> blkg_destroy_all() clears the policy bit first during queue deactivation,
>>>> causing the subsequent blk_throtl_exit() to skip the throtl_data cleanup
>>>> when the blk_throtl_activated() policy check fails.
>>>>
>>>> Fix this by reordering the operations in blkcg_exit_disk() so that
>>>> blk_throtl_exit() is called while the throttle policy is still active. We
>>>> avoid modifying the activation check in blk_throtl_exit() because it is
>>>> intuitive that blk-throttle starts in blk_throtl_init() and ends in
>>>> blk_throtl_exit().
>>>>
>>>> Fixes: bd9fd5be6bc0 ("blk-throttle: fix access race during throttle policy activation")
>>>> Reported-by: Yi Zhang <yi.zhang@...hat.com>
>>>> Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/
>>>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>>>> ---
>>>> block/blk-cgroup.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>>> index 7246fc256315..64a56c8697f9 100644
>>>> --- a/block/blk-cgroup.c
>>>> +++ b/block/blk-cgroup.c
>>>> @@ -1539,8 +1539,8 @@ int blkcg_init_disk(struct gendisk *disk)
>>>>
>>>> void blkcg_exit_disk(struct gendisk *disk)
>>>> {
>>>> - blkg_destroy_all(disk);
>>>> blk_throtl_exit(disk);
>>>> + blkg_destroy_all(disk);
>>>> }
>>>>
>>>> static void blkcg_exit(struct task_struct *tsk)
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>>
>>
>
>