Message-ID: <5c01287e-b7f8-6ca2-a08f-f6c66fff379c@huaweicloud.com>
Date: Tue, 16 Sep 2025 17:05:30 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yi Zhang <yi.zhang@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: tj@...nel.org, josef@...icpanda.com, axboe@...nel.dk,
 liangjie@...iang.com, hanguangjiang@...iang.com, cgroups@...r.kernel.org,
 linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 yi.zhang@...wei.com, yangerkun@...wei.com, johnny.chenyi@...wei.com,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] blk-throttle: fix throtl_data leak during disk release

Hi,

On 2025/09/16 16:15, Yi Zhang wrote:
> Hi Yu
> A new issue was triggered with the change. Please check it. Thanks.
> 
> [  285.804104] run blktests throtl/001 at 2025-09-16 04:11:12
> [  286.161894] null_blk: disk dev_nullb created
> 
> [  293.388583] ======================================================
> [  293.394762] WARNING: possible circular locking dependency detected
> [  293.400940] 6.17.0-rc6.v1+ #2 Not tainted
> [  293.404952] ------------------------------------------------------
> [  293.411131] find/1609 is trying to acquire lock:
> [  293.415751] ffff8882911b50b0 ((&sq->pending_timer)){+.-.}-{0:0},
> at: __timer_delete_sync+0x23/0x120
> [  293.424817]
>                 but task is already holding lock:
> [  293.430648] ffff8882b7794948 (&blkcg->lock){....}-{3:3}, at:
> blkcg_deactivate_policy+0x1e7/0x4e0
> [  293.439445]
>                 which lock already depends on the new lock.
> 
> [  293.447619]
>                 the existing dependency chain (in reverse order) is:
> [  293.455096]
>                 -> #2 (&blkcg->lock){....}-{3:3}:
> [  293.460948]        __lock_acquire+0x57c/0xbd0
> [  293.465315]        lock_acquire.part.0+0xbd/0x260
> [  293.470020]        _raw_spin_lock+0x37/0x80
> [  293.474214]        blkg_create+0x3e2/0x1060
> [  293.478401]        blkcg_init_disk+0x8f/0x460
> [  293.482769]        __alloc_disk_node+0x27f/0x600
> [  293.487397]        __blk_mq_alloc_disk+0x5f/0xd0
> [  293.492025]        nvme_alloc_ns+0x202/0x17a0 [nvme_core]
> [  293.497458]        nvme_scan_ns+0x30b/0x380 [nvme_core]
> [  293.502709]        async_run_entry_fn+0x9a/0x4f0
> [  293.507330]        process_one_work+0xd8b/0x1320
> [  293.511956]        worker_thread+0x5f3/0xfe0
> [  293.516231]        kthread+0x3b4/0x770
> [  293.519992]        ret_from_fork+0x393/0x480
> [  293.524272]        ret_from_fork_asm+0x1a/0x30
> [  293.528728]
>                 -> #1 (&q->queue_lock){..-.}-{3:3}:
> [  293.534749]        __lock_acquire+0x57c/0xbd0
> [  293.539108]        lock_acquire.part.0+0xbd/0x260
> [  293.543814]        _raw_spin_lock_irq+0x3f/0x90
> [  293.548348]        throtl_pending_timer_fn+0x11c/0x15b0
> [  293.553581]        call_timer_fn+0x19c/0x3e0
> [  293.557853]        __run_timers+0x627/0x9f0
> [  293.562041]        run_timer_base+0xe6/0x140
> [  293.566312]        run_timer_softirq+0x1a/0x30
> [  293.570758]        handle_softirqs+0x1fd/0x890
> [  293.575205]        __irq_exit_rcu+0xfd/0x250
> [  293.579477]        irq_exit_rcu+0xe/0x30
> [  293.583402]        sysvec_apic_timer_interrupt+0xa1/0xd0
> [  293.588717]        asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  293.594383]        cpuidle_enter_state+0xf5/0x2f0
> [  293.599090]        cpuidle_enter+0x4e/0xa0
> [  293.603197]        cpuidle_idle_call+0x213/0x370
> [  293.607816]        do_idle+0x131/0x200
> [  293.611568]        cpu_startup_entry+0x54/0x60
> [  293.616017]        start_secondary+0x20d/0x290
> [  293.620471]        common_startup_64+0x13e/0x141
> [  293.625096]
>                 -> #0 ((&sq->pending_timer)){+.-.}-{0:0}:
> [  293.631642]        check_prev_add+0xf1/0xcd0
> [  293.635921]        validate_chain+0x487/0x570
> [  293.640281]        __lock_acquire+0x57c/0xbd0
> [  293.644641]        lock_acquire.part.0+0xbd/0x260
> [  293.649345]        __timer_delete_sync+0x40/0x120
> [  293.654052]        throtl_pd_free+0x19/0x40
> [  293.658238]        blkcg_deactivate_policy+0x2c9/0x4e0
> [  293.663378]        blk_throtl_exit+0xa5/0x100
> [  293.667743]        blkcg_exit_disk+0x1f/0x270
> [  293.672104]        disk_release+0x11b/0x3a0
> [  293.676299]        device_release+0x9f/0x210
> [  293.680579]        kobject_cleanup+0x105/0x360
> [  293.685027]        null_del_dev.part.0+0x1e5/0x480 [null_blk]
> [  293.690788]        nullb_group_drop_item+0xa5/0xd0 [null_blk]
> [  293.696544]        configfs_rmdir+0x69f/0xac0
> [  293.700910]        vfs_rmdir+0x1a5/0x5b0
> [  293.704836]        do_rmdir+0x276/0x330
> [  293.708677]        __x64_sys_unlinkat+0x16b/0x1e0
> [  293.713393]        do_syscall_64+0x94/0x8d0
> [  293.717584]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  293.723160]
>                 other info that might help us debug this:
> 
> [  293.731159] Chain exists of:
>                   (&sq->pending_timer) --> &q->queue_lock --> &blkcg->lock
> 
> [  293.742045]  Possible unsafe locking scenario:
> 
> [  293.747972]        CPU0                    CPU1
> [  293.752511]        ----                    ----
> [  293.757043]   lock(&blkcg->lock);
> [  293.760371]                                lock(&q->queue_lock);
> [  293.766387]                                lock(&blkcg->lock);
> [  293.772226]   lock((&sq->pending_timer));
> [  293.776248]

This is a true deadlock. However, I think the real problem here is that
pd_free_fn() can be called with queue_lock held, while blk-throttle also
uses this same queue_lock internally for protection. This problem should
already exist even before this patch; for example, in the
blkcg_activate_policy() error path, pd_free_fn() is called with
queue_lock held as well.
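
To illustrate (a simplified sketch of the two paths involved, not the
exact call chains from the report above):

    /* Path A: disk release path, pd_free_fn() called under the lock */
    blkcg_deactivate_policy()
        spin_lock_irq(&q->queue_lock);
        ...
        throtl_pd_free()
            timer_delete_sync(&sq->pending_timer);  /* waits for timer */

    /* Path B: the pending timer handler itself */
    throtl_pending_timer_fn()
        spin_lock_irq(&q->queue_lock);  /* needs the lock Path A holds */

Path A can't make progress until the timer handler finishes, and the
timer handler can't finish until Path A drops queue_lock.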

We could fix this by moving all pd_free_fn() calls outside of
queue_lock; however, I think we should instead fix this in blk-throttle,
by using an internal lock rather than reusing queue_lock.
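
For reference, the first option would look roughly like the untested
sketch below (pd_free_list and the free_node member are illustrative
only, they don't exist today):

    /* in blkcg_deactivate_policy(), roughly: */
    LIST_HEAD(pd_free_list);

    spin_lock_irq(&q->queue_lock);
    list_for_each_entry(blkg, &q->blkg_list, q_node) {
        struct blkg_policy_data *pd = blkg->pd[pol->plid];

        if (!pd)
            continue;
        blkg->pd[pol->plid] = NULL;
        list_add(&pd->free_node, &pd_free_list);  /* defer the free */
    }
    spin_unlock_irq(&q->queue_lock);

    /* outside queue_lock, pd_free_fn() can del_timer_sync() safely */
    list_for_each_entry_safe(pd, next, &pd_free_list, free_node)
        pol->pd_free_fn(pd);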

Yi, although I already tested it, can you give the following patch a
test on top of this patch as well?

Thanks,
Kuai

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f510ae072868..aaf0aa7756bf 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -33,6 +33,8 @@ static struct workqueue_struct *kthrotld_workqueue;

  struct throtl_data
  {
+       spinlock_t lock;
+
         /* service tree for active throtl groups */
         struct throtl_service_queue service_queue;

@@ -1140,7 +1142,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
         else
                 q = td->queue;

-       spin_lock_irq(&q->queue_lock);
+       spin_lock_irq(&td->lock);

         if (!q->root_blkg)
                 goto out_unlock;
@@ -1166,9 +1168,9 @@ static void throtl_pending_timer_fn(struct timer_list *t)
                         break;

                 /* this dispatch windows is still open, relax and repeat */
-               spin_unlock_irq(&q->queue_lock);
+               spin_unlock_irq(&td->lock);
                 cpu_relax();
-               spin_lock_irq(&q->queue_lock);
+               spin_lock_irq(&td->lock);
         }

         if (!dispatched)
@@ -1191,7 +1193,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
                 queue_work(kthrotld_workqueue, &td->dispatch_work);
         }
  out_unlock:
-       spin_unlock_irq(&q->queue_lock);
+       spin_unlock_irq(&td->lock);
  }

  /**
@@ -1207,7 +1209,6 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
         struct throtl_data *td = container_of(work, struct throtl_data,
                                               dispatch_work);
         struct throtl_service_queue *td_sq = &td->service_queue;
-       struct request_queue *q = td->queue;
         struct bio_list bio_list_on_stack;
         struct bio *bio;
         struct blk_plug plug;
@@ -1215,11 +1216,11 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)

         bio_list_init(&bio_list_on_stack);

-       spin_lock_irq(&q->queue_lock);
+       spin_lock_irq(&td->lock);
         for (rw = READ; rw <= WRITE; rw++)
                 while ((bio = throtl_pop_queued(td_sq, NULL, rw)))
                         bio_list_add(&bio_list_on_stack, bio);
-       spin_unlock_irq(&q->queue_lock);
+       spin_unlock_irq(&td->lock);

         if (!bio_list_empty(&bio_list_on_stack)) {
                 blk_start_plug(&plug);
@@ -1324,6 +1325,7 @@ static int blk_throtl_init(struct gendisk *disk)
         if (!td)
                 return -ENOMEM;

+       spin_lock_init(&td->lock);
         INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
         throtl_service_queue_init(&td->service_queue);

@@ -1694,7 +1696,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
         if (!blk_throtl_activated(q))
                 return;

-       spin_lock_irq(&q->queue_lock);
+       spin_lock_irq(&q->td->lock);
         /*
          * queue_lock is held, rcu lock is not needed here technically.
          * However, rcu lock is still held to emphasize that following
@@ -1713,7 +1715,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
                 tg_flush_bios(blkg_to_tg(blkg));
         }
         rcu_read_unlock();
-       spin_unlock_irq(&q->queue_lock);
+       spin_unlock_irq(&q->td->lock);
  }

  static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
@@ -1746,7 +1748,6 @@ static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)

  bool __blk_throtl_bio(struct bio *bio)
  {
-       struct request_queue *q = bdev_get_queue(bio->bi_bdev);
         struct blkcg_gq *blkg = bio->bi_blkg;
         struct throtl_qnode *qn = NULL;
         struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1756,7 +1757,7 @@ bool __blk_throtl_bio(struct bio *bio)
         struct throtl_data *td = tg->td;

         rcu_read_lock();
-       spin_lock_irq(&q->queue_lock);
+       spin_lock_irq(&td->lock);
         sq = &tg->service_queue;

         while (true) {
@@ -1832,7 +1833,7 @@ bool __blk_throtl_bio(struct bio *bio)
         }

  out_unlock:
-       spin_unlock_irq(&q->queue_lock);
+       spin_unlock_irq(&td->lock);

         rcu_read_unlock();
         return throttled;

>                  *** DEADLOCK ***
> 
> [  293.782166] 8 locks held by find/1609:
> [  293.785921]  #0: ffff88813ddf6448 (sb_writers#16){.+.+}-{0:0}, at:
> do_rmdir+0x1a8/0x330
> [  293.793945]  #1: ffff88829e48a108 (&default_group_class[depth -
> 1]/1){+.+.}-{4:4}, at: do_rmdir+0x1ec/0x330
> [  293.803704]  #2: ffff8881f918cb48
> (&sb->s_type->i_mutex_key#22){+.+.}-{4:4}, at: vfs_rmdir+0xc0/0x5b0
> [  293.812943]  #3: ffffffffc1cc4698
> (&nullb_subsys.su_mutex){+.+.}-{4:4}, at: configfs_rmdir+0x57b/0xac0
> [  293.822267]  #4: ffffffffc1ccc130 (&lock){+.+.}-{4:4}, at:
> nullb_group_drop_item+0x50/0xd0 [null_blk]
> [  293.831516]  #5: ffff88829ddb9980 (&q->blkcg_mutex){+.+.}-{4:4},
> at: blkcg_deactivate_policy+0xf6/0x4e0
> [  293.840926]  #6: ffff88829ddb9650 (&q->queue_lock){..-.}-{3:3}, at:
> blkcg_deactivate_policy+0x10a/0x4e0
> [  293.850339]  #7: ffff8882b7794948 (&blkcg->lock){....}-{3:3}, at:
> blkcg_deactivate_policy+0x1e7/0x4e0
> [  293.859577]
>                 stack backtrace:
> [  293.863939] CPU: 11 UID: 0 PID: 1609 Comm: find Kdump: loaded Not
> tainted 6.17.0-rc6.v1+ #2 PREEMPT(voluntary)
> [  293.863946] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS
> 2.17.0 12/04/2024
> [  293.863949] Call Trace:
> [  293.863953]  <TASK>
> [  293.863959]  dump_stack_lvl+0x6f/0xb0
> [  293.863970]  print_circular_bug.cold+0x38/0x45
> [  293.863981]  check_noncircular+0x148/0x160
> [  293.863997]  check_prev_add+0xf1/0xcd0
> [  293.864001]  ? alloc_chain_hlocks+0x13e/0x1d0
> [  293.864007]  ? srso_return_thunk+0x5/0x5f
> [  293.864013]  ? add_chain_cache+0x12c/0x310
> [  293.864022]  validate_chain+0x487/0x570
> [  293.864027]  ? srso_return_thunk+0x5/0x5f
> [  293.864037]  __lock_acquire+0x57c/0xbd0
> [  293.864043]  ? srso_return_thunk+0x5/0x5f
> [  293.864052]  lock_acquire.part.0+0xbd/0x260
> [  293.864057]  ? __timer_delete_sync+0x23/0x120
> [  293.864066]  ? srso_return_thunk+0x5/0x5f
> [  293.864071]  ? rcu_is_watching+0x15/0xb0
> [  293.864076]  ? blkcg_deactivate_policy+0x1e7/0x4e0
> [  293.864080]  ? srso_return_thunk+0x5/0x5f
> [  293.864085]  ? lock_acquire+0x10b/0x150
> [  293.864092]  ? __timer_delete_sync+0x23/0x120
> [  293.864098]  __timer_delete_sync+0x40/0x120
> [  293.864103]  ? __timer_delete_sync+0x23/0x120
> [  293.864111]  throtl_pd_free+0x19/0x40
> [  293.864117]  blkcg_deactivate_policy+0x2c9/0x4e0
> [  293.864132]  blk_throtl_exit+0xa5/0x100
> [  293.864140]  blkcg_exit_disk+0x1f/0x270
> [  293.864150]  disk_release+0x11b/0x3a0
> [  293.864157]  device_release+0x9f/0x210
> [  293.864164]  kobject_cleanup+0x105/0x360
> [  293.864172]  null_del_dev.part.0+0x1e5/0x480 [null_blk]
> [  293.864189]  nullb_group_drop_item+0xa5/0xd0 [null_blk]
> [  293.864202]  configfs_rmdir+0x69f/0xac0
> [  293.864210]  ? __pfx_may_link+0x10/0x10
> [  293.864221]  ? __pfx_configfs_rmdir+0x10/0x10
> [  293.864235]  vfs_rmdir+0x1a5/0x5b0
> [  293.864244]  do_rmdir+0x276/0x330
> [  293.864252]  ? __pfx_do_rmdir+0x10/0x10
> [  293.864262]  ? ktime_get_coarse_real_ts64+0x121/0x180
> [  293.864271]  ? srso_return_thunk+0x5/0x5f
> [  293.864276]  ? getname_flags.part.0+0xf8/0x480
> [  293.864285]  __x64_sys_unlinkat+0x16b/0x1e0
> [  293.864293]  do_syscall_64+0x94/0x8d0
> [  293.864298]  ? srso_return_thunk+0x5/0x5f
> [  293.864304]  ? srso_return_thunk+0x5/0x5f
> [  293.864309]  ? fput_close_sync+0x156/0x1c0
> [  293.864316]  ? __pfx_fput_close_sync+0x10/0x10
> [  293.864326]  ? srso_return_thunk+0x5/0x5f
> [  293.864333]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  293.864337]  ? srso_return_thunk+0x5/0x5f
> [  293.864342]  ? lockdep_hardirqs_on+0x78/0x100
> [  293.864347]  ? srso_return_thunk+0x5/0x5f
> [  293.864351]  ? do_syscall_64+0x139/0x8d0
> [  293.864355]  ? srso_return_thunk+0x5/0x5f
> [  293.864362]  ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  293.864366]  ? srso_return_thunk+0x5/0x5f
> [  293.864371]  ? lockdep_hardirqs_on+0x78/0x100
> [  293.864376]  ? srso_return_thunk+0x5/0x5f
> [  293.864381]  ? do_syscall_64+0x139/0x8d0
> [  293.864385]  ? srso_return_thunk+0x5/0x5f
> [  293.864393]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  293.864398] RIP: 0033:0x7f13bc09bb9b
> [  293.864404] Code: 77 05 c3 0f 1f 40 00 48 8b 15 71 52 0d 00 f7 d8
> 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 07 01 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 52 0d 00 f7 d8 64 89
> 01 48
> [  293.864408] RSP: 002b:00007ffee8a880e8 EFLAGS: 00000216 ORIG_RAX:
> 0000000000000107
> [  293.864414] RAX: ffffffffffffffda RBX: 00005593f2ad7fb0 RCX: 00007f13bc09bb9b
> [  293.864417] RDX: 0000000000000200 RSI: 00005593f2ad7fb0 RDI: 0000000000000005
> [  293.864421] RBP: 0000000000000200 R08: 00000000ffffffff R09: 00005593f2acfd50
> [  293.864424] R10: 00005593f2ac8010 R11: 0000000000000216 R12: 00005593f2ace330
> [  293.864427] R13: 0000000000000001 R14: 0000000000000005 R15: 00007ffee8a890bb
> [  293.864445]  </TASK>
> 
> On Tue, Sep 16, 2025 at 9:20 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@...wei.com>
>>
>> Tightening the throttle activation check in blk_throtl_activated() to
>> require both the presence of q->td and the policy bit being set
>> introduced a memory leak during disk release:
>>
>> blkg_destroy_all() clears the policy bit first during queue deactivation,
>> causing the subsequent blk_throtl_exit() to skip throtl_data cleanup when
>> blk_throtl_activated() fails the policy check.
>>
>> Fix this by reordering operations in blkcg_exit_disk() to call
>> blk_throtl_exit() while the throttle policy is still active. We avoid
>> modifying the blk_throtl_exit() activation check because it's intuitive
>> that blk-throttle starts from blk_throtl_init() and ends in
>> blk_throtl_exit().
>>
>> Fixes: bd9fd5be6bc0 ("blk-throttle: fix access race during throttle policy activation")
>> Reported-by: Yi Zhang <yi.zhang@...hat.com>
>> Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/
>> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
>> ---
>>   block/blk-cgroup.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 7246fc256315..64a56c8697f9 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -1539,8 +1539,8 @@ int blkcg_init_disk(struct gendisk *disk)
>>
>>   void blkcg_exit_disk(struct gendisk *disk)
>>   {
>> -       blkg_destroy_all(disk);
>>          blk_throtl_exit(disk);
>> +       blkg_destroy_all(disk);
>>   }
>>
>>   static void blkcg_exit(struct task_struct *tsk)
>> --
>> 2.39.2
>>
>>
> 
> 

