Message-ID: <CAHj4cs8998+fwUxGLuMNpthxWN6SdCe=5=W5cvpSBYosvOCF9A@mail.gmail.com>
Date: Tue, 16 Sep 2025 22:29:09 +0800
From: Yi Zhang <yi.zhang@...hat.com>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: tj@...nel.org, josef@...icpanda.com, axboe@...nel.dk, liangjie@...iang.com,
hanguangjiang@...iang.com, cgroups@...r.kernel.org,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, johnny.chenyi@...wei.com,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] blk-throttle: fix throtl_data leak during disk release
On Tue, Sep 16, 2025 at 5:05 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>
> Hi,
>
> > On 2025/09/16 16:15, Yi Zhang wrote:
> > > Hi Yu,
> > > A new issue was triggered with this change. Please check it. Thanks.
> >
> > [ 285.804104] run blktests throtl/001 at 2025-09-16 04:11:12
> > [ 286.161894] null_blk: disk dev_nullb created
> >
> > [ 293.388583] ======================================================
> > [ 293.394762] WARNING: possible circular locking dependency detected
> > [ 293.400940] 6.17.0-rc6.v1+ #2 Not tainted
> > [ 293.404952] ------------------------------------------------------
> > [ 293.411131] find/1609 is trying to acquire lock:
> > [ 293.415751] ffff8882911b50b0 ((&sq->pending_timer)){+.-.}-{0:0}, at: __timer_delete_sync+0x23/0x120
> > [ 293.424817]
> > but task is already holding lock:
> > [ 293.430648] ffff8882b7794948 (&blkcg->lock){....}-{3:3}, at: blkcg_deactivate_policy+0x1e7/0x4e0
> > [ 293.439445]
> > which lock already depends on the new lock.
> >
> > [ 293.447619]
> > the existing dependency chain (in reverse order) is:
> > [ 293.455096]
> > -> #2 (&blkcg->lock){....}-{3:3}:
> > [ 293.460948] __lock_acquire+0x57c/0xbd0
> > [ 293.465315] lock_acquire.part.0+0xbd/0x260
> > [ 293.470020] _raw_spin_lock+0x37/0x80
> > [ 293.474214] blkg_create+0x3e2/0x1060
> > [ 293.478401] blkcg_init_disk+0x8f/0x460
> > [ 293.482769] __alloc_disk_node+0x27f/0x600
> > [ 293.487397] __blk_mq_alloc_disk+0x5f/0xd0
> > [ 293.492025] nvme_alloc_ns+0x202/0x17a0 [nvme_core]
> > [ 293.497458] nvme_scan_ns+0x30b/0x380 [nvme_core]
> > [ 293.502709] async_run_entry_fn+0x9a/0x4f0
> > [ 293.507330] process_one_work+0xd8b/0x1320
> > [ 293.511956] worker_thread+0x5f3/0xfe0
> > [ 293.516231] kthread+0x3b4/0x770
> > [ 293.519992] ret_from_fork+0x393/0x480
> > [ 293.524272] ret_from_fork_asm+0x1a/0x30
> > [ 293.528728]
> > -> #1 (&q->queue_lock){..-.}-{3:3}:
> > [ 293.534749] __lock_acquire+0x57c/0xbd0
> > [ 293.539108] lock_acquire.part.0+0xbd/0x260
> > [ 293.543814] _raw_spin_lock_irq+0x3f/0x90
> > [ 293.548348] throtl_pending_timer_fn+0x11c/0x15b0
> > [ 293.553581] call_timer_fn+0x19c/0x3e0
> > [ 293.557853] __run_timers+0x627/0x9f0
> > [ 293.562041] run_timer_base+0xe6/0x140
> > [ 293.566312] run_timer_softirq+0x1a/0x30
> > [ 293.570758] handle_softirqs+0x1fd/0x890
> > [ 293.575205] __irq_exit_rcu+0xfd/0x250
> > [ 293.579477] irq_exit_rcu+0xe/0x30
> > [ 293.583402] sysvec_apic_timer_interrupt+0xa1/0xd0
> > [ 293.588717] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> > [ 293.594383] cpuidle_enter_state+0xf5/0x2f0
> > [ 293.599090] cpuidle_enter+0x4e/0xa0
> > [ 293.603197] cpuidle_idle_call+0x213/0x370
> > [ 293.607816] do_idle+0x131/0x200
> > [ 293.611568] cpu_startup_entry+0x54/0x60
> > [ 293.616017] start_secondary+0x20d/0x290
> > [ 293.620471] common_startup_64+0x13e/0x141
> > [ 293.625096]
> > -> #0 ((&sq->pending_timer)){+.-.}-{0:0}:
> > [ 293.631642] check_prev_add+0xf1/0xcd0
> > [ 293.635921] validate_chain+0x487/0x570
> > [ 293.640281] __lock_acquire+0x57c/0xbd0
> > [ 293.644641] lock_acquire.part.0+0xbd/0x260
> > [ 293.649345] __timer_delete_sync+0x40/0x120
> > [ 293.654052] throtl_pd_free+0x19/0x40
> > [ 293.658238] blkcg_deactivate_policy+0x2c9/0x4e0
> > [ 293.663378] blk_throtl_exit+0xa5/0x100
> > [ 293.667743] blkcg_exit_disk+0x1f/0x270
> > [ 293.672104] disk_release+0x11b/0x3a0
> > [ 293.676299] device_release+0x9f/0x210
> > [ 293.680579] kobject_cleanup+0x105/0x360
> > [ 293.685027] null_del_dev.part.0+0x1e5/0x480 [null_blk]
> > [ 293.690788] nullb_group_drop_item+0xa5/0xd0 [null_blk]
> > [ 293.696544] configfs_rmdir+0x69f/0xac0
> > [ 293.700910] vfs_rmdir+0x1a5/0x5b0
> > [ 293.704836] do_rmdir+0x276/0x330
> > [ 293.708677] __x64_sys_unlinkat+0x16b/0x1e0
> > [ 293.713393] do_syscall_64+0x94/0x8d0
> > [ 293.717584] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 293.723160]
> > other info that might help us debug this:
> >
> > [ 293.731159] Chain exists of:
> > (&sq->pending_timer) --> &q->queue_lock --> &blkcg->lock
> >
> > [ 293.742045] Possible unsafe locking scenario:
> >
> > [ 293.747972]        CPU0                    CPU1
> > [ 293.752511]        ----                    ----
> > [ 293.757043]   lock(&blkcg->lock);
> > [ 293.760371]                                lock(&q->queue_lock);
> > [ 293.766387]                                lock(&blkcg->lock);
> > [ 293.772226]   lock((&sq->pending_timer));
> > [ 293.776248]
>
> This is a true deadlock; however, I think the real problem here is that
> pd_free_fn() can be called with queue_lock held, while blk-throttle is
> also using this queue_lock for its own protection. This problem should
> already exist even before this patch: for example, in the
> blkcg_activate_policy() error path, pd_free_fn() is called with
> queue_lock held as well.
>
> We could fix this by moving all the pd_free_fn() calls outside of
> queue_lock; however, I think we should fix this in blk-throttle
> instead, by using an internal lock rather than reusing queue_lock.
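>
> To make the cycle concrete, here is a minimal userspace reduction of
> the report (a hedged sketch only: pthread mutexes stand in for the
> kernel spinlocks, pthread_join() stands in for timer_delete_sync(),
> and the program hangs by design; none of these names are the kernel
> API):
>
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
> static pthread_mutex_t blkcg_lock = PTHREAD_MUTEX_INITIALIZER;
>
> /* Stands in for throtl_pending_timer_fn(), which takes queue_lock. */
> static void *timer_fn(void *arg)
> {
> 	(void)arg;
> 	pthread_mutex_lock(&queue_lock);
> 	/* ... dispatch throttled bios ... */
> 	pthread_mutex_unlock(&queue_lock);
> 	return NULL;
> }
>
> int main(void)
> {
> 	pthread_t timer;
>
> 	/* Stands in for blkcg_deactivate_policy(): queue_lock is taken
> 	 * first, then blkcg->lock, and both are still held below. */
> 	pthread_mutex_lock(&queue_lock);
> 	pthread_mutex_lock(&blkcg_lock);
>
> 	pthread_create(&timer, NULL, timer_fn, NULL);
> 	sleep(1);	/* let timer_fn() block on queue_lock */
>
> 	/* Stands in for timer_delete_sync() from throtl_pd_free():
> 	 * synchronously waiting for a handler that itself needs a lock
> 	 * we hold -- the circular wait lockdep flagged. */
> 	pthread_join(timer, NULL);	/* never returns */
>
> 	pthread_mutex_unlock(&blkcg_lock);
> 	pthread_mutex_unlock(&queue_lock);
> 	printf("not reached\n");
> 	return 0;
> }
>
> Build with "gcc -pthread"; the hang on pthread_join() is the point.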
>
> Yi, although I already tested it, can you give the following patch a
> test on top of this patch as well?
The issue cannot be reproduced now.
Tested-by: Yi Zhang <yi.zhang@...hat.com>
>
> Thanks,
> Kuai
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index f510ae072868..aaf0aa7756bf 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -33,6 +33,8 @@ static struct workqueue_struct *kthrotld_workqueue;
>
> struct throtl_data
> {
> + spinlock_t lock;
> +
> /* service tree for active throtl groups */
> struct throtl_service_queue service_queue;
>
> @@ -1140,7 +1142,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
> else
> q = td->queue;
>
> - spin_lock_irq(&q->queue_lock);
> + spin_lock_irq(&td->lock);
>
> if (!q->root_blkg)
> goto out_unlock;
> @@ -1166,9 +1168,9 @@ static void throtl_pending_timer_fn(struct timer_list *t)
> break;
>
> /* this dispatch windows is still open, relax and repeat */
> - spin_unlock_irq(&q->queue_lock);
> + spin_unlock_irq(&td->lock);
> cpu_relax();
> - spin_lock_irq(&q->queue_lock);
> + spin_lock_irq(&td->lock);
> }
>
> if (!dispatched)
> @@ -1191,7 +1193,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
> queue_work(kthrotld_workqueue, &td->dispatch_work);
> }
> out_unlock:
> - spin_unlock_irq(&q->queue_lock);
> + spin_unlock_irq(&td->lock);
> }
>
> /**
> @@ -1207,7 +1209,6 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
> struct throtl_data *td = container_of(work, struct throtl_data,
>                                               dispatch_work);
> struct throtl_service_queue *td_sq = &td->service_queue;
> - struct request_queue *q = td->queue;
> struct bio_list bio_list_on_stack;
> struct bio *bio;
> struct blk_plug plug;
> @@ -1215,11 +1216,11 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
>
> bio_list_init(&bio_list_on_stack);
>
> - spin_lock_irq(&q->queue_lock);
> + spin_lock_irq(&td->lock);
> for (rw = READ; rw <= WRITE; rw++)
> while ((bio = throtl_pop_queued(td_sq, NULL, rw)))
> bio_list_add(&bio_list_on_stack, bio);
> - spin_unlock_irq(&q->queue_lock);
> + spin_unlock_irq(&td->lock);
>
> if (!bio_list_empty(&bio_list_on_stack)) {
> blk_start_plug(&plug);
> @@ -1324,6 +1325,7 @@ static int blk_throtl_init(struct gendisk *disk)
> if (!td)
> return -ENOMEM;
>
> + spin_lock_init(&td->lock);
> INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
> throtl_service_queue_init(&td->service_queue);
>
> @@ -1694,7 +1696,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
> if (!blk_throtl_activated(q))
> return;
>
> - spin_lock_irq(&q->queue_lock);
> + spin_lock_irq(&q->td->lock);
> /*
> * queue_lock is held, rcu lock is not needed here technically.
> * However, rcu lock is still held to emphasize that following
> @@ -1713,7 +1715,7 @@ void blk_throtl_cancel_bios(struct gendisk *disk)
> tg_flush_bios(blkg_to_tg(blkg));
> }
> rcu_read_unlock();
> - spin_unlock_irq(&q->queue_lock);
> + spin_unlock_irq(&q->td->lock);
> }
>
> static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
> @@ -1746,7 +1748,6 @@ static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
>
> bool __blk_throtl_bio(struct bio *bio)
> {
> - struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> struct blkcg_gq *blkg = bio->bi_blkg;
> struct throtl_qnode *qn = NULL;
> struct throtl_grp *tg = blkg_to_tg(blkg);
> @@ -1756,7 +1757,7 @@ bool __blk_throtl_bio(struct bio *bio)
> struct throtl_data *td = tg->td;
>
> rcu_read_lock();
> - spin_lock_irq(&q->queue_lock);
> + spin_lock_irq(&td->lock);
> sq = &tg->service_queue;
>
> while (true) {
> @@ -1832,7 +1833,7 @@ bool __blk_throtl_bio(struct bio *bio)
> }
>
> out_unlock:
> - spin_unlock_irq(&q->queue_lock);
> + spin_unlock_irq(&td->lock);
>
> rcu_read_unlock();
> return throttled;
>
> > *** DEADLOCK ***
> >
> > [ 293.782166] 8 locks held by find/1609:
> > [ 293.785921] #0: ffff88813ddf6448 (sb_writers#16){.+.+}-{0:0}, at: do_rmdir+0x1a8/0x330
> > [ 293.793945] #1: ffff88829e48a108 (&default_group_class[depth - 1]/1){+.+.}-{4:4}, at: do_rmdir+0x1ec/0x330
> > [ 293.803704] #2: ffff8881f918cb48 (&sb->s_type->i_mutex_key#22){+.+.}-{4:4}, at: vfs_rmdir+0xc0/0x5b0
> > [ 293.812943] #3: ffffffffc1cc4698 (&nullb_subsys.su_mutex){+.+.}-{4:4}, at: configfs_rmdir+0x57b/0xac0
> > [ 293.822267] #4: ffffffffc1ccc130 (&lock){+.+.}-{4:4}, at: nullb_group_drop_item+0x50/0xd0 [null_blk]
> > [ 293.831516] #5: ffff88829ddb9980 (&q->blkcg_mutex){+.+.}-{4:4}, at: blkcg_deactivate_policy+0xf6/0x4e0
> > [ 293.840926] #6: ffff88829ddb9650 (&q->queue_lock){..-.}-{3:3}, at: blkcg_deactivate_policy+0x10a/0x4e0
> > [ 293.850339] #7: ffff8882b7794948 (&blkcg->lock){....}-{3:3}, at: blkcg_deactivate_policy+0x1e7/0x4e0
> > [ 293.859577]
> > stack backtrace:
> > [ 293.863939] CPU: 11 UID: 0 PID: 1609 Comm: find Kdump: loaded Not tainted 6.17.0-rc6.v1+ #2 PREEMPT(voluntary)
> > [ 293.863946] Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.17.0 12/04/2024
> > [ 293.863949] Call Trace:
> > [ 293.863953] <TASK>
> > [ 293.863959] dump_stack_lvl+0x6f/0xb0
> > [ 293.863970] print_circular_bug.cold+0x38/0x45
> > [ 293.863981] check_noncircular+0x148/0x160
> > [ 293.863997] check_prev_add+0xf1/0xcd0
> > [ 293.864001] ? alloc_chain_hlocks+0x13e/0x1d0
> > [ 293.864007] ? srso_return_thunk+0x5/0x5f
> > [ 293.864013] ? add_chain_cache+0x12c/0x310
> > [ 293.864022] validate_chain+0x487/0x570
> > [ 293.864027] ? srso_return_thunk+0x5/0x5f
> > [ 293.864037] __lock_acquire+0x57c/0xbd0
> > [ 293.864043] ? srso_return_thunk+0x5/0x5f
> > [ 293.864052] lock_acquire.part.0+0xbd/0x260
> > [ 293.864057] ? __timer_delete_sync+0x23/0x120
> > [ 293.864066] ? srso_return_thunk+0x5/0x5f
> > [ 293.864071] ? rcu_is_watching+0x15/0xb0
> > [ 293.864076] ? blkcg_deactivate_policy+0x1e7/0x4e0
> > [ 293.864080] ? srso_return_thunk+0x5/0x5f
> > [ 293.864085] ? lock_acquire+0x10b/0x150
> > [ 293.864092] ? __timer_delete_sync+0x23/0x120
> > [ 293.864098] __timer_delete_sync+0x40/0x120
> > [ 293.864103] ? __timer_delete_sync+0x23/0x120
> > [ 293.864111] throtl_pd_free+0x19/0x40
> > [ 293.864117] blkcg_deactivate_policy+0x2c9/0x4e0
> > [ 293.864132] blk_throtl_exit+0xa5/0x100
> > [ 293.864140] blkcg_exit_disk+0x1f/0x270
> > [ 293.864150] disk_release+0x11b/0x3a0
> > [ 293.864157] device_release+0x9f/0x210
> > [ 293.864164] kobject_cleanup+0x105/0x360
> > [ 293.864172] null_del_dev.part.0+0x1e5/0x480 [null_blk]
> > [ 293.864189] nullb_group_drop_item+0xa5/0xd0 [null_blk]
> > [ 293.864202] configfs_rmdir+0x69f/0xac0
> > [ 293.864210] ? __pfx_may_link+0x10/0x10
> > [ 293.864221] ? __pfx_configfs_rmdir+0x10/0x10
> > [ 293.864235] vfs_rmdir+0x1a5/0x5b0
> > [ 293.864244] do_rmdir+0x276/0x330
> > [ 293.864252] ? __pfx_do_rmdir+0x10/0x10
> > [ 293.864262] ? ktime_get_coarse_real_ts64+0x121/0x180
> > [ 293.864271] ? srso_return_thunk+0x5/0x5f
> > [ 293.864276] ? getname_flags.part.0+0xf8/0x480
> > [ 293.864285] __x64_sys_unlinkat+0x16b/0x1e0
> > [ 293.864293] do_syscall_64+0x94/0x8d0
> > [ 293.864298] ? srso_return_thunk+0x5/0x5f
> > [ 293.864304] ? srso_return_thunk+0x5/0x5f
> > [ 293.864309] ? fput_close_sync+0x156/0x1c0
> > [ 293.864316] ? __pfx_fput_close_sync+0x10/0x10
> > [ 293.864326] ? srso_return_thunk+0x5/0x5f
> > [ 293.864333] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 293.864337] ? srso_return_thunk+0x5/0x5f
> > [ 293.864342] ? lockdep_hardirqs_on+0x78/0x100
> > [ 293.864347] ? srso_return_thunk+0x5/0x5f
> > [ 293.864351] ? do_syscall_64+0x139/0x8d0
> > [ 293.864355] ? srso_return_thunk+0x5/0x5f
> > [ 293.864362] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 293.864366] ? srso_return_thunk+0x5/0x5f
> > [ 293.864371] ? lockdep_hardirqs_on+0x78/0x100
> > [ 293.864376] ? srso_return_thunk+0x5/0x5f
> > [ 293.864381] ? do_syscall_64+0x139/0x8d0
> > [ 293.864385] ? srso_return_thunk+0x5/0x5f
> > [ 293.864393] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 293.864398] RIP: 0033:0x7f13bc09bb9b
> > [ 293.864404] Code: 77 05 c3 0f 1f 40 00 48 8b 15 71 52 0d 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 52 0d 00 f7 d8 64 89 01 48
> > [ 293.864408] RSP: 002b:00007ffee8a880e8 EFLAGS: 00000216 ORIG_RAX: 0000000000000107
> > [ 293.864414] RAX: ffffffffffffffda RBX: 00005593f2ad7fb0 RCX: 00007f13bc09bb9b
> > [ 293.864417] RDX: 0000000000000200 RSI: 00005593f2ad7fb0 RDI: 0000000000000005
> > [ 293.864421] RBP: 0000000000000200 R08: 00000000ffffffff R09: 00005593f2acfd50
> > [ 293.864424] R10: 00005593f2ac8010 R11: 0000000000000216 R12: 00005593f2ace330
> > [ 293.864427] R13: 0000000000000001 R14: 0000000000000005 R15: 00007ffee8a890bb
> > [ 293.864445] </TASK>
> >
> > On Tue, Sep 16, 2025 at 9:20 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@...wei.com>
> >>
> >> Tightening the throttle activation check in blk_throtl_activated() to
> >> require both that q->td is present and that the policy bit is set
> >> introduced a memory leak during disk release:
> >>
> >> blkg_destroy_all() clears the policy bit first during queue
> >> deactivation, causing the subsequent blk_throtl_exit() to skip
> >> throtl_data cleanup when blk_throtl_activated() fails the policy check.
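> >>
> >> (For context, the tightened check is roughly the following -- a
> >> paraphrased sketch, not the verbatim kernel source:
> >>
> >> static inline bool blk_throtl_activated(struct request_queue *q)
> >> {
> >> 	return q->td != NULL &&
> >> 	       test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);
> >> }
> >>
> >> Once blkg_destroy_all() has cleared the policy bit, this returns
> >> false and blk_throtl_exit() bails out before freeing q->td.)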
> >>
> >> Fix this by reordering operations in disk_release() to call
> >> blk_throtl_exit() while the throttle policy is still active. We avoid
> >> modifying the activation check in blk_throtl_exit() because it is
> >> intuitive that blk-throttle starts in blk_throtl_init() and ends in
> >> blk_throtl_exit().
> >>
> >> Fixes: bd9fd5be6bc0 ("blk-throttle: fix access race during throttle policy activation")
> >> Reported-by: Yi Zhang <yi.zhang@...hat.com>
> >> Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/
> >> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
> >> ---
> >> block/blk-cgroup.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >> index 7246fc256315..64a56c8697f9 100644
> >> --- a/block/blk-cgroup.c
> >> +++ b/block/blk-cgroup.c
> >> @@ -1539,8 +1539,8 @@ int blkcg_init_disk(struct gendisk *disk)
> >>
> >> void blkcg_exit_disk(struct gendisk *disk)
> >> {
> >> - blkg_destroy_all(disk);
> >> blk_throtl_exit(disk);
> >> + blkg_destroy_all(disk);
> >> }
> >>
> >> static void blkcg_exit(struct task_struct *tsk)
> >> --
> >> 2.39.2
> >>
> >>
> >
> >
>
--
Best Regards,
Yi Zhang