linux-kernel - blk-mq: takes hours for scsi scanning finish when thousands of LUNs

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <a7028aac-e180-43eb-8099-cb0bb51b58d6@default>
Date:	Mon, 19 Oct 2015 07:40:13 -0700 (PDT)
From:	Zhangqing Luo <zhangqing.luo@...cle.com>
To:	<axboe@...nel.dk>, <tj@...nel.org>
Cc:	Guru Anbalagane <guru.anbalagane@...cle.com>,
	Feng Jin <joe.jin@...cle.com>, <linux-kernel@...r.kernel.org>
Subject: blk-mq: takes hours for scsi scanning finish when thousands of LUNs

Hi, Jens,Tejun

Current problem we meet:

When Multiple Queue is used for scsi, the period between each LUN probe
is increasing as the number of block request queue goes up, eventually 
it takes hours for Initiator to finish scanning thousands of LUNs.

Kernel version we're using: 4.1.6-14
I tested with MainLine, it also has same issue.

My analysis:

As for each new LUN, scsi_mq_alloc_queue is called when Multiple Queue
is enabled for scsi to allocate and initialize the new queue, the process is:
->blk_mq_init_queue
->blk_mq_init_allocated_queue
->blk_mq_add_queue_tag_set
->blk_mq_update_tag_set_depth

blk_mq_update_tag_set_depth calls blk_mq_freeze_queue for each 
queue in the set tag list, if the queue is already in the tag list, 
we can see that it takes tens of ticks to freeze the queue, if there are 
thousands of queues in the list, it will take more than ten seconds(HZ=1000) 
for a new queue to finish its initialization, this is really a high delay,

my question here is: why blk_mq_update_tag_set_depth updates
each queue in the tag list? 
if this thing must be done, as the code below shows just changing flags
depending on 'shared' variable why shouldn't we store the previous result of 'shared' 
and compare with the current result, if it's unchanged, nothing will be done
and avoid looping all queues in list.

static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set)
{
     struct blk_mq_hw_ctx *hctx;
     struct request_queue *q;
     bool shared;
     int i;

     if (set->tag_list.next == set->tag_list.prev)
         shared = false;
     else
         shared = true;

     list_for_each_entry(q, &set->tag_list, tag_set_list) {
         blk_mq_freeze_queue(q);

         queue_for_each_hw_ctx(q, hctx, i) {
             if (shared)
                 hctx->flags |= BLK_MQ_F_TAG_SHARED;
             else
                 hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
         }
         blk_mq_unfreeze_queue(q);
     }
}

Deep insight:
I dig more into the delay and find  more details, for each already existing
queue, the delay comes from  blk_mq_freeze_queue_wait function
because the request queue->mq_usage_counter is not zero, blocked on
it and woken up by percpu_ref_switch_to_atomic_rcu after a grace period, 
let me explain how this happens.

Although for each new allocating queue, PERCPU_REF_INIT_ATOMIC flag is set
when allocating the percpu ref, but when registering queue, 
blk_mq_finish_init changes it to PERCPU by calling percpu_ref_switch_to_percpu,

So every time blk_mq_freeze_queue_start, it runs in this way

blk_mq_freeze_queue_start
->percpu_ref_kill->percpu_ref_kill_and_confirm
->__percpu_ref_switch_to_atomic
->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu)

and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter
as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu
after a grace period


My question here is why should we change ref to PERCPU at blk_mq_finish_init?
because of this changing, delay appears.

My suggestion:

Don't change ref to PERCPU by blk_mq_finish_init, keeping
it in ATOMIC, this can solve the problem.
Or optimize blk_mq_update_tag_set_depth to avoid such delay

The patch below is one possible fix, as I can't see if it can introduce
other regression issue, so could you guy look into it and give some advice?

/Thanks



0001-Avoid-long-delay-for-scsi-scanning-when-thousands-of.patch

>From 4505e8d499a7c336ebc3b588b089f5408841b421 Mon Sep 17 00:00:00 2001
From: Jason Luo <zhangqing.luo@...cle.com>
Date: Mon, 19 Oct 2015 14:51:51 +0800
Subject: [PATCH] Avoid long delay for scsi scanning when thousands of LUNs

Signed-off-by: Jason Luo <zhangqing.luo@...cle.com>
---
  block/blk-mq-sysfs.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index b79685e..06533a9 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -404,7 +404,9 @@ static void blk_mq_sysfs_init(struct request_queue *q)
  /* see blk_register_queue() */
  void blk_mq_finish_init(struct request_queue *q)
  {
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
+	/* Maybe some other things can be done here in the future
+	 * leave it with empty */
+	/*percpu_ref_switch_to_percpu(&q->mq_usage_counter);*/
  }

  int blk_mq_register_disk(struct gendisk *disk)
-- 2.1.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/