linux-kernel - Re: [Query] increased latency observed in cpu hotplug path

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3a5750bd-c4e6-8f41-1dc0-803ba8da944b@codeaurora.org>
Date:	Tue, 16 Aug 2016 11:23:40 +0530
From:	"Khan, Imran" <kimran@...eaurora.org>
To:	Akinobu Mita <akinobu.mita@...il.com>
Cc:	"\"axboe"@kernel.dk, axboe@...com, hch@....de,
	"tom.leiming\""@gmail.com, "\"linux-block"@vger.kernel.org,
	"\"\\\"linux-kernel\\\\\\\"\\\"@vger.kernel.org;tsoni\"@codeaurora.org;kaushalk"@codeaurora.org
Subject: Re: [Query] increased latency observed in cpu hotplug path

On 8/5/2016 12:49 PM, Khan, Imran wrote:
> On 8/1/2016 2:58 PM, Khan, Imran wrote:
>> On 7/30/2016 7:54 AM, Akinobu Mita wrote:
>>> 2016-07-28 22:18 GMT+09:00 Khan, Imran <kimran@...eaurora.org>:
>>>>
>>>> Hi,
>>>>
>>>> Recently we have observed some increased latency in CPU hotplug
>>>> event in CPU online path. For online latency we see that block
>>>> layer is executing notification handler for CPU_UP_PREPARE event
>>>> and this in turn waits for RCU grace period resulting (sometimes)
>>>> in an execution time of 15-20 ms for this notification handler.
>>>> This change was not there in 3.18 kernel but is present in 4.4
>>>> kernel and was introduced by following commit:
>>>>
>>>>
>>>> commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba
>>>> Author: Akinobu Mita <akinobu.mita@...il.com>
>>>> Date:   Sun Sep 27 02:09:23 2015 +0900
>>>>
>>>>     blk-mq: avoid inserting requests before establishing new mapping
>>>
>>> ...
>>>
>>>> Upon reverting this commit I could see an improvement of 15-20 ms in
>>>> online latency. So I am looking for some help in analyzing the effects
>>>> of reverting this or should some other approach to reduce the online
>>>> latency must be taken.
>>>
>>> Can you observe the difference in online latency by removing
>>> get_online_cpus() and put_online_cpus() pair in blk_mq_init_allocated_queue()
>>> instead of full reverting the commit?
>>>
>> Hi Akinobu,
>> I tried your suggestion but could not achieve any improvement. Actually the snippet that is causing the change in latency is the following one :
>>
>> list_for_each_entry(q, &all_q_list, all_q_node) {
>>                 blk_mq_freeze_queue_wait(q);
>>
>>                 /*
>>                  * timeout handler can't touch hw queue during the
>>                  * reinitialization
>>                  */
>>                 del_timer_sync(&q->timeout);
>>  }
>>
>> I understand that this is getting executed now for CPU_UP_PREPARE as well resulting in 
>> increased latency in the cpu online path. I am trying to reduce this latency while keeping the 
>> purpose of this commit intact. I would welcome further suggestions/feedback in this regard.
>>
> Hi Akinobu,
> 
> I am not able to reduce the cpu online latency with this patch, could you please let me know what
> functionality will be broken, if we avoid this patch in our kernel. Also if you have some other 
> suggestions towards improving this patch please let me know.
> 
After moving the remapping of queues to block layer's kworker I see that online latency has improved
while offline latency remains the same. As the freezing of queues happens in the context of block layer's
worker, I think it would be better to do the remapping in the same context and then go ahead with freezing.
In this regard I have made following change:

commit b2131b86eeef4c5b1f8adaf7a53606301aa6b624
Author: Imran Khan <kimran@...eaurora.org>
Date:   Fri Aug 12 19:59:47 2016 +0530

    blk-mq: Move block queue remapping from cpu hotplug path

    During a cpu hotplug, the hardware and software contexts mappings
    need to be updated in order to take into account requests
    submitted for the hotadded CPU. But if this mapping is done
    in hotplug notifier, it deteriorates the hotplug latency.
    So move the block queue remapping to block layer worker which
    results in significant improvements in hotplug latency.

    Change-Id: I01ac83178ce95c3a4e3b7b1b286eda65ff34e8c4
    Signed-off-by: Imran Khan <kimran@...eaurora.org>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..06fcf89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,7 +22,11 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
-
+#include <linux/kthread.h>
+#include <linux/mutex.h>
+#include <linux/sched.h>
+#include <linux/kthread.h>
+#include <linux/mutex.h>
 #include <trace/events/block.h>

 #include <linux/blk-mq.h>
@@ -32,10 +36,18 @@

 static DEFINE_MUTEX(all_q_mutex);
 static LIST_HEAD(all_q_list);
-
 static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx);

 /*
+ * New online cpumask which is going to be set in this hotplug event.
+ * Declare this cpumasks as global as cpu-hotplug operation is invoked
+ * one-by-one and dynamically allocating this could result in a failure.
+ */
+static struct cpumask online_new;
+
+static struct work_struct blk_mq_remap_work;
+
+/*
  * Check if any of the ctx's have pending work in this hardware queue
  */
 static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
@@ -2125,14 +2137,7 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
                                      unsigned long action, void *hcpu)
 {
-       struct request_queue *q;
        int cpu = (unsigned long)hcpu;
-       /*
-        * New online cpumask which is going to be set in this hotplug event.
-        * Declare this cpumasks as global as cpu-hotplug operation is invoked
-        * one-by-one and dynamically allocating this could result in a failure.
-        */
-       static struct cpumask online_new;

        /*
         * Before hotadded cpu starts handling requests, new mappings must
@@ -2155,43 +2160,17 @@ static int blk_mq_queue_reinit_notify(struct notifier_block *nb,
        case CPU_DEAD:
        case CPU_UP_CANCELED:
                cpumask_copy(&online_new, cpu_online_mask);
+               kblockd_schedule_work(&blk_mq_remap_work);
                break;
        case CPU_UP_PREPARE:
                cpumask_copy(&online_new, cpu_online_mask);
                cpumask_set_cpu(cpu, &online_new);
+               kblockd_schedule_work(&blk_mq_remap_work);
                break;
        default:
                return NOTIFY_OK;
        }

-       mutex_lock(&all_q_mutex);
-
-       /*
-        * We need to freeze and reinit all existing queues.  Freezing
-        * involves synchronous wait for an RCU grace period and doing it
-        * one by one may take a long time.  Start freezing all queues in
-        * one swoop and then wait for the completions so that freezing can
-        * take place in parallel.
-        */
-       list_for_each_entry(q, &all_q_list, all_q_node)
-               blk_mq_freeze_queue_start(q);
-       list_for_each_entry(q, &all_q_list, all_q_node) {
-               blk_mq_freeze_queue_wait(q);
-
-               /*
-                * timeout handler can't touch hw queue during the
-                * reinitialization
-                */
-               del_timer_sync(&q->timeout);
-       }
-
-       list_for_each_entry(q, &all_q_list, all_q_node)
-               blk_mq_queue_reinit(q, &online_new);
-
-       list_for_each_entry(q, &all_q_list, all_q_node)
-               blk_mq_unfreeze_queue(q);
-
-       mutex_unlock(&all_q_mutex);
        return NOTIFY_OK;
 }

@@ -2357,11 +2336,48 @@ void blk_mq_enable_hotplug(void)
        mutex_unlock(&all_q_mutex);
 }

+static void blk_remap_work(struct work_struct *work)
+{
+       struct request_queue *q;
+
+       /* Start the task of queue remapping */
+       mutex_lock(&all_q_mutex);
+
+       /*
+        * We need to freeze and reinit all existing queues.  Freezing
+        * involves synchronous wait for an RCU grace period and doing it
+        * one by one may take a long time.  Start freezing all queues in
+        * one swoop and then wait for the completions so that freezing can
+        * take place in parallel.
+        */
+       list_for_each_entry(q, &all_q_list, all_q_node)
+               blk_mq_freeze_queue_start(q);
+       list_for_each_entry(q, &all_q_list, all_q_node) {
+               blk_mq_freeze_queue_wait(q);
+
+               /*
+                * timeout handler can't touch hw queue during the
+                * reinitialization
+                */
+               del_timer_sync(&q->timeout);
+       }
+
+       list_for_each_entry(q, &all_q_list, all_q_node)
+               blk_mq_queue_reinit(q, &online_new);
+
+       list_for_each_entry(q, &all_q_list, all_q_node)
+               blk_mq_unfreeze_queue(q);
+
+       mutex_unlock(&all_q_mutex);
+}
+
+
 static int __init blk_mq_init(void)
 {
        blk_mq_cpu_init();

        hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
+       INIT_WORK(&blk_mq_remap_work, blk_remap_work);

        return 0;
 }

I have run overnight tests of fsstress and multiple dd commands along with cpu hotplugging.
Could you please review this change and let me know if you see any issues with this change or if I need 
to test some specific corner cases to validate this change.

-- 
Imran Khan
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a\nmember of the Code Aurora Forum, hosted by The Linux Foundation