linux-kernel - RE: [PATCH] blk-mq: avoid repeatedly scheduling the same work to run hardware queue

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <BL0PR2101MB1123D0597E3759918F3A0B11CE700@BL0PR2101MB1123.namprd21.prod.outlook.com>
Date:   Fri, 15 Nov 2019 23:25:13 +0000
From:   Long Li <longli@...rosoft.com>
To:     Damien Le Moal <Damien.LeMoal@....com>,
        "longli@...uxonhyperv.com" <longli@...uxonhyperv.com>,
        Jens Axboe <axboe@...nel.dk>, Ming Lei <ming.lei@...hat.com>,
        Sagi Grimberg <sagi@...mberg.me>,
        Keith Busch <keith.busch@...el.com>,
        Bart Van Assche <bvanassche@....org>,
        "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] blk-mq: avoid repeatedly scheduling the same work to run
 hardware queue

>Subject: Re: [PATCH] blk-mq: avoid repeatedly scheduling the same work to
>run hardware queue
>
>On 2019/11/15 7:51, longli@...uxonhyperv.com wrote:
>> From: Long Li <longli@...rosoft.com>
>>
>> SCSI layer calls blk_mq_run_hw_queues() in scsi_end_request(), for
>> every completed I/O. blk_mq_run_hw_queues() in turn schedules some
>> works to run the hardware queues.
>>
>> The actual work is queued by mod_delayed_work_on(), it turns out the
>> cost of this function is high on locking and CPU usage, when the I/O
>> workload has high queue depth. Most of these calls are not necessary
>> since the queue is already scheduled to run, and has not run yet.
>>
>> This patch tries to solve this problem by avoiding scheduling work
>> when it's already scheduled.
>>
>> Benchmark results:
>> The following tests are run on a RAM backed virtual disk on Hyper-V,
>> with 8 FIO jobs with 4k random read I/O. The test numbers are for IOPS.
>>
>> queue_depth	pre-patch	after-patch	improvement
>> 16		190k		190k		0%
>> 64		235k		240k		2%
>> 256		180k		256k		42%
>> 1024		156k		250k		60%
>>
>> Signed-off-by: Long Li <longli@...rosoft.com>
>> ---
>>  block/blk-mq.c         | 12 ++++++++++++
>>  include/linux/blk-mq.h |  1 +
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c index
>> ec791156e9cc..a882bd65167a 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1476,6 +1476,16 @@ static void
>__blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
>>  		put_cpu();
>>  	}
>>
>> +	/*
>> +	 * Queue a work to run queue. If this is a non-delay run and the
>> +	 * work is already scheduled, avoid scheduling the same work again.
>> +	 */
>> +	if (!msecs) {
>> +		if (test_bit(BLK_MQ_S_WORK_QUEUED, &hctx->state))
>> +			return;
>
>With this change, if the kblockd work is already scheduled with a delay, then
>the current no-delay run request will incur a delay because
>kblockd_mod_delayed_work_on() is not called, implying that
>__queue_delayed_work() does not execute __queue_work() as mandated
>by the 0 delay. The work is *not* started immediately.

BLK_MQ_S_WORK_QUEUED is not touched if this is for a delayed kblockd work, so the following non-delayed runs will get immediately scheduled.

But I think you have raised a valid point for delayed runs, consider the following sequence with three calls with quick succession.

1. __blk_mq_delay_run_hw_queue with no delay (this will set BLK_MQ_S_WORK_QUEUED)
2. __blk_mq_delay_run_hw_queue with a delay (this will modify the kblockd work if the work is not fired)
3. __blk_mq_delay_run_hw_queue with no delay (this will not do anything since BLK_MQ_S_WORK_QUEUED is already set)

This sequence can potentially wait until 2 fires. Ideally it should fire immediately at step 3.

I will send v2 to address this.

>
>While your results show improvements of IOPS at high queue depth, doesn't
>this change degrade IOPS and especially latency at low queue depth ?

Here are additional tests done in Azure and Hyper-v, with the patch:
NVMe showed little difference at all queue depths. (Samsung P983 and P963)
SCSI showed little difference when queue depth <64. (RAM disk and local VHD)


Long

>
>> +		set_bit(BLK_MQ_S_WORK_QUEUED, &hctx->state);
>> +	}
>> +
>>  	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
>&hctx->run_work,
>>  				    msecs_to_jiffies(msecs));
>>  }
>> @@ -1561,6 +1571,7 @@ void blk_mq_stop_hw_queue(struct
>blk_mq_hw_ctx *hctx)
>>  	cancel_delayed_work(&hctx->run_work);
>>
>>  	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
>> +	clear_bit(BLK_MQ_S_WORK_QUEUED, &hctx->state);
>>  }
>>  EXPORT_SYMBOL(blk_mq_stop_hw_queue);
>>
>> @@ -1626,6 +1637,7 @@ static void blk_mq_run_work_fn(struct
>work_struct *work)
>>  	struct blk_mq_hw_ctx *hctx;
>>
>>  	hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);
>> +	clear_bit(BLK_MQ_S_WORK_QUEUED, &hctx->state);
>>
>>  	/*
>>  	 * If we are stopped, don't run the queue.
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index
>> 0bf056de5cc3..98269d3fd141 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -234,6 +234,7 @@ enum {
>>  	BLK_MQ_S_STOPPED	= 0,
>>  	BLK_MQ_S_TAG_ACTIVE	= 1,
>>  	BLK_MQ_S_SCHED_RESTART	= 2,
>> +	BLK_MQ_S_WORK_QUEUED	= 3,
>>
>>  	BLK_MQ_MAX_DEPTH	= 10240,
>>
>>
>
>
>--
>Damien Le Moal
>Western Digital Research