Message-ID: <2cbf591c-8284-8499-7804-e7078cf274d2@huawei.com>
Date:   Wed, 13 Nov 2019 14:57:33 +0000
From:   John Garry <john.garry@...wei.com>
To:     Hannes Reinecke <hare@...e.de>,
        "axboe@...nel.dk" <axboe@...nel.dk>,
        "jejb@...ux.ibm.com" <jejb@...ux.ibm.com>,
        "martin.petersen@...cle.com" <martin.petersen@...cle.com>
CC:     "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        "ming.lei@...hat.com" <ming.lei@...hat.com>,
        "hare@...e.com" <hare@...e.com>,
        "bvanassche@....org" <bvanassche@....org>,
        "chenxiang (M)" <chenxiang66@...ilicon.com>
Subject: Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset

On 13/11/2019 14:06, Hannes Reinecke wrote:
> On 11/13/19 2:36 PM, John Garry wrote:
>> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
>> multiple reply queues with single hostwide tags.
>>
>> In addition, these drivers want to use interrupt assignment in
>> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
>> CPU hotplug may cause in-flight IO completion to not be serviced when an
>> interrupt is shut down.
>>
>> To solve that problem, Ming's patchset to drain hctx's should ensure no
>> IOs are missed in-flight [1].
>>
>> However, to take advantage of that patchset, we need to map the HBA HW
>> queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.
>>
>> In making that transition, the per-SCSI command request tags are no
>> longer unique per Scsi host - they are just unique per hctx. As such, the
>> HBA LLDD would have to generate this tag internally, which has a certain
>> performance overhead.
>>
>> However another problem is that blk mq assumes the host may accept
>> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
>> host busy counter, which would stop the LLDD being sent more than
>> .can_queue commands; however, we should still ensure that the block layer
>> does not issue more than .can_queue commands to the Scsi host.
>>
>> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
>> may be requested when allocating the tagset.
>>
>> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
>> tagset.
>>
>> This is based on work originally from Ming Lei in [3].
>>
>> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
>> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
>> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
>> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
>>
>> Signed-off-by: John Garry <john.garry@...wei.com>
>> ---
>>   block/blk-core.c       |  1 +
>>   block/blk-flush.c      |  2 +
>>   block/blk-mq-debugfs.c |  2 +-
>>   block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>>   block/blk-mq-tag.h     |  1 +
>>   block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>>   block/blk-mq.h         |  9 +++++
>>   include/linux/blk-mq.h |  3 ++
>>   include/linux/blkdev.h |  1 +
>>   9 files changed, 155 insertions(+), 10 deletions(-)
>>
> [ .. ]
>> @@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>>   		blk_mq_tag_busy(data->hctx);
>>   	}
>>   
>> -	tag = blk_mq_get_tag(data);
>> -	if (tag == BLK_MQ_TAG_FAIL) {
>> -		if (clear_ctx_on_error)
>> -			data->ctx = NULL;
>> -		blk_queue_exit(q);
>> -		return NULL;
>> +	if (data->hctx->shared_tags) {
>> +		shared_tag = blk_mq_get_shared_tag(data);
>> +		if (shared_tag == BLK_MQ_TAG_FAIL)
>> +			goto err_shared_tag;
>>   	}
>>   
>> -	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
>> +	tag = blk_mq_get_tag(data);
>> +	if (tag == BLK_MQ_TAG_FAIL)
>> +		goto err_tag;
>> +
>> +	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
>>   	if (!op_is_flush(data->cmd_flags)) {
>>   		rq->elv.icq = NULL;
>>   		if (e && e->type->ops.prepare_request) {

Hi Hannes,

> Why do you need to keep a parallel tag accounting between 'normal' and
> 'shared' tags here?
> Isn't it sufficient to get a shared tag only, and use that in lieu of the
> 'real' one?

In theory, yes. Just the 'shared' tag should be adequate.

A problem I see with this approach is that we lose track of which tags
are allocated for each hctx. As an example, consider
blk_mq_queue_tag_busy_iter(), which iterates over the bits for each hctx.
If we used shared tags only, that would no longer work.

Consider blk_mq_can_queue() as another example: today it tells us whether
a hctx has any bits unset, whereas with shared tags only it would tell us
whether any bits are unset across all queues, and this change in
semantics could break things. At a glance, __blk_mq_tag_idle() also looks
problematic.

And this is where it becomes messy to implement.
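
To make the semantic difference concrete, here is a toy userspace model.
It is only a sketch: it is not kernel code, every toy_* name is
hypothetical, and plain integer bitmaps stand in for the real tag
bitmaps. It assumes two hctxs and a hostwide depth of four.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy userspace model, NOT kernel code: every name here is hypothetical. */
#define NR_HCTX     2            /* number of hardware queues */
#define QUEUE_DEPTH 4            /* bits per bitmap (think .can_queue) */
#define FULL_MASK   ((1u << QUEUE_DEPTH) - 1)

struct toy_tagset {
	uint32_t hctx_bits[NR_HCTX]; /* one bitmap per hctx */
	uint32_t shared_bits;        /* one hostwide bitmap */
};

/* Claim the lowest clear bit in *bits; return -1 if the bitmap is full. */
static int toy_get_bit(uint32_t *bits)
{
	for (int i = 0; i < QUEUE_DEPTH; i++) {
		if (!(*bits & (1u << i))) {
			*bits |= 1u << i;
			return i;
		}
	}
	return -1;
}

/* Dual accounting: take a shared tag first, then a per-hctx tag. */
static int toy_alloc(struct toy_tagset *ts, int hctx)
{
	int shared = toy_get_bit(&ts->shared_bits);

	if (shared < 0)
		return -1;
	if (toy_get_bit(&ts->hctx_bits[hctx]) < 0) {
		ts->shared_bits &= ~(1u << shared); /* unwind the shared tag */
		return -1;
	}
	return shared;
}

/* Per-hctx question: how many requests are in flight on this queue? */
static int toy_hctx_busy(const struct toy_tagset *ts, int hctx)
{
	int n = 0;

	for (int i = 0; i < QUEUE_DEPTH; i++)
		n += (ts->hctx_bits[hctx] >> i) & 1;
	return n;
}

/* Does this hctx have any bit unset? */
static bool toy_hctx_can_queue(const struct toy_tagset *ts, int hctx)
{
	return ts->hctx_bits[hctx] != FULL_MASK;
}

/* Is any bit unset across all queues? A different question. */
static bool toy_shared_can_queue(const struct toy_tagset *ts)
{
	return ts->shared_bits != FULL_MASK;
}
```

In this model, after three allocations on hctx 0 and one on hctx 1, the
shared bitmap is full while hctx 1 still has unset bits, so the two
"can queue" checks disagree. The shared bitmap alone also cannot say
which hctx owns which bit, so a per-hctx iteration has nothing to walk.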

> 
> I would love to combine both,

Same here...

> as then we can easily do a reverse mapping
> by using the 'tag' value to lookup the command itself, and can possibly
> do the 'scsi_cmd_priv' trick of embedding the LLDD-specific parts within
> the command. With this split we'll be wasting quite some memory there,
> as the possible 'tag' values are actually nr_hw_queues * shared_tags.

Yeah, understood. That's just a trade-off I saw.
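
For a rough feel of that trade-off, the arithmetic can be sketched as
below. The helper names and the example figures (16 queues, 4096 shared
tags, 256 bytes of per-command data) are made up for illustration and
are not taken from any driver. The sketch assumes that at most
shared_tags commands can be in flight at once, so any slots beyond that
count are never in use at the same time.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helpers, for illustration only.
 *
 * With the split scheme each hctx has its own tag space, so the number
 * of possible tag values is nr_hw_queues * shared_tags.
 */
static size_t toy_split_slots(size_t nr_hw_queues, size_t shared_tags)
{
	return nr_hw_queues * shared_tags;
}

/*
 * Bytes of per-command data that can never all be in use at once,
 * assuming at most shared_tags commands are in flight hostwide.
 */
static size_t toy_wasted_bytes(size_t nr_hw_queues, size_t shared_tags,
			       size_t per_cmd_bytes)
{
	if (nr_hw_queues == 0)
		return 0;
	return (toy_split_slots(nr_hw_queues, shared_tags) - shared_tags) *
	       per_cmd_bytes;
}
```

With those example figures, toy_split_slots(16, 4096) gives 65536 slots
and toy_wasted_bytes(16, 4096, 256) gives 15728640 bytes, i.e. 15 MiB.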

Thanks,
John
