linux-kernel - Re: [bug report] block: Non-NCQ commands will never be executed while fio is continuously running

Message-ID: <04cf3f31-4bd8-3ce9-867a-41628e56e861@huaweicloud.com>
Date: Thu, 19 Sep 2024 20:26:52 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Damien Le Moal <dlemoal@...nel.org>, Niklas Cassel <cassel@...nel.org>,
 yangxingui <yangxingui@...wei.com>
Cc: axboe@...nel.dk, John Garry <john.g.garry@...cle.com>,
 linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 James.Bottomley@...senpartnership.com,
 "Martin K. Petersen" <martin.petersen@...cle.com>,
 "yukuai (C)" <yukuai3@...wei.com>,
 "yangerkun@...wei.com" <yangerkun@...wei.com>
Subject: Re: [bug report] block: Non-NCQ commands will never be executed while
 fio is continuously running

Hi,

On 2024/09/11 6:38, Damien Le Moal wrote:
> On 9/10/24 20:27, Niklas Cassel wrote:
>> On Tue, Sep 10, 2024 at 02:34:06PM +0800, yangxingui wrote:
>>>
>>>
>>> On 2024/9/10 12:45, Damien Le Moal wrote:
>>>> On 9/10/24 10:09 AM, yangxingui wrote:
>>>>>
>>>>>
>>>>> On 2024/9/9 21:21, Damien Le Moal wrote:
>>>>>> On 9/9/24 22:10, yangxingui wrote:
>>>>>>> Hello axboe & John,
>>>>>>>
>>>>>>> After the driver exposes all HW queues to the block layer, non-NCQ
>>>>>>> commands will never be executed while fio is continuously running, such
>>>>>>> as a smartctl command.
>>>>>>>
>>>>>>> The cause of the problem is that other hctx used by the NCQ command is
>>>>>>> still active and can continue to issue NCQ commands to the sata disk.
>>>>>>> And the pio command keeps retrying in its corresponding hctx because
>>>>>>> qc_defer() always returns true.
>>>>>>>
>>>>>>> hctx0: ncq, pio, ncq
>>>>>>> hctx1: ncq, ncq, ...
>>>>>>> ...
>>>>>>> hctxn: ncq, ncq, ...
>>>>>>>
>>>>>>> Is there any good solution for this?
>>>>>>
>>>>>> SATA devices are single queue so how can you have multiple queues ?
>>>>>> What adapter are you using ?
>>>>>
>>>>> In the following patch, we expose the host's 16 hardware queues to the block
>>>>> layer. And when connecting to a sata disk, 16 hctx are used.
>>>>>
>>>>> 8d98416a55eb ("scsi: hisi_sas: Switch v3 hw to MQ")
>>>>
>>>> OK, so the HBA is a hisi one, using libsas...
>>>> What is the device ? An SSD ? An HDD ?
>>> Both SATA SSD and SATA HDD have this problem.
>>>
>>>>
>>>> Do you set a block I/O scheduler for the drive, e.g. mq-deadline. If not, does
>>>> setting a scheduler resolve the issue ?
>>> Currently, the default configuration mq-deadline is used, and the same
>>> phenomenon occurs when I try setting it to none. It seems to have nothing to
>>> do with the scheduling strategy.
>>>
>>>>
>>>> I do not have any hisi HBA. I use a lot of mpt3sas and mpi3mr HBAs which also
>>>> have multiple queues with a shared tagset. Never seen the issue you are
>>>> reporting though using HDDs with mq-deadline or bfq as the scheduler.
>>> Unlike libsas, as these hosts don't use qc_defer()?
>>
>> mpt3sas and mpi3mr do not use any libata code at all, the SCSI to ATA
>> Translation (SAT) is done completely by the HBA, so from a Linux
>> perspective, we are issuing SCSI commands to the HBA.
> 
> Yes, but we still can get requeue happening. Though for a SATA drive, that is
> unlikely since the max queue depth is clearly defined, unlike for SAS drives
> 
>> We can see that libsas uses ata_std_qc_defer() as its .qc_defer callback:
>> https://github.com/torvalds/linux/blob/v6.11-rc7/drivers/scsi/libsas/sas_ata.c#L566
> 
> And that may be the issue. More on this below.
> 
>> Without considering if it is a good idea or not, it should be possible to
>> translate some commands to instead use the "NCQ encapsulated" variant of
>> the ATA command that was used in the "ATA-16 passthrough" SCSI command.
> 
> That would be way too much work on the user side, and likely open up a can of
> device bugs unseen until now.
> 
>> To be able to send a non-queued command, there has to be no NCQ commands queued
>> on the device. I guess you could implement a scheduler that would be quiescing
>> the queue, processes the non-queued command, and then thaw the queue, but that
>> would essentially make non-queued commands high priority commands, and could
>> thus be used to seriously limit throughput by just sending some non-queued
>> commands every now and then :)
> 
> Passthrough commands do not go through the scheduler and are submitted directly
> to the dispatch queue, generally at the head of it (see blk_mq_insert_request()).
> 
> So for a single queue device, even if ata_qc_defer causes a requeue, the
> passthrough command ends up back at the top of the dispatch queue. After
> repeating this a few times, all in-flight NCQ commands complete and the
> passthrough command goes through.
> 
> But I feel this is very fragile given that the block layer requeue is done
> through a work item, so in parallel to an application submitting IOs. So in
> theory, I think that the requeue for the passthrough command could happen forever...
> 
> And for a multi-queue setup like with the hisi adapter, that is what is happening.
> 
> I do not have any good idea how to fix that yet. We need to find something.
> scsi_queue_rq() and the budget/host or device blocked state management may help
> with that, or we have a bug there... In any case, I do not think it is a block
> layer issue as the block layer knows nothing about NCQ vs non-NCQ.

Does libata return a specific value in this case? If so, maybe we can
stop other hctx until this IO is handled.

For now, I think libata should use a single hctx; it just doesn't support
multiple hctx yet.

Thanks,
Kuai
