Message-ID: <b10dd129-76e0-0950-c647-cc0e1236383c@nvidia.com>
Date:   Mon, 17 Oct 2022 10:46:35 +0000
From:   Chaitanya Kulkarni <chaitanyak@...dia.com>
To:     Ming Lei <ming.lei@...hat.com>
CC:     "linux-block@...r.kernel.org" <linux-block@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "axboe@...nel.dk" <axboe@...nel.dk>,
        "damien.lemoal@...nsource.wdc.com" <damien.lemoal@...nsource.wdc.com>,
        "johannes.thumshirn@....com" <johannes.thumshirn@....com>,
        "bvanassche@....org" <bvanassche@....org>,
        "shinichiro.kawasaki@....com" <shinichiro.kawasaki@....com>,
        "vincent.fu@...sung.com" <vincent.fu@...sung.com>,
        "yukuai3@...wei.com" <yukuai3@...wei.com>
Subject: Re: [PATCH] null_blk: allow teardown on request timeout

On 10/17/22 03:16, Ming Lei wrote:
> On Mon, Oct 17, 2022 at 10:04:26AM +0000, Chaitanya Kulkarni wrote:
>> On 10/17/22 02:50, Ming Lei wrote:
>>> On Mon, Oct 17, 2022 at 09:30:47AM +0000, Chaitanya Kulkarni wrote:
>>>>
>>>>>> +	/*
>>>>>> +	 * Unblock any pending dispatch I/Os before we destroy the device.
>>>>>> +	 * From null_destroy_dev()->del_gendisk() will set GD_DEAD flag
>>>>>> +	 * causing any new I/O from __bio_queue_enter() to fail with -ENODEV.
>>>>>> +	 */
>>>>>> +	blk_mq_unquiesce_queue(nullb->q);
>>>>>> +
>>>>>> +	null_destroy_dev(nullb);
>>>>>
>>>>> destroying device is never good cleanup for handling timeout/abort, and it
>>>>> should have been the last straw any time.
>>>>>
>>>>
>>>> That is exactly why I've added the rq_abort_limit, so until the limit
>>>> is not reached null_abort_work() will not get scheduled and device is
>>>> not destroyed.
>>>
>>> I meant destroying device should only be done iff the normal abort handler
>>> can't recover the device, however, your patch simply destroys device
>>> without running any abort handling.
>>>
>>
>> I did not understand your comment, can you please elaborate on exactly
>> where and which abort handlers needs to be called in this patch before
>> null_destroy_nullb() ?
> 
> In case of request timeout, there may be something wrong which needs
> to be recovered.
> 

In the case of null_blk there is no real backend controller, so there is
nothing to try to recover. The only recovery scenario exercised is
allowing multiple requests to time out and waiting for rq_abort_limit
to be reached before teardown.
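
To make the gating concrete, here is a minimal userspace sketch of the
counting logic described above. It is not the driver code from the patch:
struct abort_state, nr_aborts, teardown_scheduled and
abort_state_on_timeout() are hypothetical names for illustration, and
treating a limit of 0 the same as 1 is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical userspace model; not the null_blk driver code. */
struct abort_state {
	unsigned int rq_abort_limit;	/* timeouts tolerated before teardown */
	unsigned int nr_aborts;		/* timed-out requests seen so far */
	bool teardown_scheduled;	/* stands in for scheduling null_abort_work() */
};

/*
 * Called once per timed-out request. Returns true only on the call that
 * schedules the teardown; later timeouts are ignored.
 * Assumption: a limit of 0 or 1 tears down on the first timeout.
 */
static bool abort_state_on_timeout(struct abort_state *st)
{
	if (st->teardown_scheduled)
		return false;

	st->nr_aborts++;
	if (st->nr_aborts < st->rq_abort_limit)
		return false;

	st->teardown_scheduled = true;
	return true;
}
```

With rq_abort_limit set to 3, the first two timeouts leave the device in
place and only the third one schedules the teardown.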

>>
>> the objective of this patch it to simulate the teardown scenario
>> from timeout handler so it can get tested on regular basis with
>> null_blk ...
> 
> Why does teardown scenario have to be triggered for timeout? That

The ideal way is to read the controller status from the timeout handler
and check whether it is recoverable. If the controller is not recoverable,
we need to gracefully shut down the device; otherwise, continuing to issue
I/Os to a non-recoverable device can cause more damage to the device and
potentially to the system. I've encountered this scenario during an SSD
qualification process: the SSD was getting hot because the device F/W had
a bug in temperature control reporting, and it became non-responsive,
timing out requests. The lack of teardown made the system non-responsive,
and we only figured it out by logging the temperature with vendor-unique
commands.
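
As a rough illustration of that decision, here is a small userspace
sketch. The enums and timeout_decide() are hypothetical names, not an
existing block layer or null_blk API. The has_backend == false branch
models null_blk, where no controller status can be read and only the
abort limit is available as a signal.

```c
#include <stdbool.h>

/* Hypothetical model of the timeout decision; not kernel code. */
enum ctrl_status { CTRL_RECOVERABLE, CTRL_NON_RECOVERABLE };

enum timeout_action {
	ACT_KEEP_WAITING,	/* leave the device alone for now */
	ACT_RUN_ABORT_HANDLING,	/* try to recover the controller */
	ACT_TEARDOWN,		/* cancel I/Os and remove the device */
};

static enum timeout_action timeout_decide(bool has_backend,
					  enum ctrl_status status,
					  bool abort_limit_reached)
{
	/* No backend (null_blk): status is meaningless, use the limit. */
	if (!has_backend)
		return abort_limit_reached ? ACT_TEARDOWN : ACT_KEEP_WAITING;

	/* Real backend: tear down only if the controller cannot recover. */
	if (status == CTRL_NON_RECOVERABLE)
		return ACT_TEARDOWN;

	return ACT_RUN_ABORT_HANDLING;
}
```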

> looks you think teardown & destroying device for timeout is one normal
> and common way, but I think it is not, the device shouldn't be removed

No, I do not think that. null_blk has no backend whose status can be
checked. As explained earlier, the decision to remove the device should
only be made after reading the controller's state and confirming that it
is in a non-recoverable state. That check is not possible for null_blk,
so we cannot call abort routines before destroying the device.

> if it still can work. I have got such kind of complaints of disk

Of course it can still work, see the explanation above, but if the device
status is non-recoverable then it should be allowed to tear down
gracefully, canceling the I/Os and removing the device from the system.

> disappeared just by request timeout, such as, nvme-pci.
> 

It would be great if we could start a new thread on the linux-nvme list
to address the complaints that you have received. I'll be happy to review
and reply.

-ck
