linux-kernel - Re: [PATCH] scsi: ata: Fix a race condition between scsi error handler and ahci interrupt


Message-ID: <977879af-8603-82ae-07ad-38be3a27194d@huaweicloud.com>
Date:   Mon, 14 Aug 2023 21:20:28 +0800
From:   Li Nan <linan666@...weicloud.com>
To:     Damien Le Moal <dlemoal@...nel.org>
Cc:     linux-ide@...r.kernel.org, linux-kernel@...r.kernel.org,
        linan122@...wei.com, yukuai3@...wei.com, yi.zhang@...wei.com,
        houtao1@...wei.com, yangerkun@...wei.com, jianghong011@...wei.com,
        zhangcheng75@...wei.com
Subject: Re: [PATCH] scsi: ata: Fix a race condition between scsi error
 handler and ahci interrupt


On 2023/8/14 15:50, Damien Le Moal wrote:
> On 8/14/23 15:41, Li Nan wrote:
>>> This is definitely not correct because EH may have been scheduled for a non
>>> fatal action like a device revalidate or to get sense data for successful
>>> commands. With this change, the port will NOT be frozen when a hard error IRQ
>>> comes while EH is waiting to start, that is, while EH waits for all commands to
>>> complete first.
>>>
>>
>> Yeah, we should find a better way to fix it. Do you have any suggestions?
>>
>>> Furthermore, if you get an IRQ that requires the port to be frozen, it means
>>> that you had a failed command. In that case, the drive is in error state per
>>> ATA specs and stops all communication until a read log 10h command is issued.
>>> So you should never ever see 2 error IRQs one after the other. If you do, it
>>> very likely means that you have buggy hardware.
>>>
>>> How do you get into this situation ? What adapter and disk are you using ?
>>>
>>
>>   > How do you get into this situation ?
>> The first IRQ is an I/O error, the second IRQ is a disk link flash break.
> 
> What does "link flash break" mean ?
> 
>>
>>   > What adapter and disk are you using ?
>> It is a disk developed by our company, but we think the same issue
>> exists when using other disks.
> 
> As I said, I find this situation highly suspect because if the first IRQ was to
> signal an IO error that the drive reported, then per ATA specifications, the
> drive should be in error mode and should NOT have transmitted any other FIS
> after the SDB FIS that signaled the error. Nothing at all should come after that
> error SDB FIS, until the host issues a read log 10h to get the drive out of
> error state.
> 
> If this is a prototype device, I would recommend that you take an ATA bus trace
> and verify the FIS traffic. Something fishy is going on with the drive in my
> opinion.
> 

Thank you for your patient explanation. I'm sorry I didn't explain the
problem clearly before. After discussing with my colleagues who know
more about drivers, let me re-describe the problem.

The problem occurs when the SATA link is quickly disconnected and
reconnected. For example, while an I/O error is being processed in the
error handling thread, the disk is manually removed and reinserted, and
the AHCI chip reports a hot-plug interrupt.

This scenario is not just an NCQ error: the disk is removed and quickly
reinserted before error handling of the first error has completed. The
disk status then needs to be restored once that error handling is
complete.

-- 
Thanks,
Nan