Message-ID: <c2ae28b7-a105-9cd6-bf2e-63051a4000b0@huaweicloud.com>
Date: Mon, 14 Aug 2023 14:41:48 +0800
From: Li Nan <linan666@...weicloud.com>
To: Damien Le Moal <dlemoal@...nel.org>
Cc: linux-ide@...r.kernel.org, linux-kernel@...r.kernel.org,
linan122@...wei.com, yukuai3@...wei.com, yi.zhang@...wei.com,
houtao1@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH] scsi: ata: Fix a race condition between scsi error
handler and ahci interrupt
On 2023/8/10 10:49, Damien Le Moal wrote:
> On 8/10/23 10:48, linan666@...weicloud.com wrote:
>> From: Li Nan <linan122@...wei.com>
>>
>
> Please explain the problem first instead of starting with a function call
> timeline which cannot be analyzed without explanations.
>
>> interrupt                            scsi_eh
>>
>> ahci_error_intr
>>   =>ata_port_freeze
>>     =>__ata_port_freeze
>>       =>ahci_freeze (turn IRQ off)
>>     =>ata_port_abort
>>       =>ata_port_schedule_eh
>>         =>shost->host_eh_scheduled++;
>>         host_eh_scheduled = 1
>>                                      scsi_error_handler
>>                                        =>ata_scsi_error
>>                                          =>ata_scsi_port_error_handler
>>                                            =>ahci_error_handler
>>                                            . =>sata_pmp_error_handler
>>                                            .   =>ata_eh_thaw_port
>>                                            .     =>ahci_thaw (turn IRQ on)
>> ahci_error_intr                            .
>>   =>ata_port_freeze                        .
>>     =>__ata_port_freeze                    .
>>       =>ahci_freeze (turn IRQ off)         .
>>     =>ata_port_abort                       .
>>       =>ata_port_schedule_eh               .
>>         =>shost->host_eh_scheduled++;      .
>>         host_eh_scheduled = 2              .
>>                                            =>ata_std_end_eh
>>                                              =>host->host_eh_scheduled = 0;
>>
>> 'host_eh_scheduled' is now 0, so the SCSI EH thread will not be scheduled
>> again, and the ATA port remains frozen and will never be re-enabled.
>>
>> If the EH thread is already running, there is no need to freeze the port
>> and schedule EH again.
>>
>> Reported-by: luojian <luojian5@...wei.com>
>> Signed-off-by: Li Nan <linan122@...wei.com>
>> ---
>> drivers/ata/libahci.c | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
>> index e2bacedf28ef..0dfb0b807324 100644
>> --- a/drivers/ata/libahci.c
>> +++ b/drivers/ata/libahci.c
>> @@ -1840,9 +1840,17 @@ static void ahci_error_intr(struct ata_port *ap, u32 irq_stat)
>>
>>          /* okay, let's hand over to EH */
>>
>> -        if (irq_stat & PORT_IRQ_FREEZE)
>> +        if (irq_stat & PORT_IRQ_FREEZE) {
>> +                /*
>> +                 * EH already running, this may happen if the port is
>> +                 * thawed in the EH. But we cannot freeze it again
>> +                 * otherwise the port will never be thawed.
>> +                 */
>> +                if (ap->pflags & (ATA_PFLAG_EH_PENDING |
>> +                                  ATA_PFLAG_EH_IN_PROGRESS))
>> +                        return;
>
> This is definitely not correct because EH may have been scheduled for a
> non-fatal action like a device revalidate or to get sense data for successful
> commands. With this change, the port will NOT be frozen when a hard error IRQ
> comes while EH is waiting to start, that is, while EH waits for all commands to
> complete first.
>
Yeah, we should find a better way to fix it. Do you have any suggestions?
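
To make the objection above concrete, here is a minimal user-space sketch of
the case Damien describes. It is not kernel code: the toy_* names and the
two-field port structure are hypothetical, and the flags only stand in for
ATA_PFLAG_EH_PENDING and ATA_PFLAG_EH_IN_PROGRESS. It assumes a non-fatal EH
request marks EH as pending without freezing the port.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for ATA_PFLAG_EH_PENDING / ATA_PFLAG_EH_IN_PROGRESS. */
#define TOY_EH_PENDING     (1u << 0)
#define TOY_EH_IN_PROGRESS (1u << 1)

/* Toy model of the port state that matters for this scenario. */
struct toy_port {
	unsigned int pflags;
	bool frozen;
};

/* Assumption: a non-fatal EH request marks EH pending but does not freeze. */
static void toy_schedule_nonfatal_eh(struct toy_port *ap)
{
	ap->pflags |= TOY_EH_PENDING;
}

/* Freeze path of the error interrupt with the proposed early return. */
static void toy_patched_freeze_irq(struct toy_port *ap)
{
	if (ap->pflags & (TOY_EH_PENDING | TOY_EH_IN_PROGRESS))
		return;			/* the check added by the patch */

	ap->frozen = true;		/* models ata_port_freeze() */
	ap->pflags |= TOY_EH_PENDING;
}

/* Returns whether the port ends up frozen after a fatal IRQ. */
static bool toy_fatal_irq_while_eh_pending(void)
{
	struct toy_port ap = { 0, false };

	toy_schedule_nonfatal_eh(&ap);	/* e.g. a revalidation request */
	assert(ap.pflags & TOY_EH_PENDING);
	toy_patched_freeze_irq(&ap);	/* hard error arrives before EH starts */

	return ap.frozen;
}
```

In this model toy_fatal_irq_while_eh_pending() returns false: the hard error
is seen while EH is only pending, and the port is left unfrozen.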
> Furthermore, if you get an IRQ that requires the port to be frozen, it means
> that you had a failed command. In that case, the drive is in error state per
> ATA specs and stops all communication until a read log 10h command is issued.
> So you should never ever see 2 error IRQs one after the other. If you do, it
> very likely means that you have buggy hardware.
>
> How do you get into this situation ? What adapter and disk are you using ?
>
> How do you get into this situation ?
The first IRQ is an I/O error; the second IRQ is caused by a momentary
break of the disk link (link flap).
> What adapter and disk are you using ?
It is a disk developed by our company, but we think the same issue
exists when using other disks.
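
The two-IRQ sequence described above, combined with the timeline from the
commit message, can be replayed in a small user-space model. This is only a
sketch: the toy_* names are hypothetical, the struct keeps just a counter that
stands in for shost->host_eh_scheduled and a frozen flag, and locking and
everything else libata does is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the fields only mirror the names used in this thread. */
struct toy_host {
	int host_eh_scheduled;	/* stands in for shost->host_eh_scheduled */
	bool port_frozen;	/* true while the port IRQ is turned off */
};

/* Models ahci_error_intr() taking the PORT_IRQ_FREEZE path. */
static void toy_error_intr(struct toy_host *h)
{
	h->port_frozen = true;		/* ahci_freeze: turn IRQ off */
	h->host_eh_scheduled++;		/* ata_port_schedule_eh */
}

/* Models ata_eh_thaw_port(): EH turns the port IRQ back on. */
static void toy_eh_thaw(struct toy_host *h)
{
	h->port_frozen = false;
}

/* Models ata_std_end_eh(): EH resets the counter when it finishes. */
static void toy_end_eh(struct toy_host *h)
{
	h->host_eh_scheduled = 0;
}

/* Replays the timeline; true means "frozen and no EH run scheduled". */
static bool toy_replay_race(void)
{
	struct toy_host h = { 0, false };

	toy_error_intr(&h);	/* first IRQ (I/O error): counter becomes 1 */
	assert(h.host_eh_scheduled == 1);
	toy_eh_thaw(&h);	/* EH thread is running and thaws the port */
	toy_error_intr(&h);	/* second IRQ (link event): counter becomes 2 */
	toy_end_eh(&h);		/* EH finishes: counter reset to 0 */

	return h.port_frozen && h.host_eh_scheduled == 0;
}
```

In this model toy_replay_race() returns true: the second freeze happens after
the thaw, and resetting the counter discards the second EH request.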
>>                  ata_port_freeze(ap);
>> -        else if (fbs_need_dec) {
>> +        } else if (fbs_need_dec) {
>>                  ata_link_abort(link);
>>                  ahci_fbs_dec_intr(ap);
>>          } else
>
--
Thanks,
Nan