linux-kernel - Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep

Message-ID: <b0ed86e0-3e4a-d4d1-7b9d-c57f20538a80@gmail.com>
Date:   Thu, 27 Jul 2023 04:06:33 -0600
From:   TW <dalzot@...il.com>
To:     Damien Le Moal <dlemoal@...nel.org>,
        Thorsten Leemhuis <regressions@...mhuis.info>
Cc:     regressions@...ts.linux.dev,
        Mario Limonciello <mario.limonciello@....com>,
        Bart Van Assche <bvanassche@....org>,
        LKML <linux-kernel@...r.kernel.org>, stable@...r.kernel.org
Subject: Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep

I retried on 6.5-rc3 without the Nvidia drivers and still got the same 
error. I then tried to apply the patch, but got a malformed patch error 
at line 6 for the first part, the one that touches libata-scsi.c. The 
other two parts seem to apply just fine, however.

Also, the Bugzilla report is similar to what I see, but in my case the 
disk doesn't disappear: it comes back, it just takes a while to come out 
of sleep mode.

On 7/26/23 17:39, Damien Le Moal wrote:
> On 7/26/23 22:47, Thorsten Leemhuis wrote:
>> Hi, Thorsten here, the Linux kernel's regression tracker.
>>
>> On 26.07.23 13:54, TW wrote:
>>> I have been having issues with the 6.x series of kernels resuming from
>>> suspend with one of my drives. As far as I can tell, it has trouble with
>>> the cache on the drive when coming out of S3 sleep. I tried a few
>>> different distros (Manjaro, OpenMandriva Rome, EndeavourOS), and all of
>>> them give the same error message. It appears to work fine on the 5.15
>>> kernel, however.
>>>
>>> These are the errors that I have been getting, and I assume they are
>>> what is holding up the system when it resumes from suspend.
>>>
>>> Jul 20 04:13:41 rageworks kernel: ata10.00: device reported invalid CHS sector 0
>>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> This sense is garbage. This issue was reported already, but it is hard
> to deal with as it seems to be due to drives/adapters not correctly
> reporting status bits. So for now, let's ignore these sense codes.
>
> The start/stop unit failure is weird. In another case, I suspect that
> this command is causing a delay on resume, but not an error like this.
>
>>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: failed to resume async: error -5
>> Thx for your report. I CCed a few people, with a bit of luck they have
>> an idea. But I doubt it. If no one replies you likely will need a
>> bisection to find the root of the problem. But before going down that
>> route you want to check if latest mainline kernel (vanilla!) works better.
>>
>> FWIW, this is not my area of expertise, so the following might be a
>> misleading comment, but the problem looks somewhat similar to this one
>> that iirc was never solved:
>> https://bugzilla.kernel.org/show_bug.cgi?id=216087
>>
>>> Jul 20 04:12:51 rageworks systemd[1]: nvidia-suspend.service: Deactivated successfully.
>>> Jul 20 04:12:51 rageworks systemd[1]: Finished NVIDIA system suspend actions.
>>> Jul 20 04:12:51 rageworks systemd[1]: Starting System Suspend...
>> That sounds like you are using out-of-tree drivers, which can cause all
>> sorts of issues. Please recheck whether the problem happens without
>> those as well, and do not use them in any further tests to debug the
>> issue.
> Yes. Please retest with the latest 6.5-rc3.
>
> And can you try this patch to see if it solves your issue?
>
> commit 29e81d11812ee924d19425343ec69acd34af9d35
> Author: Damien Le Moal <dlemoal@...nel.org>
> Date:   Mon Jul 24 13:23:14 2023 +0900
>
>      ata,scsi: do not issue START STOP UNIT on resume
>
>      Signed-off-by: Damien Le Moal <dlemoal@...nel.org>
>
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 370d18aca71e..6184c7bcc16c 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1100,7 +1100,13 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
>   		}
>   	} else {
>   		sdev->sector_size = ata_id_logical_sector_size(dev->id);
> +		/*
> +		 * Stop the drive on suspend but do not issue START STOP UNIT
> +		 * on resume as this is not necessary: the port is reset on
> +		 * resume, which wakes up the drive.
> +		 */
>   		sdev->manage_start_stop = 1;
> +		sdev->no_start_on_resume = 1;
>   	}
>
>   	/*
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 68b12afa0721..b8584fe3123e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
>   static int sd_resume(struct device *dev)
>   {
>   	struct scsi_disk *sdkp = dev_get_drvdata(dev);
> -	int ret;
> +	int ret = 0;
>
>   	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
>   		return 0;
> @@ -3885,7 +3885,8 @@ static int sd_resume(struct device *dev)
>   		return 0;
>
>   	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
> -	ret = sd_start_stop_device(sdkp, 1);
> +	if (!sdkp->device->no_start_on_resume)
> +		ret = sd_start_stop_device(sdkp, 1);
>   	if (!ret)
>   		opal_unlock_from_suspend(sdkp->opal_dev);
>   	return ret;
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 75b2235b99e2..b9230b6add04 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -194,6 +194,7 @@ struct scsi_device {
>   	unsigned no_start_on_add:1;	/* do not issue start on add */
>   	unsigned allow_restart:1; /* issue START_UNIT in error handler */
>   	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
> +	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
>   	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
>   	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
>   	unsigned select_no_atn:1;
>
>
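
For readers following the thread, here is a minimal userspace sketch of what the quoted patch changes in the resume path. It is not kernel code: the `model_*` names are hypothetical stand-ins, and the START STOP UNIT failure is hard-coded to -EIO (the -5 in the log above) purely for illustration. Only the two flag names are taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct scsi_device; only the two flags
 * that matter for the quoted patch are modeled. */
struct model_sdev {
	unsigned manage_start_stop:1;	/* sd manages start/stop */
	unsigned no_start_on_resume:1;	/* flag added by the patch */
};

/* Assumption for illustration: the drive always rejects START STOP UNIT
 * on resume, as in the report. EIO is 5 on Linux, hence "returns -5". */
int model_start_unit(void)
{
	return -EIO;
}

/* Mirrors the control flow of the patched sd_resume(): the start
 * command is skipped when no_start_on_resume is set, so ret stays 0. */
int model_sd_resume(const struct model_sdev *sdev)
{
	int ret = 0;

	if (!sdev->manage_start_stop)
		return 0;
	if (!sdev->no_start_on_resume)
		ret = model_start_unit();
	return ret;
}

/* Small self-check covering the behavior before and after the patch. */
void model_self_check(void)
{
	struct model_sdev before = { .manage_start_stop = 1, .no_start_on_resume = 0 };
	struct model_sdev after = { .manage_start_stop = 1, .no_start_on_resume = 1 };

	assert(model_sd_resume(&before) == -EIO);	/* error propagates to PM core */
	assert(model_sd_resume(&after) == 0);		/* start skipped, resume succeeds */
}
```

In this model, calling `model_self_check()` from a `main()` exercises both cases. Whether skipping the command also removes the resume delay on real hardware is exactly what the requested test is meant to show; the sketch says nothing about that.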