[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZrPw5m9LwMH5NQYy@x1-carbon.lan>
Date: Thu, 8 Aug 2024 00:10:46 +0200
From: Niklas Cassel <cassel@...nel.org>
To: Damien Le Moal <dlemoal@...nel.org>
Cc: Christian Heusel <christian@...sel.eu>,
Igor Pylypiv <ipylypiv@...gle.com>, linux-ide@...r.kernel.org,
Hannes Reinecke <hare@...e.de>, regressions@...ts.linux.dev,
stable@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [REGRESSION][BISECTED][STABLE] hdparm errors since 28ab9769117c
On Wed, Aug 07, 2024 at 11:26:46AM -0700, Damien Le Moal wrote:
> On 2024/08/07 10:23, Christian Heusel wrote:
> > Hello Igor, hello Niklas,
> >
> > on my NAS I am encountering the following issue since v6.6.44 (LTS),
> > when executing the hdparm command for my WD-WCC7K4NLX884 drives to get
> > the active or standby state:
> >
> > $ hdparm -C /dev/sda
> > /dev/sda:
> > SG_IO: bad/missing sense data, sb[]: f0 00 01 00 50 40 ff 0a 00 00 78 00 00 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > drive state is: unknown
> >
> >
> > While the expected output is the following:
> >
> > $ hdparm -C /dev/sda
> > /dev/sda:
> > drive state is: active/idle
> >
> > I did a bisection within the stable series and found the following
> > commit to be the first bad one:
> >
> > 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and no error")
> >
> > According to kernel.dance the same commit was also backported to the
> > v6.10.3 and v6.1.103 stable kernels and I could not find any commit or
> > pending patch with a "Fixes:" tag for the offending commit.
> >
> > So far I have not been able to test with the mainline kernel as this is
> > a remote device which I couldn't rescue in case of a boot failure. Also
> > just for transparency it does have the out of tree ZFS module loaded,
> > but AFAIU this shouldn't be an issue here, as the commit seems clearly
> > related to the error. If needed I can test with an untainted mainline
> > kernel on Friday when I'm near the device.
> >
> > I have attached the output of hdparm -I below and would be happy to
> > provide further debug information or test patches.
>
> I confirm this, using 6.11-rc2. The problem is actually hdparm code which
> assumes that the sense data is in descriptor format without ever looking at the
> D_SENSE bit to verify that. So commit 28ab9769117c reveals this issue because as
> its title explains, it (correctly) honors D_SENSE instead of always generating
> sense data in descriptor format.
You mean: the user space application is using the sense buffer without first
checking if the returned sense buffer is in descriptor or fixed format.
This seems like a fundamentally flawed assumption by the user space program.
If it doesn't even bother checking the first field in the sense buffer, sb[0],
perhaps it shouldn't bother trying to use the sense buffer at all.
(Yes, the D_SENSE bit can be configured by the user, but that doesn't change
the fact that a user space program must check the format of the returned buffer
before trying to use it.)
> Hmm... This is annoying. The kernel is fixed to be spec compliant but that
> breaks old/non-compliant applications... We definitely should fix hdparm code,
> but I think we still need to revert 28ab9769117c...
Well.. if we look at commit:
11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense")
https://github.com/torvalds/linux/commit/11093cb1ef56147fe33f5750b1eab347bdef30db
We can see that before that commit, the kernel used to call
ata_scsi_set_sense().
Back then ata_scsi_set_sense() was defined as:
https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db/drivers/ata/libata-scsi.c#L280
scsi_build_sense_buffer(0, cmd->sense_buffer, sk, asc, ascq);
Where the first argument to scsi_build_sense_buffer() is if the generated sense
buffer should be fixed or desc format (0 == fixed format), so we used to
generate the sense buffer in fixed format:
https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db/drivers/scsi/scsi_common.c#L231
However, as we can see, the kernel then used to incorrectly just
change sb[0} to say that the buffer was in desc format,
without updating the other fields, e.g. sb[2]:
https://github.com/torvalds/linux/blob/11093cb1ef56147fe33f5750b1eab347bdef30db~/drivers/ata/libata-scsi.c#L1026
so the format was really in some franken format...
following neither fixed or descriptor format.
11093cb1ef56 ("libata-scsi: generate correct ATA pass-through sense")
did change so that successful ATA-passthrough commands always generated
the sense data in descriptor format. However, that commit also managed to
mess up the offsets for fixed format sense...
The commit that later changed ata_scsi_set_sense() to honor D_SENSE
was commit: 06dbde5f3a44 ("libata: Implement control mode page to select
sense format")
So basically:
Before commit 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through
sense"), we generated sense data in some franken format for both successful
and failed ATA-passthrough commands.
After commit 11093cb1ef56 ("libata-scsi: generate correct ATA pass-through
sense") we generate sense data for sucessful ATA-passthrough commands in
descriptor format unconditionally, but still in franken format for failed
ATA-passthrough commands.
After commit 06dbde5f3a44 ("libata: Implement control mode page to select
sense format") we generate sense data for sucessful ATA-passthrough commands
in descriptor format unconditionally, but for failed commands we actually
honored D_SENSE to generate it either in fixed format or descriptor format.
(However, because of a bug in 11093cb1ef56, if using fixed format, the
offsets were wrong...)
The incorrect offsets for fixed format was fixed recently, in commit
38dab832c3f4 ("ata: libata-scsi: Fix offsets for the fixed format sense data")
Commit 28ab9769117c ("ata: libata-scsi: Honor the D_SENSE bit for CK_COND=1 and
no error") fixed so that we actually honor D_SENSE not only for failed
ATA-passthrough commands, but also for successfull ATA-passthrough commands.
TL;DR: it is very hard to say that we have introduced a regression, because
this crap has basically been broken in one way or another since it was
introduced... Personally, I would definitely want all the patches that are in
mainline in the kernel running on my machine, since that is the only thing
that is consistent.
However, that assumes that user space programs that are trying to parse the
sense data actually bothers to check the first field in the sense data,
to see which format the returned sense data is in... Applications that
do not even both with that will have problems on a lot of (historic) kernel
versions.
Kind regards,
Niklas
Powered by blists - more mailing lists