linux-kernel - Re: ATA resets with Intel 8/C220 and HGST drive

Date:	Wed, 29 Apr 2015 22:59:58 +0300
From:	Selim T. Erdoğan <selim@...mni.cs.utexas.edu>
To:	Nicolas George <george@...p.org>
Cc:	debian-user@...ts.debian.org, linux-kernel@...r.kernel.org
Subject: Re: ATA resets with Intel 8/C220 and HGST drive

On Mon, Apr 27, 2015 at 03:03:58PM +0200, Nicolas George wrote:
> Summary: I had annoying resets of the SATA bus with an 8 Series/C220 Series
> Chipset controller and an HGST Travelstar 7K1000 drive. I recently managed to
> stop them and as far as I currently know I am satisfied; I write this mail
> in the hope that it may be useful for anyone having similar issues. If you
> do not have that issue and you are not a developer interested in fixing the
> issue more permanently, you can stop reading right now.
> 
> Here are the details. The computer is a Zotac ZBox ID91 nettop with a
> proprietary motherboard, and, as stated above, a Travelstar 7K1000 hard
> drive (a 7200 RPM 2.5", an unusual beast). It was installed around June
> 2014, and I noticed the problems some time later; they probably started
> right away.
> 
> The distribution was a Debian Jessie (testing) with the packaged kernel,
> probably linux-image-3.14-1-amd64:amd64 at the time; the issue was not fixed
> by upgrades.
> 
> The possibly relevant hardware information is this:
> 
> CPU:
> product: Intel(R) Core(TM) i3-4130T CPU @ 2.90GHz
> 
> description: SATA controller
> product: 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
> vendor: Intel Corporation
> physical id: 1f.2
> bus info: pci@...0:00:1f.2
> version: 05
> width: 32 bits
> clock: 66MHz
> capabilities: storage msi pm ahci_1.0 bus_master cap_list
> configuration: driver=ahci latency=0
> resources: irq:42 ioport:f0b0(size=8) ioport:f0a0(size=4) ioport:f090(size=8) ioport:f080(size=4) ioport:f060(size=32) memory:f7d1a000-f7d1a7ff
> 
> description: ATA Disk
> product: HGST HTS721010A9
> physical id: 0.0.0
> bus info: scsi@1:0.0.0
> logical name: /dev/sda
> version: A3J0
> serial: [REMOVED]
> size: 931GiB (1TB)
> capabilities: partitioned partitioned:dos
> configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096 signature=d3079a6d
> 
> The resets happened a few times a day (this computer is kept on for more
> than a day and suspend is not used), mostly when the disk was in heavy use,
> sometimes as early as during the boot; there were a few good days when they
> did not happen. They were annoying because they caused a freeze of a few
> seconds for anything reading from disk; AFAIK they never resulted in data
> corruption.
> 
> The corresponding kernel messages look like this:
> 
> [  337.466498] ata2: EH complete
> [  367.251032] ata2.00: exception Emask 0x10 SAct 0x80000 SErr 0x400100 action 0x6 frozen
> [  367.251041] ata2.00: irq_stat 0x08000000, interface fatal error
> [  367.251046] ata2: SError: { UnrecovData Handshk }
> [  367.251053] ata2.00: failed command: WRITE FPDMA QUEUED
> [  367.251063] ata2.00: cmd 61/08:98:68:3b:40/00:00:6b:00:00/40 tag 19 ncq 4096 out
> [  367.251063]          res 50/00:08:68:3b:40/00:00:6b:00:00/40 Emask 0x10 (ATA bus error)
> [  367.251068] ata2.00: status: { DRDY }
> [  367.251075] ata2: hard resetting link
> [  367.571128] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  367.577660] ata2.00: configured for UDMA/133
> [  367.577676] ata2: EH complete
> [  409.772730] ata2: limiting SATA link speed to 3.0 Gbps
> [  409.772735] ata2.00: exception Emask 0x10 SAct 0x3fe00 SErr 0x400100 action 0x6 frozen
> [  409.772736] ata2.00: irq_stat 0x08000000, interface fatal error
> [  409.772737] ata2: SError: { UnrecovData Handshk }
> [  409.772739] ata2.00: failed command: READ FPDMA QUEUED
> [  409.772742] ata2.00: cmd 60/08:48:78:09:41/00:00:01:00:00/40 tag 9 ncq 4096 in
> [  409.772742]          res 50/00:28:e0:a3:04/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
> [  409.772743] ata2.00: status: { DRDY }
> <snip seven similar "failed command...DRDY" blocks>
> [  409.772773] ata2.00: failed command: WRITE FPDMA QUEUED
> [  409.772776] ata2.00: cmd 61/28:88:e0:a3:04/00:00:02:00:00/40 tag 17 ncq 20480 out
> [  409.772776]          res 50/00:28:e0:a3:04/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
> [  409.772777] ata2.00: status: { DRDY }
> [  409.772779] ata2: hard resetting link
> [  410.092732] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
> [  410.097670] ata2.00: configured for UDMA/133
> 
> Last week, hinted by the penultimate line, I tried to lower the speed of the
> SATA link permanently, and it worked. I did this by adding
> "libata.force=2:3.0Gbps" to the kernel command line (configured using
> /etc/default/grub).
> 
> Since then, no reset happened; I am confident that seven days without them
> are not a coincidence.
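
For anyone who wants to check their own logs for the same pattern, here is a
minimal sketch (not a tested diagnostic tool) that counts hard resets per port
and builds a candidate libata.force value from the speed the kernel fell back
to. The regular expressions are assumptions based only on the messages quoted
above, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Patterns modeled on the libata messages quoted above. Other kernel
# versions may word these lines differently (assumption).
HARD_RESET = re.compile(r"\b(ata\d+): hard resetting link")
LIMITING = re.compile(r"\b(ata\d+): limiting SATA link speed to ([\d.]+) Gbps")
LINK_UP = re.compile(r"\b(ata\d+): SATA link up ([\d.]+) Gbps")


def summarize_resets(log_text):
    """Return {port: {"resets": int, "speeds": [str], "limited_to": str or None}}."""
    ports = defaultdict(lambda: {"resets": 0, "speeds": [], "limited_to": None})
    for line in log_text.splitlines():
        m = HARD_RESET.search(line)
        if m:
            ports[m.group(1)]["resets"] += 1
            continue
        m = LIMITING.search(line)
        if m:
            ports[m.group(1)]["limited_to"] = m.group(2)
            continue
        m = LINK_UP.search(line)
        if m:
            # Record every negotiated speed in the order it was logged.
            ports[m.group(1)]["speeds"].append(m.group(2))
    return dict(ports)


def suggest_force_option(port, speed):
    """Build a value such as 'libata.force=2:3.0Gbps' from 'ata2' and '3.0'."""
    return "libata.force=%s:%sGbps" % (port[len("ata"):], speed)


# A few lines taken from the log quoted above.
SAMPLE = """\
[  367.251075] ata2: hard resetting link
[  367.571128] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  409.772730] ata2: limiting SATA link speed to 3.0 Gbps
[  409.772779] ata2: hard resetting link
[  410.092732] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
"""

if __name__ == "__main__":
    for port, info in sorted(summarize_resets(SAMPLE).items()):
        print(port, info)
        if info["limited_to"]:
            print("  candidate:", suggest_force_option(port, info["limited_to"]))
```

To run it on a real machine, pass the output of dmesg (or the relevant part of
the kernel log) to summarize_resets() instead of SAMPLE. The suggested value
only mirrors what the kernel already fell back to; whether forcing it
permanently helps is something to test per machine.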

I had a similar experience with a Sony Vaio VGN-NS140 laptop (from 2008)
when its hard drive died a few years ago.  

The replacement drives (new or used) that I tried would work for a 
little while, usually long enough to install Debian, but would get 
corrupted within a few hours.  I would see messages like yours above, 
about going to a lower SATA speed (from 3.0Gbps to 1.5Gbps in my case), 
but that wouldn't keep the drive from getting corrupted.  (Maybe it 
was trying to auto-negotiate back to a higher speed; I don't remember.)  
I finally solved it like you, by permanently setting the libata.force 
option to 1.5Gbps.  It worked, but the new replacement drive I had 
bought was an SSD, so I was a little unhappy I had to use it at the
lower speed.

In my case, the original hard drive that came out of the machine, a 
Seagate Momentus, had a jumper which set the maximum speed to 1.5Gbps.  
Presumably, Sony knew that the machine wasn't able to handle higher 
speeds or auto-negotiation of the speed, so they set that jumper.
However, the replacement drives I tried didn't have such speed-limiting 
options, so I had to set the limit with the libata.force kernel option 
instead.  (BTW, a few months ago I bought a used Thinkpad which came with 
a Seagate Momentus in it, so I was able to set the jumper and stick that 
drive in the Sony, freeing up my SSD for use in the Thinkpad, at its 
"unreduced" speed.)
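
Since it is easy to edit the boot loader configuration and forget to
regenerate it, a quick check that the option actually reached the running
kernel can be useful. The following is a rough sketch, assuming the documented
"[ID:]VAL" comma-separated format of libata.force and a plain
whitespace-separated command line (no quoting); the function names are
hypothetical.

```python
def parse_libata_force(cmdline):
    """Return (id, value) pairs found in libata.force=... on a command line.

    id is None when an entry has no "ID:" prefix, in which case it is not
    restricted to a particular port.
    """
    prefix = "libata.force="
    entries = []
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        for item in token[len(prefix):].split(","):
            if not item:
                continue
            ident, sep, value = item.partition(":")
            if sep:
                entries.append((ident, value))
            else:
                entries.append((None, item))
    return entries


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    # Example string only; use read_cmdline() on a real system.
    example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet libata.force=2:3.0Gbps"
    print(parse_libata_force(example))
```

If the list comes back empty after a reboot, the option did not make it onto
the kernel command line.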

> 
> As I said, I consider the issue closed from my point of view. If someone
> wants to investigate further (for example a kernel hacker to actually fix
> this, or a distro developer to make an automatic work-around), I can give
> some more details, and possibly run a few tests if they do not take much
> time and are not too risky.
> 
> Hope this helps.
> 
> Regards,
> 
> -- 
>   Nicolas George