linux-kernel - Nvidia MCP55 and WRITE FPDMA QUEUED failed commands

Message-ID: <4E0275D8.6000001@fuzzy.cz>
Date:	Thu, 23 Jun 2011 01:08:08 +0200
From:	Tomas Vondra <tv@...zy.cz>
To:	linux-kernel@...r.kernel.org
Subject: Nvidia MCP55 and WRITE FPDMA QUEUED failed commands

Hi all,

a few days ago I bought a new SSD (Intel 320), and it didn't take long
before I started getting batches of I/O errors like this:

ata6: EH in SWNCQ mode,QC:qc_active 0x7FFFFFFF sactive 0x7FFFFFFF
ata6: SWNCQ:qc_active 0x1E031 defer_bits 0x7FFE1FCE last_issue_tag 0x10
  dhfis 0xE031 dmafis 0x6010 sdbfis 0x0
ata6: ATA_REG 0x40 ERR_REG 0x0
ata6: tag : dhfis dmafis sdbfis sacitve
ata6: tag 0x0: 1 0 0 1
ata6: tag 0x4: 1 1 0 1
ata6: tag 0x5: 1 0 0 1
ata6: tag 0xd: 1 1 0 1
ata6: tag 0xe: 1 1 0 1
ata6: tag 0xf: 1 0 0 1
ata6: tag 0x10: 0 0 0 1
ata6.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
ata6.00: failed command: WRITE FPDMA QUEUED
ata6.00: cmd 61/10:00:10:d7:f0/00:00:05:00:00/40 tag 0 ncq 8192 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: nv: skipping hardreset on occupied port
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
ata6.00: device reported invalid CHS sector 0
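
For anyone reading the log above, here is a minimal sketch for decoding
it. It rests on two assumptions that are not stated in this post: that
the hex masks (qc_active, dhfis, dmafis) are per-tag bitmaps where bit N
stands for tag N, and that the "cmd" line uses libata's taskfile layout
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device,
with NCQ commands carrying the sector count in the feature bytes and the
tag in the upper bits of nsect. The function names are made up for
illustration.

```python
import re


def tags_from_mask(mask):
    """Return the tag numbers whose bits are set in a 32-bit tag bitmap."""
    return [tag for tag in range(32) if (mask >> tag) & 1]


def decode_ncq_cmd(line):
    """Decode a libata 'cmd' line for an NCQ command (assumed layout)."""
    fields = re.search(r"cmd\s+(\S+)", line).group(1)
    groups = [[int(b, 16) for b in g.split(":")] for g in fields.split("/")]
    command = groups[0][0]
    feature, nsect, lbal, lbam, lbah = groups[1]
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = groups[2]
    # For NCQ the sector count lives in the feature registers.
    sectors = (hob_feature << 8) | feature
    # 48-bit LBA assembled from the low and high-order-byte registers.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": command,
        "tag": nsect >> 3,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumes 512-byte logical sectors
        "lba": lba,
    }


if __name__ == "__main__":
    print("qc_active:", tags_from_mask(0x1E031))
    print("dhfis:    ", tags_from_mask(0xE031))
    print("dmafis:   ", tags_from_mask(0x6010))
    print(decode_ncq_cmd(
        "ata6.00: cmd 61/10:00:10:d7:f0/00:00:05:00:00/40 tag 0 ncq 8192 out"
    ))
```

Under those assumptions the qc_active mask yields the same seven tags
that the per-tag table lists, and the decoded transfer size agrees with
the "ncq 8192" that the kernel printed.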

The machine just freezes for a few seconds and then everything works
fine again, until the next batch of errors. Sometimes that takes a few
minutes, sometimes a whole day.

The filesystem does not seem to be corrupted (fsck finds no problems)
and everything else appears to be OK.

The full dmesg output (including the errors) is available here:

  http://pastebin.com/uHvTVmss

I've been searching for possible causes / fixes, but no matter what I do
I still occasionally get those I/O errors :-(

It seems to be somehow related to the controller on my mobo - I'm using
Asus M2N-e with Nvidia MCP55, and I've found this:

  http://marc.info/?l=linux-kernel&m=126847285022959&w=2

which describes a similar issue (same failed command, a slightly
different result). I've been using this mobo for a few years and
everything worked just fine until now (OK, I got a few panics, but in
all cases it was my own stupid fault). I've swapped in various HDDs from
various vendors without a single problem.

The post mentions the problems may be related to SMART. I'm not sure how
to confirm or refute this, and I'm used to Intel products working fine
most of the time. OTOH, after running a long self-test, smartctl
reports this (full output: http://pastebin.com/DwJfxdTK):

SMART Self-test log structure revision number 1
Num  Test_Description Status                  Remaining  ...
  1  Vendor (0x78)    Completed without error 150%       ...

That seems a bit fishy, of course. 150%? And how could it be already
completed when there's still 150% remaining?
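
One possible reading of the 150% figure, offered as an assumption rather
than a diagnosis: in the ATA self-test log each entry carries a status
byte whose upper four bits hold the result code and whose lower four
bits hold the remaining work in tenths. A tool that prints the low
nibble times ten would show 150% for a nibble of 0xF, even with result
code 0 ("completed without error"). Whether the drive really returned
such a byte cannot be told from the output quoted here. A minimal sketch
of that decoding, with a made-up function name:

```python
def decode_selftest_status(status_byte):
    """Split a self-test status byte into (result_code, remaining_percent).

    Assumed layout: bits 7-4 = result code, bits 3-0 = remaining / 10.
    """
    result_code = status_byte >> 4
    remaining_percent = (status_byte & 0x0F) * 10
    return result_code, remaining_percent


if __name__ == "__main__":
    # Hypothetical byte: result code 0 with an out-of-range low nibble.
    print(decode_selftest_status(0x0F))
```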

What I've tried till today:

  1) flashed BIOS to a recent version

  2) switched from reiserfs 3.6 to ext4

  3) disabled NCQ (libata.force=noncq kernel parameter)

  4) set DMA queue depth to 1 (hdparm -Q 1 /dev/sdb)

  5) upgraded from 2.6.36.1 to 2.6.38

None of those helped :-(
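
To double-check that steps 3) and 4) actually took effect, something
like the following sketch could be used. It assumes the usual Linux
locations (/proc/cmdline for the boot parameters and
/sys/block/<disk>/device/queue_depth for the queue depth in use); the
function names and the sysfs_root parameter are made up for
illustration.

```python
from pathlib import Path


def libata_force_options(cmdline):
    """Return the values passed via libata.force= on a kernel command line."""
    options = []
    for token in cmdline.split():
        if token.startswith("libata.force="):
            options.extend(token.split("=", 1)[1].split(","))
    return options


def queue_depth(disk, sysfs_root="/sys"):
    """Read the queue depth the kernel reports for a disk such as 'sdb'."""
    path = Path(sysfs_root) / "block" / disk / "device" / "queue_depth"
    return int(path.read_text().strip())


if __name__ == "__main__":
    print(libata_force_options(Path("/proc/cmdline").read_text()))
    print(queue_depth("sdb"))  # a depth of 1 means no commands are queued
```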

Any ideas how to solve these issues? Are they "just" timing errors
(i.e. the data is actually written but the drive does not report
completion), or is there a danger of corruption?

A bit more (possibly useful) info:

.config http://pastebin.com/PYeLKaBL
lspci output : http://pastebin.com/nQPS0rxU

regards
Tomas