Message-ID: <4AB239C8.2020203@rtr.ca>
Date:	Thu, 17 Sep 2009 09:29:44 -0400
From:	Mark Lord <liml@....ca>
To:	Chris Webb <chris@...chsys.com>
Cc:	Tejun Heo <teheo@...e.de>, linux-scsi@...r.kernel.org,
	Ric Wheeler <rwheeler@...hat.com>,
	Andrei Tanas <andrei@...as.ca>, NeilBrown <neilb@...e.de>,
	linux-kernel@...r.kernel.org,
	IDE/ATA development list <linux-ide@...r.kernel.org>,
	Jeff Garzik <jgarzik@...hat.com>, Mark Lord <mlord@...ox.com>
Subject: Re: MD/RAID time out writing superblock

Chris Webb wrote:
> Mark Lord <liml@....ca> writes:
> 
>> I suspect we're missing some info from this specific failure.
>> Looking back at Chris's earlier posting, the whole thing started
>> with a FLUSH_CACHE_EXT failure.  Once that happens, all bets are
>> off on anything that follows.
>>
>>> Everything will be running fine when suddenly:
>>>
>>>  ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>  ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>>          res 40/00:00:80:17:91/00:00:37:00:00/40 Emask 0x4 (timeout)
>>>  ata1.00: status: { DRDY }
>>>  ata1: hard resetting link
>>>  ata1: softreset failed (device not ready)
>>>  ata1: hard resetting link
>>>  ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>  ata1.00: configured for UDMA/133
>>>  ata1: EH complete
>>>  end_request: I/O error, dev sda, sector 1465147272
>>>  md: super_written gets error=-5, uptodate=0
>>>  raid10: Disk failure on sda3, disabling device.
>>>  raid10: Operation continuing on 5 devices.
> 
> Hi Mark. Yes, when the first timeout after a clean boot happens, it's with
> an 0xea flush command every time:
..

Yes.  Is this still happening from time to time now?
If so, disable the smartmontools daemon (smartd) and see if the problem goes away.
In particular, disable hddtemp (which issues SMART commands) if that is also running.

It would be good to discover if those are the triggers for what's happening here.

Tejun, do we issue a FLUSH CACHE before issuing a non-NCQ command?
If not, then I think we may need to add code to do it.
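
For anyone scanning their own logs for the same pattern, here is a rough sketch of how the "cmd" line in the quoted libata output could be decoded. It assumes the first two hex digits after "cmd" are the ATA command opcode, as in the log above (0xea being FLUSH CACHE EXT). The opcode table is a small hand-picked subset, and the helper name decode_cmd_line is made up for illustration; it is not part of any kernel or smartmontools tool.

```python
import re

# Small, non-exhaustive subset of ATA command opcodes (illustrative only).
ATA_OPCODES = {
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xB0: "SMART",
    0x60: "READ FPDMA QUEUED",   # NCQ read
    0x61: "WRITE FPDMA QUEUED",  # NCQ write
    0xEC: "IDENTIFY DEVICE",
}

# Matches e.g. "ata1.00: cmd ea/00:00:00:00:00/..." and captures port, device, opcode.
CMD_RE = re.compile(r"ata(\d+)\.(\d+): cmd ([0-9a-fA-F]{2})/")


def decode_cmd_line(line):
    """Return (device, opcode, name) for a libata 'cmd' log line, or None."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    opcode = int(m.group(3), 16)
    device = f"ata{m.group(1)}.{m.group(2)}"
    return device, opcode, ATA_OPCODES.get(opcode, "unknown")


if __name__ == "__main__":
    sample = " ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"
    print(decode_cmd_line(sample))
```

Running this over the kernel log and counting opcodes per device would show whether the timeouts are always on flushes (0xea) or also on SMART (0xb0) commands.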


Cheers