Message-ID: <47EEE4BF.5080609@gmail.com>
Date: Sun, 30 Mar 2008 09:54:23 +0900
From: Tejun Heo <htejun@...il.com>
To: Hans-Peter Jansen <hpj@...la.net>
CC: Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, linux-ide@...r.kernel.org
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?
Hello,
Hans-Peter Jansen wrote:
>>>> Should I be worried? smartd doesn't show anything suspicious on those.
>> Can you please post the result of "smartctl -a /dev/sdX"?
>
> Here's the last smart report from two of the offending drives. As noted
> before, I did the hardware reorganization, replaced the dog slow 3ware
> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives
> for now, but a nephew already showed interest. What do you think, can I
> cede those drives with a clear conscience? The Hardware_ECC_Recovered
> values are really worrisome, aren't they?
Different vendors use different scales for the raw values. The
normalized value is still pegged at the highest, so it could be that
those raw values are okay or that the vendor just doesn't update the
value field accordingly. My P120 says 0 for the raw value and 904635
for hardware ECC recovered, so there is some difference. What do other
non-failing drives say about those values?
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       82
>   3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17647
>  10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
> 190 Airflow_Temperature_Cel 0x0022   124   124   000    Old_age   Always       -       38
> 194 Temperature_Celsius     0x0022   124   124   000    Old_age   Always       -       38
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
> 202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0
Hmmm... If the drive is failing FLUSHs, I would expect to see elevated
reallocation counters and maybe some pending counts. Aieee.. weird.
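A minimal sketch of that check in Python, assuming the attribute table
is in the smartctl layout quoted above. The names parse_attributes,
suspicious and WATCHED_RAW are hypothetical, and the watch list is an
illustrative choice, not something a vendor specifies. Raw values of
other attributes are deliberately ignored because their scale is
vendor specific.

```python
import re

# Assumption: raw counters worth watching when writes to media are suspect.
WATCHED_RAW = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# One row of the smartctl attribute table.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return one dict per attribute row; non-matching lines are skipped."""
    rows = []
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") in front of each row.
        m = ROW.match(line.lstrip("> "))
        if not m:
            continue
        row = m.groupdict()
        for key in ("id", "value", "worst", "thresh", "raw"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def suspicious(rows):
    """List attributes at or below threshold, or with non-zero watched counters."""
    findings = []
    for r in rows:
        # A threshold of 000 means the attribute cannot trip, so skip it.
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            findings.append(f"{r['name']}: value {r['value']} <= thresh {r['thresh']}")
        if r["name"] in WATCHED_RAW and r["raw"] > 0:
            findings.append(f"{r['name']}: raw count {r['raw']}")
    return findings


SAMPLE = """\
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> 197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for finding in suspicious(parse_attributes(SAMPLE)):
        print(finding)
```

Fed the rows from the report above, this reports nothing, which matches
the puzzle here: the counters that should move on failed writes are all
at zero.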
>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>> FLUSH_EXT timing out usually indicates that the drive is having problems
>> writing out what it has in its cache to the media. There was one case
>> where FLUSH_EXT timeout was caused by the driver failing to switch
>> controller back from NCQ mode before issuing FLUSH_EXT but that was on
>> sata_nv. There hasn't been any similar problem on sata_sil24.
>
> Hmm, I didn't notice any data distortions, and if there were, they live
> on as copies in their new home..
It should have appeared as read errors. Maybe the drive successfully
wrote those sectors after the 30+ second timeout.
--
tejun