linux-kernel - Re: 2.6.24.3: regular sata drive resets

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <47EEE4BF.5080609@gmail.com>
Date:	Sun, 30 Mar 2008 09:54:23 +0900
From:	Tejun Heo <htejun@...il.com>
To:	Hans-Peter Jansen <hpj@...la.net>
CC:	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-ide@...r.kernel.org
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?

Hello,

Hans-Peter Jansen wrote:
>>>> Should I be worried? smartd doesn't show anything suspicious on those.
>> Can you please post the result of "smartctl -a /dev/sdX"?
> 
> Here's the last smart report from two of the offending drives. As noted 
> before, I did the hardware reorganization, replaced the dog slow 3ware 
> 9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives 
> for now, but a nephew already showed interest. What do you think, can I 
> cede those drives with a clear conscience? The Hardware_ECC_Recovered
> values are really worrisome, aren't they?

Different vendors use different scales for the raw values.  The value is 
still pegged at the highest so it could be those raw values are okay or 
that the vendor just doesn't update value field accordingly.  My P120 
says 0 for the raw value and 904635 for hardware ECC recovered so there 
is some difference.  What do other non-failing drives say about those 
values?

> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       82
>   3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17647
>  10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
> 190 Airflow_Temperature_Cel 0x0022   124   124   000    Old_age   Always       -       38
> 194 Temperature_Celsius     0x0022   124   124   000    Old_age   Always       -       38
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
> 202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

Hmmm... If the drive is failing FLUSHs, I would expect to see elevated 
reallocation counters and maybe some pending counts.  Aieee.. weird.

>>>> It's been 4 samsung drives at all hanging on a sata sil 3124:
>> FLUSH_EXT timing out usually indicates that the drive is having problem
>> writing out what it has in its cache to the media.  There was one case
>> where FLUSH_EXT timeout was caused by the driver failing to switch
>> controller back from NCQ mode before issuing FLUSH_EXT but that was on
>> sata_nv.  There hasn't been any similar problem on sata_sil24.
> 
> Hmm, I didn't noticed any data distortions, and if there where, they live
> on as copies in their new home.. 

It should have appeared as read errors.  Maybe the drive successfully 
wrote those sectors after 30+ secs timeout.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/