Message-Id: <200803300114.40096.hpj@urpla.net>
Date: Sun, 30 Mar 2008 01:14:39 +0100
From: Hans-Peter Jansen <hpj@...la.net>
To: Tejun Heo <htejun@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, linux-ide@...r.kernel.org
Subject: Re: 2.6.24.3: regular sata drive resets - worrisome?
Hi Tejun,
thanks for picking this issue up.
On Saturday, 29 March 2008, Tejun Heo wrote:
> Hello, Hans.
>
> Andrew Morton wrote:
> >> since I upgraded to 2.6.24.3 on one of my production systems, I see
> >> regular device resets like these:
> >>
> >> Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> >> Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> >> Mar 20 14:33:03 lisa5 kernel:          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>
> Ouch, timeout on FLUSH_EXT. Are all errors on cmd ea?
>
> >> Should I be worried? smartd doesn't show anything suspicious on those.
>
> Can you please post the result of "smartctl -a /dev/sdX"?
Here are the latest SMART reports from two of the offending drives. As noted
before, I reorganized the hardware: I replaced the dog-slow 3ware
9500S-8 and the SiI 3124 with a single Areca 1130 and retired the drives
for now, but a nephew has already shown interest. What do you think, can I
hand those drives over with a clear conscience? The Hardware_ECC_Recovered
raw values look really worrisome, don't they?
sdc:
smartctl version 5.38 [i686-suse-linux-gnu] Copyright (C) 2002-7 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint P120 series
Device Model: SAMSUNG SP2504C
Serial Number: S09QJ1GYA03006
Firmware Version: VT100-33
User Capacity: 250.059.350.016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Sun Mar 23 01:13:37 2008 CET
==> WARNING: May need -F samsung3 enabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (4866) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 81) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       82
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17647
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   124   124   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   124   124   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162956700
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17624 -
# 2 Short offline Completed without error 00% 17601 -
# 3 Short offline Completed without error 00% 17577 -
# 4 Short offline Completed without error 00% 17553 -
# 5 Short offline Completed without error 00% 17528 -
# 6 Short offline Completed without error 00% 17504 -
# 7 Extended offline Completed without error 00% 17489 -
# 8 Short offline Completed without error 00% 17480 -
# 9 Short offline Completed without error 00% 17456 -
#10 Short offline Completed without error 00% 17432 -
#11 Short offline Completed without error 00% 17408 -
#12 Short offline Completed without error 00% 17384 -
#13 Short offline Completed without error 00% 17360 -
#14 Short offline Completed without error 00% 17336 -
#15 Extended offline Completed without error 00% 17320 -
#16 Short offline Completed without error 00% 17311 -
#17 Short offline Completed without error 00% 17287 -
#18 Short offline Completed without error 00% 17263 -
#19 Short offline Completed without error 00% 17239 -
SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
sdd:
smartctl version 5.38 [i686-suse-linux-gnu] Copyright (C) 2002-7 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint P120 series
Device Model: SAMSUNG SP2504C
Serial Number: S09QJ1GYA03003
Firmware Version: VT100-33
User Capacity: 250.059.350.016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Sun Mar 23 01:13:38 2008 CET
==> WARNING: May need -F samsung3 enabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (4836) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 80) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       79
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5952
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       23
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       17648
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
190 Airflow_Temperature_Cel 0x0022   118   118   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   118   118   000    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       162520674
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   253   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 17626 -
# 2 Short offline Completed without error 00% 17602 -
# 3 Short offline Completed without error 00% 17578 -
# 4 Short offline Completed without error 00% 17554 -
# 5 Short offline Completed without error 00% 17530 -
# 6 Short offline Completed without error 00% 17506 -
# 7 Extended offline Completed without error 00% 17490 -
# 8 Short offline Completed without error 00% 17482 -
# 9 Short offline Completed without error 00% 17457 -
#10 Short offline Completed without error 00% 17433 -
#11 Short offline Completed without error 00% 17409 -
#12 Short offline Completed without error 00% 17385 -
#13 Short offline Completed without error 00% 17361 -
#14 Short offline Completed without error 00% 17337 -
#15 Extended offline Completed without error 00% 17321 -
#16 Short offline Completed without error 00% 17313 -
#17 Short offline Completed without error 00% 17289 -
#18 Short offline Completed without error 00% 17264 -
#19 Short offline Completed without error 00% 17240 -
SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
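For comparing attribute tables like the two above, a rough Python sketch
follows. It is only an illustration, not part of smartmontools: the names
parse_attributes and at_or_below_threshold are made up here, and the regular
expression assumes the ten-column attribute layout that smartctl 5.38 prints
in these reports. It extracts each row and lists attributes whose normalized
VALUE has dropped to or below a non-zero THRESH. Raw values are
vendor-specific, so the sketch reports them without judging them.

```python
import re

# One attribute row of "smartctl -a" output (assumed layout, as in the
# reports above): ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value.
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: dict} for every attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ATTR_ROW.match(line)
        if not m:
            continue  # headers, self-test log rows, etc. do not match
        ident, name, _flag, value, worst, thresh, kind, _upd, failed, raw = m.groups()
        attrs[name] = {
            "id": int(ident),
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "type": kind,
            "when_failed": failed,
            "raw": int(raw),
        }
    return attrs


def at_or_below_threshold(attrs):
    """Names of attributes whose normalized value has reached a non-zero threshold."""
    return sorted(
        name for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    )


# Three rows copied from the sdc report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 162956700
197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
"""

attrs = parse_attributes(SAMPLE)
print(attrs["Hardware_ECC_Recovered"]["raw"])
print(at_or_below_threshold(attrs))
```

For the sample rows, no attribute has reached its threshold, which matches
the PASSED overall-health result in both reports.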
> >> In all, it's been 4 Samsung drives attached to a SATA SiI 3124:
>
> FLUSH_EXT timing out usually indicates that the drive is having problem
> writing out what it has in its cache to the media. There was one case
> where FLUSH_EXT timeout was caused by the driver failing to switch
> controller back from NCQ mode before issuing FLUSH_EXT but that was on
> sata_nv. There hasn't been any similar problem on sata_sil24.
Hmm, I didn't notice any data corruption, and if there was any, it lives
on in the copies at their new home.
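On the question of whether all errors are on cmd ea: one way to check is to
tally the opcode from every libata "cmd" line in the kernel log. The Python
sketch below is only an illustration. The function name
tally_failed_commands is invented here, the opcode table is a small,
non-exhaustive excerpt of the ATA command set, and the pattern assumes
unwrapped log lines in the format quoted earlier in this message.

```python
import re
from collections import Counter

# A few ATA command opcodes; deliberately not exhaustive.
OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}

# Matches e.g. "ata2.00: cmd ea/00:00:..." and captures device and opcode.
CMD_LINE = re.compile(r"(ata\d+\.\d+): cmd ([0-9a-fA-F]{2})/")


def tally_failed_commands(log_text):
    """Count failed commands per (device, command name) in libata error output."""
    counts = Counter()
    for m in CMD_LINE.finditer(log_text):
        device = m.group(1)
        opcode = int(m.group(2), 16)
        name = OPCODES.get(opcode, "opcode 0x%02x" % opcode)
        counts[(device, name)] += 1
    return counts


# The three log lines quoted at the top of this thread, unwrapped.
SAMPLE = (
    "Mar 20 14:33:03 lisa5 kernel: ata2.00: exception Emask 0x0 SAct 0x0 "
    "SErr 0x0 action 0x2 frozen\n"
    "Mar 20 14:33:03 lisa5 kernel: ata2.00: cmd "
    "ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0\n"
    "Mar 20 14:33:03 lisa5 kernel:          res "
    "40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)\n"
)

for (device, name), n in sorted(tally_failed_commands(SAMPLE).items()):
    print(device, name, n)
```

Fed the complete kernel log instead of the sample, anything other than
FLUSH CACHE EXT in the tally would show that the timeouts are not limited
to cache flushes.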
Thanks,
Pete