linux-kernel - BUGHUNTING: sata_sil, Silicon Image 3114 controllers, 2.6.18 and numerous errors

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [day] [month] [year] [list]

Message-ID: <op.tgu2zhvzxci36i@akima>
Date:	Tue, 03 Oct 2006 20:55:55 +0100
From:	"Jonathan Bell" <doggs.lay.eggs@...glemail.com>
To:	linux-kernel@...r.kernel.org
Subject: BUGHUNTING: sata_sil, Silicon Image 3114 controllers, 2.6.18 and numerous errors

Hello,

This is a long-winded explanation of events so I'll try to keep it concise.

I am running the following system (a large file server):

Athlon XP 2600+ CPU
1GB PC2700 RAM
A7N8X (nForce2) motherboard, BIOS 1008 (latest)
2x Silicon Image SiI3114 SATA controllers, output of LSPCI in [1]lspci.gz
D-Link DG530-T Gigabit PCI network card
Twinhan VisionPlus VisionDTV TV card

Drives =

2xWD3000JD 300GB western digital
2xWD3200JD 320GB western digital

3x6L300S0 300GB maxtor
3x6B300S0 300GB maxtor (identical to above but slightly older firmware and  
non-RoHS compliant)

The mess of problems began with _all_ these drives whilst using kernel  
2.6.15-26-k7 in the Ubuntu Dapper distribution. I got hit with a bug  
concerning FUA:

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/b0c495e4cf9d6d2/6ac40ae91be51b23?lnk=st&q=libpata+code+issues&rnum=1#6ac40ae91be51b23

At this stage my testing methodology was the following:

1) Make filesystems (reiserfs) on each of the drives
2) Make a huge file (11GB) by piping the output of /bin/yes "0123456789"  
to a file
3) Sync the disks
4) Calculate MD5 checksum of the file
5) Copy hugefile across to the next drive
6) Calculate MD5sum on new file

The writing of the hugefile to EACH drive would fill up the kernel log  
with errors seen all over the above linked thread, about one every 20  
seconds and the MD5sums of the files copied were different.

Noting that the problem occured on both the maxtor drives (with much more  
severity leading to device resets) and the western digitals I manually  
installed the 2.6.18 kernel from kernel.org which disabled FUA by default.  
This made the errors "disappear" on the maxtor drives but the western  
digital drives still displayed errors the same as the ones before.

Because I am naiive I went out and purchased another four drives, Seagates  
this time. I replaced the Western Digitals in the machine with the  
following model: 4xST3250824NS (250GB) and started testing again. This  
time I get errors like:

[ 1876.112335] attempt to access beyond end of device
[ 1876.112429] sdc1: rw=0, want=4517265416, limit=488392002
[ 1876.112516] attempt to access beyond end of device
[ 1876.112588] sdc1: rw=0, want=2110783496, limit=488392002
[ 1876.112723] attempt to access beyond end of device
[ 1876.112793] sdc1: rw=0, want=4656529416, limit=488392002
[ 1876.122250] attempt to access beyond end of device
[ 1876.122339] sdc1: rw=0, want=4517265416, limit=488392002

and even a kernel Oops: [2]reiseroops.gz

These errors occured after I copied hugefile to a destination drive and  
during the calculation of the md5sum, i.e. upon reading the new file.

Digging deeper, I am now at the stage where I can report the following:

The errors are happening independent of the controller hardware. One  
brand-new controller and one used and verified working (on Windows) with  
the same errors happening in each case.
The errors are independent of type of drive - Seagates and Maxtors both  
exhibit the same errors.
The errors are independent of filesystem - I tried and got the same result  
with ext2, ext3 and reiserfs 3.6.
The errors ONLY occur when reading from newly-created files. I am  
currently badblocks -n'ing the drives which will obviously take some time  
on drives this large in order to find out if this does happen with simple  
block read/writes.

Also I can say this:

The hugefile copied from the FIRST drive on one controller to the FIRST  
drive on the other controller exhibited NO ERRORS in either direction. By  
this I mean a Seagate attached to port0 and a Maxtor attached to port0.  
[3]dmesg-detection.gz may help with this - it is the kernel's detection of  
the drives. Whether this is due to dumb luck or a quirk in this "bug", I  
don't know but I will keep trying to make this error happen on the first  
drives in the system.

The hugefile copied from the FIRST Seagate drive to the SECOND, THIRD and  
FOURTH Seagate drives all make md5sum say "input/output error" and  
associated "access beyond end of device" errors in dmesg. The same thing  
happens when I copy from the FIRST maxtor drive to the second and third  
(not fourth as it contains NTFS data).

I will keep copying back and forth between drives in an effort to map out  
what is causing the error, but I'm going to need some pointers to track  
this to the source.

Any help appreciated,
Jonathan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/