linux-kernel - Re: sata_sil24 broken since 2.6.23-rc4-mm1


Message-ID: <64bb37e0710042306s6c629163gde7bc5c93973153e@mail.gmail.com>
Date:	Fri, 5 Oct 2007 08:06:11 +0200
From:	"Torsten Kaiser" <just.for.lkml@...glemail.com>
To:	"Matt Mackall" <mpm@...enic.com>
Cc:	"Tejun Heo" <htejun@...il.com>, "Jeff Garzik" <jeff@...zik.org>,
	linux-kernel@...r.kernel.org, akpm@...ux-foundation.org
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/4/07, Matt Mackall <mpm@...enic.com> wrote:
> On Thu, Oct 04, 2007 at 07:32:52AM +0200, Torsten Kaiser wrote:
> > So now I'm rather out of ideas what to test... :(
>
> I'd give your previous bisect step another try.

Yes, I thought about that too. But I never seemed to need more than
two tries to make it fail.
So I would only suspect the last good step of being a false positive.
That would then point to the first of your maps2 patches, the one that
moves the pagewalker code.
Do you think that is a plausible cause?
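
As a rough way to judge how far a "good" bisect step can be trusted when the failure is intermittent, the sketch below computes the chance that a bad kernel still boots cleanly several times in a row. The per-boot failure probabilities are assumptions for illustration only (nothing in this thread measures them), and the boots are assumed to be independent.

```python
def false_good_probability(p_fail_per_boot: float, clean_boots: int) -> float:
    """Chance that a *bad* kernel boots cleanly `clean_boots` times in a row.

    p_fail_per_boot is an assumed, not measured, probability that a single
    boot of a bad kernel shows the failure. Boots are treated as independent.
    """
    if not 0.0 <= p_fail_per_boot <= 1.0:
        raise ValueError("p_fail_per_boot must be between 0 and 1")
    if clean_boots < 0:
        raise ValueError("clean_boots must not be negative")
    return (1.0 - p_fail_per_boot) ** clean_boots


if __name__ == "__main__":
    # Hypothetical per-boot failure rates; adjust to match observations.
    for p in (0.5, 0.75):
        for boots in (1, 2, 3, 5):
            chance = false_good_probability(p, boots)
            print(f"p_fail={p:.2f} clean_boots={boots} false_good={chance:.4f}")
```

Under these assumptions, each additional clean boot of a step marked "good" shrinks the chance of a false positive geometrically, so re-testing only the last good step a few more times is a cheap check.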

> Looking back at the thread a bit, anything that requires the machine
> to be off for more than a couple seconds to manifest stops looking
> like software and firmware and starts looking like a heat-related
> electrical or mechanical issue. Make sure your backups are current.

What backups? :-)

Yes, I also thought about hardware trouble, but the bisect result
seemed too consistent for that.
Also, it is not always the same drive that fails, but it is always
one of the sil drives.

I have now enabled ATA_DEBUG to see whether the good and the bad boots differ.
The logs look the same until the RAID5 starts.

Good boot:
[   40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.160000] ata_sg_setup: 1 sg elements mapped
[   40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.160000] ata_sg_setup: 1 sg elements mapped
[   40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.320000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1 dhfis 0x1 dmafis 0x1 sactive 0x0
[   40.320000] nv_swncq_sdbfis: over
[   40.320000] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.320000] ata_exec_command: ata3: cmd 0xEA
[   40.390000] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[   40.390000] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[   40.420000] md: considering sdb1 ...
[   40.440000] md:  adding sdb1 ...
[   40.440000] md:  adding sda1 ...
[   40.450000] md: created md0
[   40.460000] md: bind<sda1>
[   40.470000] md: bind<sdb1>
[   40.480000] md: running: <sdb1><sda1>
[   40.500000] raid1: raid set md0 active with 2 out of 2 mirrors

Bad boot:
[   40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.060000] ata_sg_setup: 1 sg elements mapped
[   40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.060000] ata_sg_setup: 1 sg elements mapped
[   40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.200000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1 dhfis 0x1 dmafis 0x1 sactive 0x0
[   40.200000] nv_swncq_sdbfis: over
[   40.200000] ata_scsi_dump_cdb: CDB (3:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.200000] ata_exec_command: ata3: cmd 0xEA
[   40.270000] ata_hsm_move: ata3: protocol 1 task_state 3 (dev_stat 0x40)
[   40.270000] ata_hsm_move: ata3: dev 0 command complete, drv_stat 0x40
[   70.060000] ata_scsi_timed_out: ENTER
[   70.060000] ata_scsi_timed_out: EXIT, ret=0
[   70.080000] ata_scsi_error: ENTER
[   70.080000] ata_port_flush_task: ENTER
[   70.100000] ata1: ata_port_flush_task: EXIT
[   70.110000] __ata_port_freeze: ata1 port frozen
[   70.220000] __ata_port_freeze: ata1 port frozen
[   70.230000] ata_eh_link_autopsy: ENTER
[   70.240000] ata_eh_link_autopsy: EXIT
[   70.250000] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[   70.270000] ata1.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out
[   70.270000]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
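
The comparison of the two boots can also be done mechanically. The following is a minimal sketch, not the method used for this report: it strips the printk timestamps and reports the first line where the two logs disagree. The function names and the `GOOD`/`BAD` variables are made up for illustration; the log lines are copied from the excerpts above.

```python
import re
from itertools import zip_longest

# Matches the "[   40.160000] " printk timestamp prefix.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(log):
    """Return the non-empty log lines with their timestamp prefix removed."""
    return [_TIMESTAMP.sub("", line) for line in log.splitlines() if line.strip()]


def first_divergence(good, bad):
    """Return (index, good_line, bad_line) for the first mismatch, else None."""
    pairs = zip_longest(strip_timestamps(good), strip_timestamps(bad))
    for index, (good_line, bad_line) in enumerate(pairs):
        if good_line != bad_line:
            return index, good_line, bad_line
    return None


GOOD = """\
[   40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.160000] ata_sg_setup: 1 sg elements mapped
[   40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.160000] ata_sg_setup: 1 sg elements mapped
[   40.160000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.160000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
"""

BAD = """\
[   40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.060000] ata_sg_setup: 1 sg elements mapped
[   40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
[   40.060000] ata_sg_setup: 1 sg elements mapped
[   40.060000] ata_scsi_dump_cdb: CDB (2:0,0,0) 35 00 00 00 00 00 00 00 00
[   40.200000] nv_swncq_host_interrupt: id 0x3 SWNCQ: qc_active 0x1
"""

if __name__ == "__main__":
    print(first_divergence(GOOD, BAD))
```

On these excerpts the first mismatch is the eighth line (index 7): the good boot issues a second `35` CDB to 1:0,0,0, while the bad boot goes straight to `nv_swncq_host_interrupt`. That would be consistent with the earlier write to ata1 never completing, though the logs alone do not prove it.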

After [   40.060000] ata_scsi_dump_cdb: CDB (1:0,0,0) 2a 00 25 42 d6 09 00 00 08
the drive sda falls off the face of the earth, and the error handler cannot
recover it by either soft- or hard-resetting the port.
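
For reference, the CDB in question can be decoded by hand. The sketch below assumes the standard 10-byte SCSI layout (opcode in byte 0, big-endian LBA in bytes 2-5, block count in bytes 7-8) and assumes that the libata error line prints the taskfile as command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device. The helper names are made up for illustration.

```python
# Opcodes that appear in the logs above; anything else is shown as hex.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x35: "SYNCHRONIZE CACHE(10)"}


def decode_cdb10(hex_bytes):
    """Decode the bytes printed by ata_scsi_dump_cdb (it shows 9 of them)."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) < 9:
        raise ValueError("need at least 9 CDB bytes")
    return {
        "opcode": OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


def decode_ncq_taskfile(taskfile):
    """Decode the 'cmd' field of a libata error line for an NCQ command.

    Assumed field order: see the paragraph above. For NCQ commands the
    sector count travels in the feature registers, not in nsect.
    """
    command, low, hob, _device = taskfile.split("/")
    feature, _nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in hob.split(":")
    )
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "command": int(command, 16),
        "lba": lba,
        "sectors": hob_feature << 8 | feature,
    }


if __name__ == "__main__":
    scsi = decode_cdb10("2a 00 25 42 d6 09 00 00 08")
    ata = decode_ncq_taskfile("61/08:00:09:d6:42/00:00:25:00:00/40")
    print(scsi["opcode"], hex(scsi["lba"]), scsi["blocks"])
    print(hex(ata["command"]), hex(ata["lba"]), ata["sectors"])
```

Read this way, the CDB is a WRITE(10) of 8 blocks at LBA 0x2542d609, and the timed-out ATA command 0x61 (WRITE FPDMA QUEUED) targets the same LBA with 8 sectors, which with 512-byte sectors matches the "data 4096 out" in the error line.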

So I will use the weekend to try to find out who issues this
command and add more debug output there...

Torsten