linux-kernel - Re: sata_sil24 broken since 2.6.23-rc4-mm1

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <64bb37e0710011100t2cd81a32g501435b98f783ba9@mail.gmail.com>
Date:	Mon, 1 Oct 2007 20:00:14 +0200
From:	"Torsten Kaiser" <just.for.lkml@...glemail.com>
To:	"Tejun Heo" <htejun@...il.com>
Cc:	"Jeff Garzik" <jeff@...zik.org>, linux-kernel@...r.kernel.org,
	akpm@...ux-foundation.org
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 9/30/07, Torsten Kaiser <just.for.lkml@...glemail.com> wrote:
> On 9/30/07, Tejun Heo <htejun@...il.com> wrote:
> > Torsten Kaiser wrote:
> > > What I find kind of interessing is, that while I got three different
> > > error codes the cmd part of the output was always the same.
> >
> > That's NCQ write command.  You'll be using it a lot if you're rebuilding
> > md5.
>
> It's not rebuilding the RAID at that point.
> If one drive fails, I reboot into a "safe" kernel, fix the RAID and
> that way try the next boot with a clean RAID again.
>
> The error happens when the RAID is initialized, might be the first
> write into the superblock to mark it dirty/inuse that triggers the
> error.

To make that comment "cmd part of the output was always the same" more
clear: I did not only mean that the first (cmd-)byte was the same, but
the whole line:
cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0 cdb 0x0 data 4096 out

I did find ATA_CMD_FPDMA_WRITE = 0x61in ata.h, but I don't find a good
hint for the 08 that follows. The interpretation XFER_PIO_0 = 0x08
seems wrong...

> > It seems like something is going wrong with request DMA or sg
> > mapping.  Maybe some change in block/*.[hc]?
>
> The sg-chaining-patch stands out, but I have no conclusive proof that
> it really is the cause.
> As noted in this thread, a long time I thought that rc7 with the
> sg-chaining-patch was safe, but one time it also showed the error.

If I look at what patches remain, it seems that some other (earlier)
patch that is new in 2.6.23-rc4-mm1 is the trigger, but it will only
fail together with a second patch.

> > > [the 2.6.23-rc4-mm1 series-file has 2013 lines]
> > > Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> > > Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch
> > > (line779): 2 good boots

Up to (incl.) fs-remove-some-aop_truncated_page.patch (line 956): 2 good boots
Up to (incl.) update-n_high_memory-node-state-for-memory-hotadd.patch (line978)
1 good boot, so I would tend to say that this patches are also OK, but
I will recheck this version.

Up to (incl.) maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch
(line 1038): 1 failed

> > > Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line
> > > 1045): 1 failed
> > > Up to (incl.) fix-discrepancy-between-vdso-based... (line1461): 1 good, 1 failed

Updated list of remaining, new patches in rc4:

> > > That means from the patches added to the rc4 variant of the mm-kernel
> > > the following are remaining:
> [snip]
> > > memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> > > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> > > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> > > update-n_high_memory-node-state-for-memory-hotadd.patch

And I think these are OK too.
So I think some earlier patch triggers bug together with one of these patches:

categorize-gfp-flags.patch
categorize-gfp-flags-fix.patch
make-swappiness-safer-to-use.patch
... these should not change any behaivor from there description

> flush-cache-before-*
> Don't think so: No ia64 system, unchanged from rc3

> # grouping pages by mobility patches
> ... no idee, but seem unchanged

> maps2.*
> Don't think that related...

Will continue to break this down...

> As for you printk:
> From two goot boots, I had not had any failures with it:
> First one:
> Sep 30 19:24:53 treogen [    3.810000] XXX sil24 cb=ffff810037ef0000
> cb_dma=37ef0000
> Sep 30 19:24:53 treogen [    3.820000] XXX sil24 cb=ffff810037f00000
> cb_dma=37f00000
> Second:
> Sep 30 20:06:22 treogen [    3.820000] XXX sil24 cb=ffff810037f00000
> cb_dma=37f00000
> Sep 30 20:06:22 treogen [    3.830000] XXX sil24 cb=ffff810037f10000
> cb_dma=37f10000

Next boot with this kernel was also good:
Oct  1 06:52:32 treogen [    3.740000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
Oct  1 06:52:32 treogen [    3.750000] XXX sil24 cb=ffff810037f10000
cb_dma=37f10000

Next kernel (maps2-make-proc-pid-smaps-optional-under-config_embeddedpatch-fix.patch)
Oct  1 18:07:09 treogen [    3.760000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000
Oct  1 18:07:09 treogen [    3.770000] XXX sil24 cb=ffff810005e00000
cb_dma=5e00000
-> failed, but when rebooting instead of a cold boot that kernel worked:
Oct  1 18:14:42 treogen [    3.800000] XXX sil24 cb=ffff810005df0000
cb_dma=5df0000
Oct  1 18:14:42 treogen [    3.810000] XXX sil24 cb=ffff810005e00000
cb_dma=5e00000
... same values, so thats not an indicator that trouble will follow.

Next kernel (update-n_high_memory-node-state-for-memory-hotadd.patch)
Oct  1 19:31:38 treogen [    3.840000] XXX sil24 cb=ffff81007fa60000
cb_dma=7fa60000
Oct  1 19:31:38 treogen [    3.850000] XXX sil24 cb=ffff810037f00000
cb_dma=37f00000
that was also a good boot.

Torsten
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/