Message-ID: <64bb37e0709301139h456a82d6u98630a4d1503eaf@mail.gmail.com>
Date: Sun, 30 Sep 2007 20:39:44 +0200
From: "Torsten Kaiser" <just.for.lkml@...glemail.com>
To: "Tejun Heo" <htejun@...il.com>
Cc: "Jeff Garzik" <jeff@...zik.org>, linux-kernel@...r.kernel.org,
akpm@...ux-foundation.org
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1
On 9/30/07, Tejun Heo <htejun@...il.com> wrote:
> Torsten Kaiser wrote:
> > What I find kind of interesting is that while I got three different
> > error codes, the cmd part of the output was always the same.
>
> That's an NCQ write command. You'll be using it a lot if you're
> rebuilding md5.
It's not rebuilding the RAID at that point.
If one drive fails, I reboot into a "safe" kernel and fix the RAID
there, so the next boot is tried with a clean RAID again.
The error happens when the RAID is initialized; it might be the first
write into the superblock, marking it dirty/in-use, that triggers the
error.
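
(Aside: the "cmd" value in those dumps is the ATA opcode byte, which
is how it can be identified as an NCQ write. A minimal sketch of the
relevant constants; the values match include/linux/ata.h, but the
decode helper itself is purely illustrative, not from the driver:)

	/* From include/linux/ata.h: the two NCQ data commands. */
	enum {
		ATA_CMD_FPDMA_READ	= 0x60,	/* READ FPDMA QUEUED */
		ATA_CMD_FPDMA_WRITE	= 0x61,	/* WRITE FPDMA QUEUED */
	};

	/* Illustrative only: cmd == 0x61 in every error dump means each
	 * failing request was a queued (NCQ) write, e.g. an md
	 * superblock update. */
	static inline int is_ncq_write(u8 cmd)
	{
		return cmd == ATA_CMD_FPDMA_WRITE;
	}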
> It seems like something is going wrong with request DMA or sg
> mapping. Maybe some change in block/*.[hc]?
The sg-chaining patch stands out, but I have no conclusive proof that
it really is the cause.
As noted in this thread, for a long time I thought that rc7 with the
sg-chaining patch was safe, but one time it also showed the error.
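
(For context, a sketch of why the sg-chaining work could produce
exactly this kind of DMA corruption. With chained scatterlists the sg
table is no longer one flat array: the last slot of a chunk can be a
link to the next chunk, so any code that advances with plain sg++
instead of sg_next()/for_each_sg() can treat a chain entry as a real
segment and map garbage. The function below is illustrative only, not
from the patch:)

	#include <linux/kernel.h>
	#include <linux/scatterlist.h>

	/* Illustrative walker: for_each_sg() advances via sg_next(),
	 * which transparently follows chain entries.  An old-style
	 * "sg++" walk breaks once tables are chained. */
	static void dump_sg_list(struct scatterlist *sgl, unsigned int nents)
	{
		struct scatterlist *sg;
		unsigned int i;

		for_each_sg(sgl, sg, nents, i)
			printk(KERN_DEBUG "sge %u: addr=%llx len=%u\n", i,
			       (unsigned long long)sg_dma_address(sg),
			       sg_dma_len(sg));
	}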
> > It's not just 2.6.23-rc4-mm1. All -mm's after rc4 are broken for me.
> > Confirmed breakage on -rc4-mm1, -rc6-mm1 and -rc8-mm1. I'm just
> > narrowing in on rc4-mm1 because that was the first version to break.
> >
> > I'm currently trying to bisect 2.6.23-rc4-mm1. Here is the current status:
>
> Have you tested 2.6.23-rc4 without the mm patches? It could be
> something introduced between -rc3 and -rc4.
Not directly, but I have had 4 good boots with one part of the
mm-patches applied. So I would tend to say that mainline 2.6.23-rc4
does not have this bug.
> > [the 2.6.23-rc4-mm1 series file has 2013 lines]
> > Up to (incl.) x86_64-convert-to-clockevents.patch (line 747): 2 good boots
> > Up to (incl.) x86_64-cleanup-struct-irqaction-initializers-patch (line 779): 2 good boots
> > Up to (incl.) slub-optimize-cacheline-use-for-zeroing.patch (line 1045): 1 failed
> > Up to (incl.) fix-discrepancy-between-vdso-based... (line 1461): 1 good, 1 failed
> >
> > Next try: up to patch fs-remove-some-aop_truncated_page.patch
It looks like this one is OK too.
> > That means from the patches added to the rc4 variant of the mm-kernel
> > the following are remaining:
[snip]
> > memoryless-nodes-add-n_cpu-node-state-move-setup-of-n_cpu-node-state-mask.patch
> > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix.patch
> > memoryless-nodes-fixup-uses-of-node_online_map-in-generic-code-fix-2.patch
> > update-n_high_memory-node-state-for-memory-hotadd.patch
> > slub-avoid-page-struct-cacheline-bouncing-due-to-remote-frees-to-cpu-slab.patch
> > slub-do-not-use-page-mapping.patch
> > slub-move-page-offset-to-kmem_cache_cpu-offset.patch
> > slub-avoid-touching-page-struct-when-freeing-to-per-cpu-slab.patch
> > slub-place-kmem_cache_cpu-structures-in-a-numa-aware-way.patch
> > slub-optimize-cacheline-use-for-zeroing.patch
> >
> > But due to the unreliable nature of the bug, I can't be too sure about that.
>
> Yeah, that's what I'm worried about. Bisection is extremely difficult
> if errors are intermittent and take a long time to reproduce.
Yes...
As for the remaining patches:
memoryless-nodes-*
Don't think so: I do have a NUMA system, but both nodes have memory.
flush-cache-before-*
Don't think so: no ia64 system here, and unchanged from rc3.
# grouping pages by mobility patches
... no idea, but they seem unchanged.
maps2-*
Don't think that's related...
remaining slub-* patches
Might be...
As for your printk:
From two good boots, I have not had any failures with it:
First one:
Sep 30 19:24:53 treogen [ 3.810000] XXX sil24 cb=ffff810037ef0000 cb_dma=37ef0000
Sep 30 19:24:53 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000 cb_dma=37f00000
Second:
Sep 30 20:06:22 treogen [ 3.820000] XXX sil24 cb=ffff810037f00000 cb_dma=37f00000
Sep 30 20:06:22 treogen [ 3.830000] XXX sil24 cb=ffff810037f10000 cb_dma=37f10000
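
(For context: the cb/cb_dma pair above is the port's command block as
allocated in sil24_port_start() in drivers/ata/sata_sil24.c. Below is
a hedged reconstruction of where the XXX printk presumably sits; the
allocation code follows the driver, but the exact printk line and the
private-data field names are my guess:)

	static int sil24_port_start(struct ata_port *ap)
	{
		struct device *dev = ap->host->dev;
		struct sil24_port_priv *pp;
		union sil24_cmd_block *cb;
		size_t cb_size = sizeof(*cb) * SIL24_MAX_CMDS;
		dma_addr_t cb_dma;

		pp = devm_kzalloc(dev, sizeof(*pp), GFP_KERNEL);
		if (!pp)
			return -ENOMEM;

		cb = dmam_alloc_coherent(dev, cb_size, &cb_dma, GFP_KERNEL);
		if (!cb)
			return -ENOMEM;
		memset(cb, 0, cb_size);

		/* Debug printk (reconstructed): dump the kernel virtual
		 * and DMA addresses of the command block, matching the
		 * boot log lines above. */
		printk("XXX sil24 cb=%p cb_dma=%llx\n",
		       cb, (unsigned long long)cb_dma);

		pp->cmd_block = cb;
		pp->cmd_block_dma = cb_dma;
		ap->private_data = pp;

		return 0;
	}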
Torsten