Message-ID: <64bb37e0710070144m6bc2c844oc96ef715b53b9819@mail.gmail.com>
Date:	Sun, 7 Oct 2007 10:44:25 +0200
From:	"Torsten Kaiser" <just.for.lkml@...glemail.com>
To:	"Tejun Heo" <htejun@...il.com>
Cc:	"Jeff Garzik" <jeff@...zik.org>, linux-kernel@...r.kernel.org,
	akpm@...ux-foundation.org
Subject: Re: sata_sil24 broken since 2.6.23-rc4-mm1

On 10/5/07, Torsten Kaiser <just.for.lkml@...glemail.com> wrote:
> So I will use the weekend to see if I can find out who issues this
> command and add more debug to that place...

I added some DPRINTK to sil24_qc_issue and sil24_fill_sg, but I only
found one suspicious thing.

My sil24_fill_sg now looks like this:
static inline void sil24_fill_sg(struct ata_queued_cmd *qc,
                                 struct sil24_sge *sge)
{
        struct scatterlist *sg;

        ata_for_each_sg(sg, qc) {
                sge->addr = cpu_to_le64(sg_dma_address(sg));
                sge->cnt = cpu_to_le32(sg_dma_len(sg));
                if (ata_sg_is_last(sg, qc))
                        sge->flags = cpu_to_le32(SGE_TRM);
                else
                        sge->flags = 0;
                DPRINTK("flags,addr,cnt = 0x%x, 0x%X, 0x%X\n", sge->flags,
                        sge->addr, sge->cnt);
                sge++;
        }
}

What is suspicious is that *all* output from this DPRINTK shows flags as
0x0, so the last sg is never terminated (SGE_TRM is 1<<31)?
But if that is the cause, how is this working at all? Or am I doing
something stupid?
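
For comparison, here is a minimal userspace sketch of the same fill loop.
It is not the driver code: struct demo_sge, struct demo_seg and
demo_fill_sg are hypothetical stand-ins, everything is kept in host byte
order, and only the SGE_TRM value (1<<31) is taken from above. It prints
each field with a format that matches its width, so a set termination
flag would show up as 0x80000000 on the last entry.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SGE_TRM (UINT32_C(1) << 31)     /* value quoted in this mail */

/* Hypothetical stand-in for struct sil24_sge, host byte order only. */
struct demo_sge {
        uint64_t addr;
        uint32_t cnt;
        uint32_t flags;
};

/* Hypothetical stand-in for one DMA-mapped scatterlist segment. */
struct demo_seg {
        uint64_t dma_addr;
        uint32_t dma_len;
};

/* Fill n entries and set SGE_TRM on the last one only. */
static void demo_fill_sg(const struct demo_seg *seg, size_t n,
                         struct demo_sge *sge)
{
        for (size_t i = 0; i < n; i++) {
                sge[i].addr = seg[i].dma_addr;
                sge[i].cnt = seg[i].dma_len;
                sge[i].flags = (i + 1 == n) ? SGE_TRM : 0;
                /* Each argument gets a format matching its type. */
                printf("flags,addr,cnt = 0x%" PRIx32 ", 0x%" PRIx64
                       ", 0x%" PRIx32 "\n",
                       sge[i].flags, sge[i].addr, sge[i].cnt);
        }
}
```

With two segments, the first line printed shows flags 0x0 and the second
shows flags 0x80000000. In the kernel version the fields are little-endian
types, so converting them back (le32_to_cpu, le64_to_cpu) before printing
would be the equivalent step there.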

Timing and outputs from five boots:
good:                                bad:
          more         moreboot                more
3->35     3->35        3->35         3->35     3->35
3->2a     2->35        2->35         3->2a     3->2a
3->setup  2->2a        2->2a         3->setup  3->setup
2->35     2->35        2->35         2->35     2->35
1->35     3->2a        3->2a         1->35     1->35
2->2a     3->setup     3->setup      2->2a     2->2a
1->2a     1->35        1->35         1->2a     1->2a
2->35     1->2a        1->2a         2->35
1->35     1->35        1->35                   1->35
3->int    3->int       3->int        3->int    3->int
3->35     3->35        3->35         3->35     3->35
          1->5DF/1439C 1->5DC/1439C            1->5DE/1439C
          2->5E0/143BC 2->5DE/143BC            2->5DF/143BC
          sg:170E      sg:1AAB                 sg:1A60
XXX:
5DD       5DF          5DC           5DF       5DE
5E0       5E0          5DE           5E0       5DF

The first three columns were working attempts; in the last two, one drive failed.
column 1: ATA_DEBUG added, reboot
column 2: +my additions, reboot
column 3: +my additions, cold boot, wanted to make it fail, but worked
column 4: ATA_DEBUG added, cold boot
column 5: +my additions, cold boot
[x]->[y]: x is the ata-port, 1+2 on the sata_sil24, 3 on sata_nv with swncq
y:35 -> SYNCHRONIZE_CACHE commands that were sent to the drive
y:2a -> WRITE_10 commands that were sent to the drive
y:setup -> Debug from swncq: nv_swncq_dmafis: dma setup tag 0x0
y:int -> Debug from swncq: nv_swncq_host_interrupt: id 0x3 SWNCQ:
qc_active 0x1 ...

The lines before the XXX:
x->a/b: x is the ata-port, a the paddr from sil24_qc_issue, b the
activate from sil24_qc_issue
All outputs from sil24_qc_issue were identical within each boot sequence
and differed only from run to run.
sg:a: a is the sge->addr from sil24_fill_sg

The lines after the XXX:
These are the addresses that the XXX printk from sil24_port_start prints.

I hope this explains well enough what the table above means.
This whole sequence (two syncs and one write to each drive) happens
between the output:
[   40.300000] md1: bitmap initialized from disk: read 10/10 pages, set 87 bits
[   40.320000] created bitmap (145 pages) for device md1
and the error on a bad boot:
[   70.680000] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
[   70.700000] ata2.00: cmd 61/08:00:09:d6:42/00:00:25:00:00/40 tag 0
cdb 0x0 data 4096 out
or, on a good boot:
[   40.910000] md: considering sdb1 ...
(sdb1 is part of another raid)
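
To make the failing command easier to read, here is a small sketch that
decodes the "cmd" line from the error above. The field order
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:
hob_lbam:hob_lbah/device) is my assumption about how libata lays out this
report, as is reading the sector count of an NCQ command from the feature
registers. struct demo_tf and the demo_* helpers are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for the twelve hex fields of the "cmd" line. */
struct demo_tf {
        unsigned command, feature, nsect, lbal, lbam, lbah;
        unsigned hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah;
        unsigned device;
};

/* Parse "cc/ff:nn:ll:mm:hh/ff:nn:ll:mm:hh/dd"; returns 0 on success. */
static int demo_parse_cmd(const char *s, struct demo_tf *tf)
{
        int n = sscanf(s, "%x/%x:%x:%x:%x:%x/%x:%x:%x:%x:%x/%x",
                       &tf->command, &tf->feature, &tf->nsect,
                       &tf->lbal, &tf->lbam, &tf->lbah,
                       &tf->hob_feature, &tf->hob_nsect, &tf->hob_lbal,
                       &tf->hob_lbam, &tf->hob_lbah, &tf->device);
        return n == 12 ? 0 : -1;
}

/* Assemble the 48-bit LBA from the low and high-order (hob) bytes. */
static uint64_t demo_lba48(const struct demo_tf *tf)
{
        return ((uint64_t)tf->hob_lbah << 40) |
               ((uint64_t)tf->hob_lbam << 32) |
               ((uint64_t)tf->hob_lbal << 24) |
               ((uint64_t)tf->lbah << 16) |
               ((uint64_t)tf->lbam << 8) |
               (uint64_t)tf->lbal;
}

/* Assumption: for NCQ commands the sector count is in the feature regs. */
static unsigned demo_ncq_sectors(const struct demo_tf *tf)
{
        return (tf->hob_feature << 8) | tf->feature;
}
```

Fed with "61/08:00:09:d6:42/00:00:25:00:00/40", this gives command 0x61,
LBA 0x2542d609 and 8 sectors. With 512-byte sectors that is 4096 bytes,
which agrees with the "data 4096 out" part of the message.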

(If someone wants the complete boot logs, just ask.)

So now I have two questions:
1) What happens in sil24_fill_sg with SGE_TRM?
2) If that is ok, should I try to add debug to sil24_error_intr and/or
sil24_host_intr?

Torsten