linux-kernel - Re: Problem with ata layer in 2.6.24

Message-Id: <1201580616.12795.2.camel@localhost>
Date:	Tue, 29 Jan 2008 05:23:36 +0100
From:	Kasper Sandberg <lkml@...anurb.dk>
To:	Gene Heskett <gene.heskett@...il.com>
Cc:	Mikael Pettersson <mikpe@...uu.se>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Linux ide Mailing list <linux-ide@...r.kernel.org>
Subject: Re: Problem with ata layer in 2.6.24

On Mon, 2008-01-28 at 11:35 -0500, Gene Heskett wrote:
> On Monday 28 January 2008, Mikael Pettersson wrote:
> >Gene Heskett writes:
> > > On Monday 28 January 2008, Peter Zijlstra wrote:
> > > >On Mon, 2008-01-28 at 09:17 +0100, Mikael Pettersson wrote:
> > > >> 1. Wrong mailing list; use linux-ide (@vger) instead.
> > > >
> > > >What, and keep all us other interested people in the dark?
> > >
> > > As a test, I tried rebooting to the latest fedora kernel and found it
> > > kills X, so I'm back to the second to last fedora version ATM, and the
> > > > third 'smartctl -t long /dev/sda' in 24 hours is running now.  The first
> > > two completed with no errors.
> > >
> > > I've added the linux-ide list to refresh those people of the problem,
> > > the logs are being spammed by this message stanza:
> > >
> > > Jan 28 04:46:25 coyote kernel: [26550.290016] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > > Jan 28 04:46:25 coyote kernel: [26550.290028] ata1.00: cmd 35/00:58:c9:9c:0a/00:01:00:00:00/e0 tag 0 dma 176128 out
> > > Jan 28 04:46:25 coyote kernel: [26550.290029]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> > > Jan 28 04:46:25 coyote kernel: [26550.290032] ata1.00: status: { DRDY }
> > > Jan 28 04:46:25 coyote kernel: [26550.290060] ata1: soft resetting link
> > > Jan 28 04:46:25 coyote kernel: [26550.452301] ata1.00: configured for UDMA/100
> > > Jan 28 04:46:25 coyote kernel: [26550.452318] ata1: EH complete
> > > Jan 28 04:46:25 coyote kernel: [26550.455898] sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
> > > Jan 28 04:46:25 coyote kernel: [26550.456151] sd 0:0:0:0: [sda] Write Protect is off
> > > Jan 28 04:46:25 coyote kernel: [26550.456403] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> >
> >It's not obvious from this incomplete dmesg log what HW or driver
> >is behind ata1, but if the 2.6.24-rc7 kernel matches the 2.6.24 one,
> >it should be pata_amd driving a WDC disk:
> > > [   30.702887] pata_amd 0000:00:09.0: version 0.3.10
> > > [   30.703052] PCI: Setting latency timer of device 0000:00:09.0 to 64
> > > [   30.703188] scsi0 : pata_amd
> > > [   30.709313] scsi1 : pata_amd
> > > [   30.710076] ata1: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xf000 irq 14
> > > [   30.710079] ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xf008 irq 15
> > > [   30.864753] ata1.00: ATA-6: WDC WD2000JB-00EVA0, 15.05R15, max UDMA/100
> > > [   30.864756] ata1.00: 390721968 sectors, multi 16: LBA48
> > > [   30.871629] ata1.00: configured for UDMA/100
> >
> >Unfortunately we also see:
> > > [   48.285456] nvidia: module license 'NVIDIA' taints kernel.
> > > [   48.549725] ACPI: PCI Interrupt 0000:02:00.0[A] -> Link [APC4] -> GSI 19 (level, high) -> IRQ 20
> > > [   48.550149] NVRM: loading NVIDIA UNIX x86 Kernel Module  169.07  Thu Dec 13 18:42:56 PST 2007
> >
> >We have no way of debugging that module, so please try 2.6.24 without it.
> 
> Sorry, I can't do this and have a working machine.  The nv driver has suffered 
> bit rot or something since the FC2 days when it COULD run a 19" crt at 
> 1600x1200, and will not drive this 20" wide screen lcd 1680x1050 monitor at 
> more than 800x600, which is absolutely butt ugly fuzzy, looking like a jpg 
> compressed to 10%.  The system is not usable on a day to day basis without the 
> nvidia driver.
> 
> Fix the nv driver so it will run this screen at its native resolution and I'll 
> be glad to run it even if it won't run google earth, which I do use from time 
> to time.  Now, if in all the hits you can get from google on this, currently 
> 14,800 just for 'exception Emask', apparently caused by a timeout, if 100% of 
> the complainers are running nvidia drivers also, then I see a legit 

I can invalidate this theory...
I helped a guy on IRC debug this problem, and he had ATI. I tried having
him stop using fglrx and go to r300: same problem, and same problem
even with vesa. :)

Also, I have this on my fileserver with .20, which doesn't even run X
or have module support in the kernel. :)

> complaint.  Again, fix the nv driver so it will run my screen & I'll be glad 
> to switch.  I can see the reason, sure, but the machine must be capable of 
> doing its common day to day stuff, while using that driver, like running kde 
> for kmail, and browsers that work.
> 
> >If the problems persist, please try to capture a complete log from the
> >failing kernel -- the interesting bits are everything from initial boot
> >up to and including the first few errors. You may need to increase the
> >kernel's log buffer size if the log gets truncated (CONFIG_LOG_BUF_SHIFT).
> 
> If by log you mean /var/log/messages, I have several megabytes of those.
> If you mean a live dmesg capture taken right now, its attached. It contains 
> several of these at the bottom.  I long ago made the kernel log buffer 
> bigger, cuz it couldn't even show the start immediately after the boot, and 
> even the dump to syslog was truncated.
> 
> >There are no pata_amd changes from 2.6.24-rc7 to 2.6.24 final.
> 
> That is what I was afraid of.  I've done some limited grepping in that branch 
> of the kernel tree, and cannot seem to locate where this EH handler is being 
> invoked from.
> 
> There are 2 lines of interest in the dmesg:
> 
> [    0.000000] Nvidia board detected. Ignoring ACPI timer override.
> [    0.000000] If you got timer trouble try acpi_use_timer_override
> 
> But I have NDI what it means, kernel argument/xconfig option?
> 
> I've also done some googling, and it appears this problem is fairly widespread 
> since the switchover to libata was encouraged.  A stock fedora F8 kernel 
> suffers the same freezes and eventually locks up, but does it without the 
> error messages being logged, it just freezes, feeling identical to this in 
> the minutes before the total freeze.  I've tried 2 of those too, but the 
> newest one won't even run X.
> 
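
For anyone comparing these stanzas across machines, the "cmd" line in the
quoted log can be decoded by hand. Below is a minimal Python sketch, not
anything from libata itself. It assumes the field order printed by the
error handler is command/feature:nsect:lbal:lbam:lbah, then the five
high-order ("hob") bytes in the same order, then the device register. The
function name decode_taskfile and the 512-byte sector size are my own
choices for illustration.

```python
import re

# Assumed layout of a libata "cmd" line:
#   CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
TF_RE = re.compile(
    r"([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def decode_taskfile(text, sector_size=512):
    """Decode a 48-bit (LBA48) taskfile from a libata log line.

    Hypothetical helper for illustration only. It does not handle the
    special case where a sector count of 0 means 65536 sectors.
    """
    m = TF_RE.search(text)
    if m is None:
        raise ValueError("no taskfile found in: %r" % text)

    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (
        int(b, 16) for b in m.group(2).split(":")
    )
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in m.group(3).split(":")
    )
    device = int(m.group(4), 16)

    # High-order bytes extend the sector count to 16 bits and the LBA to 48.
    sectors = (hob_nsect << 8) | nsect
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )

    return {
        "command": command,
        "feature": (hob_feature << 8) | feature,
        "device": device,
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }


if __name__ == "__main__":
    line = "ata1.00: cmd 35/00:58:c9:9c:0a/00:01:00:00:00/e0 tag 0 dma 176128 out"
    tf = decode_taskfile(line)
    print("command 0x%02x, %d sectors (%d bytes) at LBA %d"
          % (tf["command"], tf["sectors"], tf["bytes"], tf["lba"]))
    # prints: command 0x35, 344 sectors (176128 bytes) at LBA 695497
```

The byte count comes out to 176128, which matches the "dma 176128 out" in
the log, so the assumed field order looks right for this stanza. Command
0x35 is WRITE DMA EXT, i.e. the drive timed out on an ordinary DMA write.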
