linux-kernel - Re: Problem with ata layer in 2.6.24

Message-id: <200801281156.28844.gene.heskett@gmail.com>
Date:	Mon, 28 Jan 2008 11:56:28 -0500
From:	Gene Heskett <gene.heskett@...il.com>
To:	Mikael Pettersson <mikpe@...uu.se>
Cc:	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Linux ide Mailing list <linux-ide@...r.kernel.org>
Subject: Re: Problem with ata layer in 2.6.24

On Monday 28 January 2008, Gene Heskett wrote:
While reading this msg as it came back, I locked up again and rebooted to 
2.6.24, and got lucky (maybe) as the attached dmesg will show quite a few 
instances of this LOOOONNNGG before the nvidia driver is loaded to taint the 
kernel.  Have fun guys!
 
>On Monday 28 January 2008, Mikael Pettersson wrote:
>>Gene Heskett writes:
>> > On Monday 28 January 2008, Peter Zijlstra wrote:
>> > >On Mon, 2008-01-28 at 09:17 +0100, Mikael Pettersson wrote:
>> > >> 1. Wrong mailing list; use linux-ide (@vger) instead.
>> > >
>> > >What, and keep all us other interested people in the dark?
>> >
>> > As a test, I tried rebooting to the latest fedora kernel and found it
>> > kills X, so I'm back to the second to last fedora version ATM, and the
>> > third 'smartctl -t long /dev/sda' in 24 hours is running now.  The first
>> > two completed with no errors.
>> >
>> > I've added the linux-ide list to remind those people of the problem.
>> > The logs are being spammed by this message stanza:
>> >
>> >  Jan 28 04:46:25 coyote kernel: [26550.290016] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
>> >  Jan 28 04:46:25 coyote kernel: [26550.290028] ata1.00: cmd 35/00:58:c9:9c:0a/00:01:00:00:00/e0 tag 0 dma 176128 out
>> >  Jan 28 04:46:25 coyote kernel: [26550.290029]          res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
>> >  Jan 28 04:46:25 coyote kernel: [26550.290032] ata1.00: status: { DRDY }
>> >  Jan 28 04:46:25 coyote kernel: [26550.290060] ata1: soft resetting link
>> >  Jan 28 04:46:25 coyote kernel: [26550.452301] ata1.00: configured for UDMA/100
>> >  Jan 28 04:46:25 coyote kernel: [26550.452318] ata1: EH complete
>> >  Jan 28 04:46:25 coyote kernel: [26550.455898] sd 0:0:0:0: [sda] 390721968 512-byte hardware sectors (200050 MB)
>> >  Jan 28 04:46:25 coyote kernel: [26550.456151] sd 0:0:0:0: [sda] Write Protect is off
>> >  Jan 28 04:46:25 coyote kernel: [26550.456403] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
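
For anyone trying to read the "cmd" line in the stanza above, here is a minimal Python sketch that unpacks the taskfile bytes. It assumes the field order libata prints in its error report (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and a 48-bit command with 512-byte sectors; the function name decode_cmd is made up for illustration and is not part of any kernel tool.

```python
import re

# Matches the taskfile dump in a libata "cmd" line, e.g.
#   cmd 35/00:58:c9:9c:0a/00:01:00:00:00/e0
# Assumed field order: command/feature:nsect:lbal:lbam:lbah/
#   hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TASKFILE_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def decode_cmd(line, sector_size=512):
    """Decode an LBA48 taskfile from a libata 'cmd' log line (hypothetical helper)."""
    m = TASKFILE_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd taskfile found in line")
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hob = [int(b, 16) for b in m.group(3).split(":")]
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = hob
    device = int(m.group(4), 16)

    # 16-bit sector count; for 48-bit commands a count of 0 means 65536.
    sectors = ((hob_nsect << 8) | nsect) or 65536
    # 48-bit LBA assembled from the "high order byte" and current registers.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": command,
        "device": device,
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }


if __name__ == "__main__":
    sample = "ata1.00: cmd 35/00:58:c9:9c:0a/00:01:00:00:00/e0 tag 0 dma 176128 out"
    print(decode_cmd(sample))
```

On the line quoted above this gives 0x158 = 344 sectors, i.e. 176128 bytes, which agrees with the "dma 176128 out" at the end of the same line. Command 0x35 is WRITE DMA EXT, so the timeout hit an ordinary DMA write.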
>>
>>It's not obvious from this incomplete dmesg log what HW or driver
>>is behind ata1, but if the 2.6.24-rc7 kernel matches the 2.6.24 one,
>>it should be pata_amd driving a WDC disk:
>> > [   30.702887] pata_amd 0000:00:09.0: version 0.3.10
>> > [   30.703052] PCI: Setting latency timer of device 0000:00:09.0 to 64
>> > [   30.703188] scsi0 : pata_amd
>> > [   30.709313] scsi1 : pata_amd
>> > [   30.710076] ata1: PATA max UDMA/133 cmd 0x1f0 ctl 0x3f6 bmdma 0xf000 irq 14
>> > [   30.710079] ata2: PATA max UDMA/133 cmd 0x170 ctl 0x376 bmdma 0xf008 irq 15
>> > [   30.864753] ata1.00: ATA-6: WDC WD2000JB-00EVA0, 15.05R15, max UDMA/100
>> > [   30.864756] ata1.00: 390721968 sectors, multi 16: LBA48
>> > [   30.871629] ata1.00: configured for UDMA/100
>>
>>Unfortunately we also see:
>> > [   48.285456] nvidia: module license 'NVIDIA' taints kernel.
>> > [   48.549725] ACPI: PCI Interrupt 0000:02:00.0[A] -> Link [APC4] -> GSI 19 (level, high) -> IRQ 20
>> > [   48.550149] NVRM: loading NVIDIA UNIX x86 Kernel Module  169.07  Thu Dec 13 18:42:56 PST 2007
>>
>>We have no way of debugging that module, so please try 2.6.24 without it.
>
>Sorry, I can't do this and have a working machine.  The nv driver has
> suffered bit rot or something since the FC2 days when it COULD run a 19"
> crt at 1600x1200, and will not drive this 20" wide screen lcd 1680x1050
> monitor at more than 800x600, which is absolutely butt ugly fuzzy, looking
> like a jpg compressed to 10%.  The system is not usable on a day-to-day basis
> without the nvidia driver.
>
>Fix the nv driver so it will run this screen at its native resolution and
> I'll be glad to run it even if it won't run google earth, which I do use
> from time to time.  Now, if in all the hits you can get from google on
> this, currently 14,800 just for 'exception Emask', apparently caused by a
> timeout, if 100% of the complainers are running nvidia drivers also, then I
> see a legit complaint.  Again, fix the nv driver so it will run my screen &
> I'll be glad to switch.  I can see the reason, sure, but the machine must
> be capable of doing its common day to day stuff, while using that driver,
> like running kde for kmail, and browsers that work.
>
>>If the problems persist, please try to capture a complete log from the
>>failing kernel -- the interesting bits are everything from initial boot
>>up to and including the first few errors. You may need to increase the
>>kernel's log buffer size if the log gets truncated (CONFIG_LOG_BUF_SHIFT).
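
CONFIG_LOG_BUF_SHIFT sets the kernel log buffer size as a power of two, so the buffer holds 2^shift bytes. A small Python sketch of that arithmetic follows; the helper names are hypothetical, and the 43090-byte figure is simply the size of the dmesg attached to this message.

```python
def log_buf_bytes(shift):
    """Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT."""
    return 1 << shift


def min_shift_for(nbytes):
    """Smallest shift whose buffer (2**shift bytes) holds at least nbytes."""
    shift = 0
    while (1 << shift) < nbytes:
        shift += 1
    return shift


if __name__ == "__main__":
    dmesg_size = 43090  # size of the attached dmesg, in bytes
    shift = min_shift_for(dmesg_size)
    print(shift, log_buf_bytes(shift))
```

This only sizes the buffer for a log that has already been captured; a log that was truncated needs a larger value than its truncated size suggests.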
>
>If by log you mean /var/log/messages, I have several megabytes of those.
>If you mean a live dmesg capture taken right now, it's attached. It contains
>several of these at the bottom.  I long ago made the kernel log buffer
>bigger, cuz it couldn't even show the start immediately after the boot, and
>even the dump to syslog was truncated.
>
>>There are no pata_amd changes from 2.6.24-rc7 to 2.6.24 final.
>
>That is what I was afraid of.  I've done some limited grepping in that
> branch of the kernel tree, and cannot seem to locate where this EH handler
> is being invoked from.
>
>There are 2 lines of interest in the dmesg:
>
>[    0.000000] Nvidia board detected. Ignoring ACPI timer override.
>[    0.000000] If you got timer trouble try acpi_use_timer_override
>
>But I have NDI what it means, kernel argument/xconfig option?
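
On the question just above: acpi_use_timer_override is a kernel boot parameter, passed on the kernel command line from the bootloader, rather than an xconfig option. A minimal Python sketch for checking whether the running kernel was booted with it is below; it assumes a Linux /proc/cmdline, and the function names are made up for illustration.

```python
def has_boot_param(cmdline, name):
    """Return True if 'name' appears as a parameter in a kernel command line.

    Matches both bare flags ("quiet") and key=value forms ("root=/dev/sda1").
    """
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


def running_kernel_has(name, path="/proc/cmdline"):
    """Check the command line the running kernel was booted with."""
    with open(path) as f:
        return has_boot_param(f.read(), name)


if __name__ == "__main__":
    print(running_kernel_has("acpi_use_timer_override"))
```

Whether setting the parameter has any bearing on the ata1 timeouts is not established anywhere in this thread; the sketch only shows how to confirm what the kernel was booted with.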
>
>I've also done some googling, and it appears this problem is fairly
> widespread since the switchover to libata was encouraged.  A stock fedora
> F8 kernel suffers the same freezes and eventually locks up, but does it
> without the error messages being logged, it just freezes, feeling identical
> to this in the minutes before the total freeze.  I've tried 2 of those too,
> but the newest one won't even run X.



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Deprive a mirror of its silver and even the Czar won't see his face.

View attachment "dmesg" of type "text/plain" (43090 bytes)