linux-kernel - Re: 2.6.33: pci 0000:00:00.0: address space collision / spontaenous reboots


Message-ID: <alpine.DEB.2.00.1003121700290.6929@p34.internal.lan>
Date:	Fri, 12 Mar 2010 17:07:02 -0500 (EST)
From:	Justin Piszcz <jpiszcz@...idpixels.com>
To:	Bjorn Helgaas <bjorn.helgaas@...com>
cc:	Yinghai Lu <yinghai@...nel.org>, linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org
Subject: Re: 2.6.33: pci 0000:00:00.0: address space collision / spontaenous
 reboots



On Fri, 12 Mar 2010, Bjorn Helgaas wrote:

> On Friday 12 March 2010 01:32:17 pm Justin Piszcz wrote:
>
>>>> Even with all boards removed:
>>>> [    0.133537] pci 0000:00:00.0: address space collision: [mem
>>>> 0xe0000000-0xffffffff 64bit] already in use
>>>>
>>>> 00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot
>>>> PCI-e_GFX and HT3 K8 part
>>>
>>> how about current linus' tree with pci=nocrs or pci=use_crs?
>>
>> Hi, I saw your second e-mail, so it sounds like a bad board or something
>> that Linux does not have a quirk for yet, but in any case, per your
>> recommendations:
>>
>> pci=nocrs:
>> http://home.comcast.net/~jpiszcz/20100312/dmesg-pci-nocrs.txt
>>
>> pci=use_crs:
>> http://home.comcast.net/~jpiszcz/20100312/dmesg-use-crs.txt
>>
>> No collision when pci=use_crs is used, BUT the system still crashes.
>>
>> Instead of collision, it says this:
>>
>> [    0.133598] PCI: pci_cache_line_size set to 64 bytes
>> [    0.133603] pci 0000:00:00.0: BAR 3: reserving [mem 0xe0000000-0xffffffff flags 0x120204] (d=0, p=0)
>> [    0.133606] pci 0000:00:00.0: no compatible bridge window for [mem 0xe0000000-0xffffffff 64bit]
>> [    0.133610] pci 0000:00:00.0: can't reserve [mem 0xe0000000-0xffffffff 64bit]
>> [    0.133617] pci 0000:00:11.0: BAR 0: reserving [io  0xff00-0xff07 flags 0x20101] (d=0, p=0)
>>
>> [    0.133735] Expanded resource reserved due to conflict with PCI Bus 0000:00
>
> Let's look at some of these messages:
>
>  pci_root PNP0A03:00: host bridge window [mem 0x40000000-0xfed0ffff]
>
> That looks normal to me.  If you could boot a current upstream kernel,
> e.g., 2.6.34-rc1, I think it might print more information about your
> AMD PCI address space routing.  BTW, it looks like you have four CPUs,
> but your kernel is only compiled to support two.
The latest e-mail shows similar messages (2.6.34-rc1).

>
>  pci 0000:00:00.0: reg 1c: [mem 0xe0000000-0xffffffff 64bit]
>  pci 0000:00:00.0: no compatible bridge window for [mem 0xe0000000-0xffffffff 64bit]
>  pci 0000:00:00.0: can't reserve [mem 0xe0000000-0xffffffff 64bit]
>
> These are just telling us that the device BAR 0xe0000000-0xffffffff
> doesn't fit inside the bridge window of 0x40000000-0xfed0ffff.  I don't
> know why the device has that weird-looking BAR, but that by itself
> shouldn't be fatal because we don't have any drivers that try to use
> that BAR.
OK. By the way, keep in mind that all boards have been removed from the
system. The serial port, 1394, floppy, and some other things have also been
disabled on the motherboard to free up IRQs in case that was the cause; it
made no difference. I also tried many pci= options, noapic, and acpi=off;
nothing helps.
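
To make the "doesn't fit inside the bridge window" point concrete, here is a
minimal sketch of the containment check, using the two address ranges from
the messages quoted above. The function name and the tuple representation are
assumptions made for illustration; this is not the kernel's resource code.

```python
# Illustrative only: models inclusive [start, end] address ranges as tuples.
# contains() is a hypothetical helper, not a kernel function.

def contains(window, region):
    """Return True if region lies entirely inside window (inclusive ends)."""
    w_start, w_end = window
    r_start, r_end = region
    return w_start <= r_start and r_end <= w_end

# Values taken from the dmesg lines quoted in this thread.
bridge_window = (0x40000000, 0xfed0ffff)   # host bridge window
bar = (0xe0000000, 0xffffffff)             # 0000:00:00.0 reg 1c

# The BAR starts inside the window but ends past 0xfed0ffff,
# so it is not fully contained.
fits = contains(bridge_window, bar)
print(fits)
```

The BAR begins inside the window but runs past its end, which matches the
"no compatible bridge window" message.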

>
>  Expanded resource reserved due to conflict with PCI Bus 0000:00
>
> This comes from e820_reserve_resources_late().  I wish it were a
> more useful message and showed the actual conflict and what was
> expanded, but I don't think it's a problem in itself.
Ok..

>
>  pnp 00:0a: disabling [mem 0x000f0000-0x000f3fff] because it overlaps 0000:00:00.0 BAR 3 [mem 0x00000000-0x1fffffff 64bit]
>
> We failed to reserve the 0xe0000000-0xffffffff region above, so we just
> cleared out the resource.  It keeps the same size, so it ends up at
> 0x00000000-0x1fffffff, where it appears to conflict with a lot of PNP
> devices.  But this isn't a real conflict; it's just Linux being stupid
> because we don't handle that PCI resource correctly.
Ok..
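
The arithmetic behind that 0x00000000-0x1fffffff range can be sketched as
follows. This is only a model of what Bjorn describes (the resource keeps its
size and ends up based at zero); the helper names are hypothetical and the
code is not taken from the kernel.

```python
# Illustrative only: inclusive [start, end] ranges as tuples.
# size(), rebase_to_zero() and overlaps() are hypothetical helpers.

def size(region):
    """Number of bytes covered by an inclusive range."""
    return region[1] - region[0] + 1

def rebase_to_zero(region):
    """Same size, start address cleared to 0."""
    return (0, size(region) - 1)

def overlaps(a, b):
    """True if two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]

failed_bar = (0xe0000000, 0xffffffff)   # the region that could not be reserved
pnp_region = (0x000f0000, 0x000f3fff)   # pnp 00:0a, from the quoted message

cleared = rebase_to_zero(failed_bar)
bar_size = size(failed_bar)
conflict = overlaps(cleared, pnp_region)

print(hex(bar_size))                 # 512 MB region
print([hex(x) for x in cleared])
print(conflict)
```

A 512 MB range based at zero covers all of low memory, so it appears to
overlap the PNP devices that live there even though nothing actually decodes
those addresses.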

>
> So the messages *look* alarming, but I don't see anything there that
> should cause a spontaneous reboot.
The system stays up for 5 minutes, 10 minutes, sometimes 1-2 hours, and then
the box reboots. Even with various kernel debugging options enabled, nothing
is captured. I have not set up netconsole for this server yet, but I don't
think it would catch anything either, given how the failure occurs. It is a
brand new motherboard/memory/etc. What is interesting is that running stress
causes no issues, but I was able to make it crash by reading all of the
drives on the system and running lilo at the same time. That was the only
time I made it crash on demand (or rather "reboot", as there are no logs of
the crash).
>
> Is this a regression?  Did the system ever work reliably with any
> Linux kernel?  If not, I'd suspect a hardware problem like bad memory.
The memory has been tested with the latest memtest from the latest System
Rescue CD. The system has 1 stick of memory (1GB), and it passed the memory
test successfully with no errors.

>
> Bjorn

Thanks for the response..

Justin.