Message-ID: <4CD91D58.7080508@vmware.com>
Date: Tue, 09 Nov 2010 11:07:20 +0100
From: Thomas Hellstrom <thellstrom@...are.com>
To: Thomas Hellstrom <thellstrom@...are.com>
CC: Markus Trippelsdorf <markus@...ppelsdorf.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
Michel Danzer <daenzer@...are.com>
Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference
On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
> On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
>> On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
>>> On 11/08/2010 09:53 PM, Jerome Glisse wrote:
>>>> On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
>>>> <markus@...ppelsdorf.de> wrote:
>>>>> On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
>>>>>> On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
>>>>>>> On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf
>>>>>>> wrote:
>>>>>>>> I can trigger a kernel crash on my system by simply loading
>>>>>>>> this png
>>>>>>>> image with firefox:
>>>>>>>> http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
>>>>>>>>
>>>>>>> Sorry the above link is wrong, this is the right one (that
>>>>>>> triggers the
>>>>>>> crash):
>>>>>>> http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
>>>>>>>
>>>>>> I triggered it a few more times and took the attached picture.
>>>>>> It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
>>>>>> (Sorry for the bad picture quality)
>>>>> And here the same BUG in plaintext (should be a bit easier to read):
>>>>>
>>>>> Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
>>>>> Nov 8 19:28:23 arch kernel: kernel BUG at
>>>>> drivers/gpu/drm/ttm/ttm_bo.c:1628!
>>>>>
>>>> Thomas, this bug seems to point to a case where we end up trying to add
>>>> an entry at the same offset in the rb tree for addr_space_mm. After
>>>> carefully reviewing the locking around the rb tree modification and
>>>> addr_space_mm, I am fairly confident that no race can occur. Would you
>>>> have any idea what might be going wrong here? I guess I would
>>>> ultimately need to dump the mm and rb tree state when the BUG triggers
>>>> to try to understand the state of things.
>>> I agree there shouldn't be a race in this case.
>>> The locking around these operations is simple and straightforward.
>>>
>>> So this IMHO should either be a memory corruption or a bug in the
>>> range manager. I've never seen this BUG trigger before. Dumping mm /
>>> rb tree contents or bisecting should probably find the culprit.
>> OK I've found the buggy commit by bisection:
>>
>> e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
>> commit e376573f7267390f4e1bdc552564b6fb913bce76
>> Author: Michel Dänzer<daenzer@...are.com>
>> Date: Thu Jul 8 12:43:28 2010 +1000
>>
>> drm/radeon: fall back to GTT if bo creation/validation in VRAM
>> fails.
>>
>> This fixes a problem where on low VRAM cards we'd run out of
>> space for validation.
>>
>> [airlied: Tested on my M7, Thinkpad T42, compiz works with no
>> problems.]
>>
>> Signed-off-by: Michel Dänzer<daenzer@...are.com>
>> Cc: stable@...nel.org
>> Signed-off-by: Dave Airlie<airlied@...hat.com>
>>
>> Please note that this is an old commit from 2.6.36-rc. When I revert
>> it the
>> kernel no longer crashes. Instead I see the following in my dmesg:
>>
>
> Hmm, so this sounds like something in the Radeon eviction error path
> is causing corruption.
> I had a similar problem with vmwgfx, when I tried to unref a BO
> _after_ ttm_bo_init() failed.
> ttm_bo_init() is really supposed to call unref itself for various
> reasons, so calling unref() or kfree() after a failed ttm_bo_init()
> will cause corruption.
>
> In any case, the error below also suggests something is a bit fragile
> in the Radeon driver:
>
> First, an accelerated eviction may fail, like in the message below,
> but then there must always be a backup plan, like unaccelerated
> eviction to system. On BO creation, there are a number of placement
> strategies, but if all else fails, it should be possible to initially
> place the BO in system memory.
>
> Second, if bo validation fails during a command submission, due to
> insufficient VRAM / TT, then the driver should retry the complete
> validation cycle after first blocking all other validators and then
> evicting everything not pinned, to avoid failures due to fragmentation.
>
> /Thomas
>
Indeed, it looks like the commit you mention simply retries ttm_bo_init()
after it has failed. At that point the bo has already been destroyed, so
the retry works on a freed object, which is probably what triggers the BUG
you are seeing.

Admittedly, the fact that ttm_bo_init() calls unref on failure is not
properly documented in the function description. The reason for doing so
is to have a single path for freeing all BO resources that were already
allocated at the point of failure.
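
To illustrate the ownership rule, here is a minimal userspace sketch. It is
not the TTM code: struct demo_bo, demo_bo_init(), demo_bo_unref(),
demo_bo_create() and the demo_live_objects counter are all hypothetical
names, and the "limit" argument merely stands in for a placement that may
or may not have room. The only point is that an init function which drops
the last reference on failure leaves the caller with nothing to retry on,
so each attempt needs a freshly allocated object.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object; NOT struct ttm_buffer_object. */
struct demo_bo {
    int refcount;
    void *backing;
};

/* Number of demo_bo objects currently allocated (illustration only). */
static int demo_live_objects;

static void demo_bo_unref(struct demo_bo *bo)
{
    assert(bo->refcount > 0);
    if (--bo->refcount == 0) {
        free(bo->backing);
        free(bo);
        demo_live_objects--;
    }
}

/*
 * Assumed behaviour, modelled on the description above: on failure the
 * init function drops the only reference itself, so the object no longer
 * exists when the error is returned.
 */
static int demo_bo_init(struct demo_bo *bo, size_t size, size_t limit)
{
    bo->refcount = 1;
    bo->backing = NULL;

    if (size > limit) {
        demo_bo_unref(bo);      /* single cleanup path; bo is freed here */
        return -ENOMEM;
    }

    bo->backing = malloc(size);
    if (!bo->backing) {
        demo_bo_unref(bo);
        return -ENOMEM;
    }
    return 0;
}

/*
 * Retry with a fallback: a new object is allocated for every attempt,
 * because a failed demo_bo_init() has already released the old one.
 */
static struct demo_bo *demo_bo_create(size_t size, const size_t *limits, int n)
{
    for (int i = 0; i < n; i++) {
        struct demo_bo *bo = malloc(sizeof(*bo));

        if (!bo)
            return NULL;
        demo_live_objects++;

        if (demo_bo_init(bo, size, limits[i]) == 0)
            return bo;
        /* bo is gone: no unref, no free, no second init on this pointer. */
    }
    return NULL;
}
```

A caller would pass the candidate limits in order of preference, for
example a small one first and a larger fallback second. Calling
demo_bo_init() a second time on the same pointer after a failure would be
the userspace equivalent of the retry discussed above.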
/Thomas
>
>> [TTM] Failed to find memory space for buffer 0xffff880113e10e48
>> eviction.
>> [TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
>> [TTM] placement[0]=0x00070002 (1)
>> [TTM] has_type: 1
>> [TTM] use_type: 1
>> [TTM] flags: 0x0000000A
>> [TTM] gpu_offset: 0xA0000000
>> [TTM] size: 131072
>> [TTM] available_caching: 0x00070000
>> [TTM] default_caching: 0x00010000
>> [TTM] 0x00000000-0x00000001: 1: used
>> [TTM] 0x00000001-0x00000011: 16: used
>> [TTM] 0x00000011-0x00000111: 256: used
>> [TTM] 0x00000111-0x00000211: 256: used
>> [TTM] 0x00000211-0x00000248: 55: free
>> [TTM] 0x00000248-0x0000024c: 4: used
>> [TTM] 0x0000024c-0x00001976: 5930: free
>> [TTM] 0x00001976-0x000021aa: 2100: used
>> [TTM] 0x000021aa-0x0000285f: 1717: free
>> [TTM] 0x0000285f-0x00002860: 1: used
>> [TTM] 0x00002860-0x00002873: 19: free
>> [TTM] 0x00002873-0x000029b3: 320: used
>> [TTM] 0x000029b3-0x00020000: 120397: free
>> [TTM] total: 131072, used 2954 free 128118
>> [drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
>> (117555200, 4, 4096, -12)
>> radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
>> ...
>>
>> And the following in the xorg log buffer:
>>
>> Failed to alloc memory
>> Failed to allocat:
>> size: : 117555200 bytes
>> alignment : 0 bytes
>> domains : 4
>> ...
>>