linux-kernel - Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

Message-ID: <20101109092920.GA1542@arch.trippelsdorf.de>
Date:	Tue, 9 Nov 2010 10:29:20 +0100
From:	Markus Trippelsdorf <markus@...ppelsdorf.de>
To:	Thomas Hellstrom <thellstrom@...are.com>
Cc:	Jerome Glisse <j.glisse@...il.com>,
	"dri-devel@...ts.freedesktop.org" <dri-devel@...ts.freedesktop.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"airlied@...ux.ie" <airlied@...ux.ie>,
	Michel Dänzer <daenzer@...are.com>
Subject: Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer
 dereference

On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
> On 11/08/2010 09:53 PM, Jerome Glisse wrote:
> >On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
> ><markus@...ppelsdorf.de>  wrote:
> >>On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
> >>>On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
> >>>>On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
> >>>>>I can trigger a kernel crash on my system by simply loading this png
> >>>>>image with firefox:
> >>>>>http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
> >>>>Sorry the above link is wrong, this is the right one (that triggers the
> >>>>crash):
> >>>>http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
> >>>I triggered it a few more times and took the attached picture.
> >>>It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
> >>>(Sorry for the bad picture quality)
> >>And here the same BUG in plaintext (should be a bit easier to read):
> >>
> >>Nov  8 19:28:23 arch kernel: ------------[ cut here ]------------
> >>Nov  8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
> >>
> >Thomas, this bug seems to point to a case where we end up trying to add
> >an entry at the same offset in the rb tree for addr_space_mm. After
> >carefully reviewing the locking around the rb tree modification and
> >addr_space_mm, I am fairly confident that no race can occur. Would you
> >have any idea what might be going wrong here? I guess I would ultimately
> >need to dump the mm and rb tree state when the BUG gets triggered to try
> >to understand the state of things.
> 
> I agree there shouldn't be a race in this case.
> The locking around these operations is simple and straightforward.
> 
> So this IMHO should either be a memory corruption or a bug in the
> range manager. I've never seen this BUG trigger before. Dumping mm /
> rb tree contents or bisecting should probably find the culprit.
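
For readers following along: the failure mode Jerome describes (adding an
entry at an offset that is already present in the tree) can be sketched as
below. This is only an illustration on a plain, unbalanced binary search
tree. The names Node and insert are hypothetical, the AssertionError stands
in for the kernel's BUG(), and nothing here is taken from ttm_bo.c.

```python
class Node:
    """One tree entry, keyed by its offset in the address space."""

    def __init__(self, offset):
        self.offset = offset
        self.left = None
        self.right = None


def insert(root, offset):
    """Insert offset into the tree and return the root; reject duplicates."""
    if root is None:
        return Node(offset)
    cur = root
    while True:
        if offset < cur.offset:
            if cur.left is None:
                cur.left = Node(offset)
                return root
            cur = cur.left
        elif offset > cur.offset:
            if cur.right is None:
                cur.right = Node(offset)
                return root
            cur = cur.right
        else:
            # Stand-in for BUG(): this offset is already in the tree, which
            # should be impossible if the range manager hands out each
            # offset only once.
            raise AssertionError(f"duplicate offset {offset:#x}")
```

If the locking is correct, as both of you conclude, a duplicate like this
would have to come from the allocator returning an offset twice or from
memory corruption.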

OK I've found the buggy commit by bisection:

e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <daenzer@...are.com>
Date:   Thu Jul 8 12:43:28 2010 +1000

    drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.

    This fixes a problem where on low VRAM cards we'd run out of space for validation.

    [airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]

    Signed-off-by: Michel Dänzer <daenzer@...are.com>
    Cc: stable@...nel.org
    Signed-off-by: Dave Airlie <airlied@...hat.com>

Please note that this is an old commit from 2.6.36-rc. When I revert it, the
kernel no longer crashes. Instead, I see the following in my dmesg:

[TTM] Failed to find memory space for buffer 0xffff880113e10e48 eviction.
[TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
[TTM]   placement[0]=0x00070002 (1)
[TTM]     has_type: 1
[TTM]     use_type: 1
[TTM]     flags: 0x0000000A
[TTM]     gpu_offset: 0xA0000000
[TTM]     size: 131072
[TTM]     available_caching: 0x00070000
[TTM]     default_caching: 0x00010000
[TTM]  0x00000000-0x00000001:        1: used
[TTM]  0x00000001-0x00000011:       16: used
[TTM]  0x00000011-0x00000111:      256: used
[TTM]  0x00000111-0x00000211:      256: used
[TTM]  0x00000211-0x00000248:       55: free
[TTM]  0x00000248-0x0000024c:        4: used
[TTM]  0x0000024c-0x00001976:     5930: free
[TTM]  0x00001976-0x000021aa:     2100: used
[TTM]  0x000021aa-0x0000285f:     1717: free
[TTM]  0x0000285f-0x00002860:        1: used
[TTM]  0x00002860-0x00002873:       19: free
[TTM]  0x00002873-0x000029b3:      320: used
[TTM]  0x000029b3-0x00020000:   120397: free
[TTM]  total: 131072, used 2954 free 128118
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
...
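
To cross-check the range dump above, a small script along these lines can
re-add the per-range page counts and report the largest free hole. It is a
minimal sketch: the regular expression only assumes the line layout shown
in this dmesg excerpt, and DUMP, LINE and summarize are names made up for
the example.

```python
import re

# Range lines copied verbatim from the dmesg excerpt above.
DUMP = """\
[TTM]  0x00000000-0x00000001:        1: used
[TTM]  0x00000001-0x00000011:       16: used
[TTM]  0x00000011-0x00000111:      256: used
[TTM]  0x00000111-0x00000211:      256: used
[TTM]  0x00000211-0x00000248:       55: free
[TTM]  0x00000248-0x0000024c:        4: used
[TTM]  0x0000024c-0x00001976:     5930: free
[TTM]  0x00001976-0x000021aa:     2100: used
[TTM]  0x000021aa-0x0000285f:     1717: free
[TTM]  0x0000285f-0x00002860:        1: used
[TTM]  0x00002860-0x00002873:       19: free
[TTM]  0x00002873-0x000029b3:      320: used
[TTM]  0x000029b3-0x00020000:   120397: free
"""

LINE = re.compile(
    r"\[TTM\]\s+0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+):\s+(\d+):\s+(used|free)"
)


def summarize(dump):
    """Return (used_pages, free_pages, largest_free_hole) for a range dump."""
    used = free = largest_free = 0
    for match in LINE.finditer(dump):
        start, end = int(match.group(1), 16), int(match.group(2), 16)
        pages, state = int(match.group(3)), match.group(4)
        # Each printed size should equal the width of its address range.
        if end - start != pages:
            raise ValueError(f"inconsistent range line: {match.group(0)}")
        if state == "used":
            used += pages
        else:
            free += pages
            largest_free = max(largest_free, pages)
    return used, free, largest_free


if __name__ == "__main__":
    print(summarize(DUMP))
```

For the dump above this gives 2954 used and 128118 free pages, matching the
"total" line, and a largest free hole of 120397 pages. That is more than the
25650 pages in the "No space for" line, so the hole sizes alone do not seem
to account for the failure.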

And the following in the xorg log buffer:

Failed to alloc memory
Failed to allocat:
   size:     : 117555200 bytes
   alignment : 0 bytes
   domains   : 4
...
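
For reference, the numbers in these messages can be decoded with a few lines
of Python. This is a sketch under stated assumptions: 4 KiB pages, the GEM
domain bits as I remember them from radeon_drm.h (CPU 0x1, GTT 0x2,
VRAM 0x4), and the fields of the "Failed to allocate GEM object" line read
as size, domain, alignment and error code. The describe helper is
hypothetical.

```python
import errno

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Assumed GEM domain bits from the radeon DRM uAPI header (radeon_drm.h).
RADEON_GEM_DOMAINS = {0x1: "CPU", 0x2: "GTT", 0x4: "VRAM"}


def describe(size_bytes, domains, err):
    """Decode a failed allocation request into human-readable fields."""
    pages = -(-size_bytes // PAGE_SIZE)  # round up to whole pages
    names = [name for bit, name in RADEON_GEM_DOMAINS.items() if domains & bit]
    return {
        "pages": pages,
        "mib": size_bytes / 2**20,
        "domains": names,
        # Kernel error codes are negative errno values.
        "errno": errno.errorcode[-err],
    }


if __name__ == "__main__":
    # Values taken from the log lines above.
    print(describe(117555200, 4, -12))
```

With the values from the logs, the request is 28700 pages (about 112 MiB)
in the VRAM domain, and -12 is -ENOMEM on Linux.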

-- 
Markus