Message-Id: <202106112023.11270.linux@zary.sk>
Date: Fri, 11 Jun 2021 20:23:10 +0200
From: Ondrej Zary <linux@...y.sk>
To: Christian König <christian.koenig@....com>
Cc: Ben Skeggs <bskeggs@...hat.com>, dri-devel@...ts.freedesktop.org,
nouveau@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
On Friday 11 June 2021 14:38:18 Christian König wrote:
>
> On 10.06.21 at 19:59, Christian König wrote:
> > On 10.06.21 at 19:50, Ondrej Zary wrote:
> >> [SNIP]
> >>> I can't see how this is called from the nouveau code, only possibility I
> >>> see is that it is maybe called through the AGP code somehow.
> >> Yes, you're right:
> >> [ 13.192663] Call Trace:
> >> [ 13.192678] dump_stack+0x54/0x68
> >> [ 13.192690] ttm_tt_init+0x11/0x8a [ttm]
> >> [ 13.192699] ttm_agp_tt_create+0x39/0x51 [ttm]
> >> [ 13.192840] nouveau_ttm_tt_create+0x17/0x22 [nouveau]
> >> [ 13.192856] ttm_tt_create+0x78/0x8c [ttm]
> >> [ 13.192864] ttm_bo_handle_move_mem+0x7d/0xca [ttm]
> >> [ 13.192873] ttm_bo_validate+0x92/0xc8 [ttm]
> >> [ 13.192883] ttm_bo_init_reserved+0x216/0x243 [ttm]
> >> [ 13.192892] ttm_bo_init+0x45/0x65 [ttm]
> >> [ 13.193018] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> >> [ 13.193150] nouveau_bo_init+0x8c/0x94 [nouveau]
> >> [ 13.193273] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> >> [ 13.193407] nouveau_bo_new+0x44/0x57 [nouveau]
> >> [ 13.193537] nouveau_channel_prep+0xa3/0x269 [nouveau]
> >> [ 13.193665] nouveau_channel_new+0x3c/0x5f7 [nouveau]
> >> [ 13.193679] ? slab_free_freelist_hook+0x3b/0xa7
> >> [ 13.193686] ? kfree+0x9e/0x11a
> >> [ 13.193781] ? nvif_object_sclass_put+0xd/0x16 [nouveau]
> >> [ 13.193908] nouveau_drm_device_init+0x2e2/0x646 [nouveau]
> >> [ 13.193924] ? pci_enable_device_flags+0x1e/0xac
> >> [ 13.194052] nouveau_drm_probe+0xeb/0x188 [nouveau]
> >> [ 13.194182] ? nouveau_drm_device_init+0x646/0x646 [nouveau]
> >> [ 13.194195] pci_device_probe+0x89/0xe9
> >> [ 13.194205] really_probe+0x127/0x2a7
> >> [ 13.194212] driver_probe_device+0x5b/0x87
> >> [ 13.194219] device_driver_attach+0x2e/0x41
> >> [ 13.194226] __driver_attach+0x7c/0x83
> >> [ 13.194232] bus_for_each_dev+0x4c/0x66
> >> [ 13.194238] driver_attach+0x14/0x16
> >> [ 13.194244] ? device_driver_attach+0x41/0x41
> >> [ 13.194251] bus_add_driver+0xc5/0x16c
> >> [ 13.194258] driver_register+0x87/0xb9
> >> [ 13.194265] __pci_register_driver+0x38/0x3b
> >> [ 13.194271] ? 0xf0c0d000
> >> [ 13.194362] nouveau_drm_init+0x14c/0x1000 [nouveau]
> >>
> >> How is ttm_dma_tt->dma_address allocated?
> >
> > Mhm, I need to double check how AGP is supposed to work.
> >
> > Since barely anybody is using it these days it is something which
> > breaks from time to time.
>
> I have no idea how that ever worked in the first place since AGP isn't
> supposed to sync between CPU/GPU. Everything is coherent for that case.
>
> Anyway here is a patch which adds a check to those functions if the
> dma_address array is allocated in the first place. Please test it.
Thanks, the patch fixes the problem and nouveau now works!
Should be applied to 5.12-stable too (5.11 is affected too but EOL).
It's weird that it worked before.
Looks like dma_address was used uninitialized - it contained some random
crap:
[ 12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
[ 12.293321] ttm_dma->dma_address[0]=0x0
[ 12.293341] ttm_dma->dma_address[1]=0x0
[ 12.293360] ttm_dma->dma_address[2]=0xee728980
[ 12.293379] ttm_dma->dma_address[3]=0xed1cb120
[ 12.293397] ttm_dma->dma_address[4]=0x12
[ 12.293416] ttm_dma->dma_address[5]=0x0
[ 12.293434] ttm_dma->dma_address[6]=0x1
[ 12.293453] ttm_dma->dma_address[7]=0x0
[ 12.293471] ttm_dma->dma_address[8]=0x10000
[ 12.293490] ttm_dma->dma_address[9]=0x0
[ 12.293510] ttm_dma->dma_address[10]=0x101
[ 12.293528] ttm_dma->dma_address[11]=0xee7289ec
[ 12.293546] ttm_dma->dma_address[12]=0xee7289ec
[ 12.293564] ttm_dma->dma_address[13]=0x0
[ 12.293581] ttm_dma->dma_address[14]=0x0
[ 12.293599] ttm_dma->dma_address[15]=0x0
[ 12.293616] ttm_dma->dma_address[16]=0x0
[ 12.293634] ttm_dma->dma_address[17]=0x0
But it did not matter, as dma_sync_single_for_device() is a no-op here.
Now that dma_address is properly initialized to NULL, it crashes...
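
To illustrate the kind of check described above (skip the sync when the dma_address array was never allocated), here is a minimal userspace sketch. It is not the actual patch: struct tt_sketch, sync_for_device_sketch() and the local dma_addr_t typedef are hypothetical stand-ins for the kernel types, and the loop body only marks where the driver would call dma_sync_single_for_device().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's dma_addr_t. */
typedef uint64_t dma_addr_t;

/* Hypothetical, reduced model of a TT object with an optional DMA array. */
struct tt_sketch {
	size_t num_pages;
	dma_addr_t *dma_address;	/* NULL if no array was allocated (e.g. the AGP path) */
};

/*
 * Returns the number of pages that would have been synced.
 * A NULL dma_address means there is nothing to sync, so return early
 * instead of dereferencing it.
 */
static size_t sync_for_device_sketch(const struct tt_sketch *tt)
{
	size_t i, synced = 0;

	assert(tt != NULL);

	if (!tt->dma_address)
		return 0;

	for (i = 0; i < tt->num_pages; i++) {
		/* The real driver would sync tt->dma_address[i] here. */
		(void)tt->dma_address[i];
		synced++;
	}

	return synced;
}
```

With num_pages=18 and dma_address=NULL, as in the AGP case here, the function returns 0 without touching the array.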
> Thanks,
> Christian.
>
> >
> > Thanks for the backtrace,
> > Christian.
> >
> >> I cannot find any assignment executed (in the working code):
> >>
> >> $ git grep dma_address\ = drivers/gpu/
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: sg->sgl->dma_address = addr;
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = (mm_node->start << PAGE_SHIFT) + offset;
> >> drivers/gpu/drm/i915/gvt/scheduler.c: sg->dma_address = addr;
> >> drivers/gpu/drm/i915/i915_gpu_error.c: sg->dma_address = it;
> >> drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
> >> drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
> >> drivers/gpu/drm/ttm/ttm_tt.c: ttm_dma->dma_address = NULL;
> >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_phys_addr;
> >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_dma_addr;
> >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_sg_addr;
> >>
> >> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
> >> ttm_sg_tt_alloc_page_directory().
> >> Confirmed by adding printk()s that they're NOT called.
> >>
> >>
> >
>
>
--
Ondrej Zary