[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <77b696b9-3248-d329-4f7d-5e27a21eabff@amd.com>
Date: Mon, 11 Jan 2021 21:45:46 +0100
From: Christian König <christian.koenig@....com>
To: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
Cc: amd-gfx list <amd-gfx@...ts.freedesktop.org>,
dri-devel <dri-devel@...ts.freedesktop.org>,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
Harry Wentland <harry.wentland@....com>,
"Deucher, Alexander" <alexander.deucher@....com>
Subject: Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin
framebuffer with error -12
Hi Mike,
Am 11.01.21 um 20:23 schrieb Mikhail Gavrilov:
> On Mon, 11 Jan 2021 at 19:01, Christian König <christian.koenig@....com> wrote:
>
>> Changing the page table attributes while releasing memory might sleep.
>> So we can't use a spinlock here.
>>
>> Thanks for the report, a patch to fix this is on the mailing list now.
> Can you look also the first trace?
Unfortunately not, that's DC stuff. Easiest is to assign this as a bug
tracker to our DC team.
> Here a same error message "sleeping function called from invalid
> context" and a lot of [amdgpu] code.
[SNIP]
>>> -12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by
>>> the problem above, maybe something completely unrelated.
>>>
>>> I will take a look.
>> The looks like a completely unrelated memory leak to me.
>>
>> Probably best if you open up a bug report for this.
> Yes, the monitor still turns off after applying patch "make the pool
> shrinker lock a mutex".
> Anyway patch fixed the issue with flood of message "BUG: sleeping
> function called from invalid context at mm/vmalloc.c:1756" so kernel
> log became cleaner.
At least some progress. Any objections that I add your e-mail address as
tested-by tag?
> Now the issue with turns off monitor looks in logs so:
>
> DMA-API: cacheline tracking ENOMEM, dma-debug disabled
> amdgpu 0000:0b:00.0: amdgpu: 000000006b791523 pin failed
> [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin
> framebuffer with error -12
> BUG: kernel NULL pointer dereference, address: 0000000000000060
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP NOPTI
> CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: G W ---------
> --- 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 2802 10/21/2020
> RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]
> Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0
> 0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b
> 45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65
> RSP: 0018:ffffa7400532b9c0 EFLAGS: 00010286
> RAX: ffff978e2ae25800 RBX: ffff97910ec12058 RCX: ffff978e12caac70
> RDX: 0000000080000010 RSI: 0000000000000000 RDI: ffff97912c3d99c0
> RBP: ffff97912c3d99c0 R08: 0000000000000000 R09: 0000000070b3a000
> R10: 0000000000000002 R11: 0000000000000000 R12: ffff97912c3d99c0
> R13: 0000000000000000 R14: ffffa7400532ba90 R15: ffff978e182c6350
> FS: 00007f070bb1b640(0000) GS:ffff979509200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000060 CR3: 00000001f0cd2000 CR4: 0000000000350ee0
> Call Trace:
> ttm_tt_populate+0xa9/0xe0 [ttm]
> ttm_bo_handle_move_mem+0x142/0x180 [ttm]
> ttm_bo_validate+0x12e/0x1c0 [ttm]
I can take a look at this one here. Looks like some missing error
handling when allocating memory.
Can you decode to which line number ttm_tt_swapin+0x34 points to?
[SNIP]
> You said that I need open up a bug report you means site
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.kernel.org%2F&data=04%7C01%7Cchristian.koenig%40amd.com%7C75040f5053404b0f302b08d8b666769b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637459898491581880%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=IbkSfHK%2BD13OCcYMg%2BlNsZixi9gDEQEfS7Mxyf7vGdM%3D&reserved=0 ?
> I thought mailing lists is better because bug report on
> bugzilla.kernel.org usually leave opened for several years without
> attention.
Please use this one here:
https://gitlab.freedesktop.org/drm/amd/-/issues/new
If you can't find the DC guys of hand in the assignee list just assign
to me and I will forward.
But what you have in your logs so far are only unrelated symptoms, the
root of the problem is that somebody is leaking memory.
What you could do as well is to try to enable kmemleak and maybe try
some bleeding edge branch like drm-misc-fixes or Alex
amd-staging-drm-next branch.
Thanks for the help,
Christian.
Powered by blists - more mailing lists