Message-ID: <ac36d1e2-36a4-473c-9acf-e0a1fc7d3bfb@collabora.com>
Date: Mon, 27 Nov 2023 12:26:52 +0100
From: AngeloGioacchino Del Regno
<angelogioacchino.delregno@...labora.com>
To: Marek Szyprowski <m.szyprowski@...sung.com>,
Krzysztof Kozlowski <krzysztof.kozlowski@...aro.org>,
Boris Brezillon <boris.brezillon@...labora.com>
Cc: Steven Price <steven.price@....com>, tzimmermann@...e.de,
linux-kernel@...r.kernel.org, mripard@...nel.org,
dri-devel@...ts.freedesktop.org, wenst@...omium.org,
kernel@...labora.com,
"linux-samsung-soc@...r.kernel.org"
<linux-samsung-soc@...r.kernel.org>
Subject: Re: [PATCH] drm/panfrost: Really power off GPU cores in
panfrost_gpu_power_off()
On 27/11/23 12:24, Marek Szyprowski wrote:
> On 24.11.2023 13:45, Marek Szyprowski wrote:
>> On 22.11.2023 10:29, Krzysztof Kozlowski wrote:
>>> On 22/11/2023 10:06, AngeloGioacchino Del Regno wrote:
>>>>>>> Hey Krzysztof,
>>>>>>>
>>>>>>> This is interesting. It might be about the cores that are missing
>>>>>>> from the partial core_mask raising interrupts, but an external
>>>>>>> abort on non-linefetch is strange to see here.
>>>>>> I've seen such external aborts in the past, and the fault type has
>>>>>> often been misleading. It's unlikely to have anything to do with a
>>>>> Yeah, often accessing device with power or clocks gated.
>>>>>
>>>> Except my commit does *not* gate SoC power, nor SoC clocks 🙂
>>> It could be that something (like clocks or power supplies) was missing
>>> on this board/SoC, which was not critical till your patch came.
>>>
>>>> What the "Really power off ..." commit does is to ask the GPU to
>>>> internally power off the shaders, tilers and L2, that's why I say
>>>> that it is strange to see that kind of abort.
>>>>
>>>> The GPU_INT_CLEAR, GPU_INT_STAT, GPU_FAULT_STATUS and
>>>> GPU_FAULT_ADDRESS_{HI/LO} registers should still be accessible even
>>>> with shaders, tilers and cache OFF.
>>>>
>>>> Anyway, yes, synchronizing IRQs before calling the poweroff sequence
>>>> would also work, but that would add quite a bit of latency to the
>>>> runtime_suspend() call. In this case I'd rather avoid any register
>>>> r/w in the handler, either by checking whether the GPU is supposed to
>>>> be OFF or by clearing interrupts, although the latter may not work if
>>>> they are generated after the poweroff function has run. Or we could
>>>> simply disable the IRQ after power_off, but that would be hacky (as
>>>> well).
>>>>
>>>>
>>>> Let's see if asking to poweroff *everything* works:
>>> Worked.
>>
>> Yes, I also got into this issue some time ago, but I didn't report it
>> because I also had some power supply related problems on my test farm
>> and everything was a bit unstable. I wasn't 100% sure that the
>> $subject patch is responsible for the observed issues. Now, after
>> fixing power supply, I confirm that the issue was revealed by the
>> $subject patch and above mentioned change fixes the problem. Feel free
>> to add:
>>
>> Tested-by: Marek Szyprowski <m.szyprowski@...sung.com>
>
>
> I must revoke my tested-by tag for the above fix alone. Although it
> fixed the boot issue and the system stability issue, it looks like
> something is still missing: opening the panfrost dri device causes a
> system crash:
>
> root@...get:~# ./modetest -C
> trying to open device 'i915'...failed
> trying to open device 'amdgpu'...failed
> trying to open device 'radeon'...failed
> trying to open device 'nouveau'...failed
> trying to open device 'vmwgfx'...failed
> trying to open device 'omapdrm'...failed
> trying to open device 'exynos'...done
> root@...get:~#
>
> 8<--- cut here ---
> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c6803c
> [f0c6803c] *pgd=42d87811, *pte=11800653, *ppte=11800453
> Internal error: : 1008 [#1] PREEMPT SMP ARM
> Modules linked in: exynos_gsc s5p_mfc s5p_jpeg v4l2_mem2mem
> videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 videobuf2_common
> videodev mc s5p_cec
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> 6.7.0-rc2-next-20231127-00055-ge14abcb527d6 #7649
> Hardware name: Samsung Exynos (Flattened Device Tree)
> PC is at panfrost_gpu_irq_handler+0x18/0xfc
> LR is at __handle_irq_event_percpu+0xcc/0x31c
> ...
> Process swapper/0 (pid: 0, stack limit = 0x0e2875ff)
> Stack: (0xc1301e48 to 0xc1302000)
> ...
> panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
> __handle_irq_event_percpu from handle_irq_event+0x38/0x80
> handle_irq_event from handle_fasteoi_irq+0x9c/0x250
> handle_fasteoi_irq from generic_handle_domain_irq+0x24/0x34
> generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
> gic_handle_irq from generic_handle_arch_irq+0x34/0x44
> generic_handle_arch_irq from __irq_svc+0x8c/0xd0
> Exception stack(0xc1301f10 to 0xc1301f58)
> ...
> __irq_svc from default_idle_call+0x20/0x2c4
> default_idle_call from do_idle+0x244/0x2b4
> do_idle from cpu_startup_entry+0x28/0x2c
> cpu_startup_entry from rest_init+0xec/0x190
> rest_init from arch_post_acpi_subsys_init+0x0/0x8
> Code: e591300c e593402c f57ff04f e591300c (e593903c)
> ---[ end trace 0000000000000000 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> CPU2: stopping
>
>
> It looks like the panfrost interrupts must somehow be synchronized with
> turning the power off, as has already been discussed. Let me know if you
> want me to test any patch.
>
The new series containing the whole interrupts sync code is almost ready;
I'm currently testing it on my machines here.
I should be able to send it today or tomorrow.
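
For context, one of the options discussed earlier in the thread (have the
handler skip all register access when the GPU is supposed to be OFF) can be
modelled in plain userspace C roughly as below. This is only an illustrative
sketch and not the code from the series: every name in it (fake_gpu,
fake_irq_handler, fake_power_off, ...) is hypothetical, and the "register
read" just counts accesses that would have hit an unpowered block instead of
faulting.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; these are NOT the panfrost structures. */
struct fake_gpu {
	atomic_bool suspended;     /* set before the power-off sequence starts */
	bool powered;              /* models whether the register block is usable */
	uint32_t int_stat;         /* models a pending-interrupt status register */
	unsigned int bad_accesses; /* reads attempted while the block was off */
};

enum fake_irqreturn { FAKE_IRQ_NONE, FAKE_IRQ_HANDLED };

/*
 * Models a register read. On real hardware an access in the wrong power
 * state could fault; here we only count it so the model stays runnable.
 */
static uint32_t fake_read_int_stat(struct fake_gpu *gpu)
{
	if (!gpu->powered) {
		gpu->bad_accesses++;
		return 0;
	}
	return gpu->int_stat;
}

/* Handler that bails out before any register access if the GPU is off. */
static enum fake_irqreturn fake_irq_handler(struct fake_gpu *gpu)
{
	uint32_t stat;

	if (atomic_load(&gpu->suspended))
		return FAKE_IRQ_NONE;

	stat = fake_read_int_stat(gpu);
	if (!stat)
		return FAKE_IRQ_NONE;

	gpu->int_stat = 0; /* models clearing the handled interrupts */
	return FAKE_IRQ_HANDLED;
}

/*
 * Mark the device as suspended first, then cut power. A real driver would
 * additionally have to deal with a handler that is already running at this
 * point; that part is deliberately left out of this model.
 */
static void fake_power_off(struct fake_gpu *gpu)
{
	atomic_store(&gpu->suspended, true);
	gpu->powered = false;
}

/* Restore power before allowing the handler to touch registers again. */
static void fake_power_on(struct fake_gpu *gpu)
{
	gpu->powered = true;
	atomic_store(&gpu->suspended, false);
}
```

The ordering is the point of the sketch: the flag is set before power is
removed and cleared only after power is restored, so a late interrupt sees
"suspended" and returns without reading anything.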
Cheers,
Angelo