linux-kernel - Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <31f04dca-f7b7-e899-07ee-8c5f2dd55494@arm.com>
Date:   Mon, 17 Jul 2023 09:59:28 +0100
From:   Steven Price <steven.price@....com>
To:     Boris Brezillon <boris.brezillon@...labora.com>
Cc:     Dmitry Osipenko <dmitry.osipenko@...labora.com>,
        Rob Herring <robh@...nel.org>, dri-devel@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org, kernel@...labora.com
Subject: Re: [PATCH v1] drm/panfrost: Sync IRQ by job's timeout handler

On 17/07/2023 09:49, Boris Brezillon wrote:
> On Mon, 17 Jul 2023 09:06:56 +0100
> Steven Price <steven.price@....com> wrote:
> 
>> On 17/07/2023 08:49, Boris Brezillon wrote:
>>> On Mon, 17 Jul 2023 10:20:02 +0300
>>> Dmitry Osipenko <dmitry.osipenko@...labora.com> wrote:
>>>   
>>>> Hi,
>>>>
>>>> On 7/17/23 10:05, Boris Brezillon wrote:  
>>>>> Hi Dmitry,
>>>>>
>>>>> On Mon, 17 Jul 2023 09:52:54 +0300
>>>>> Dmitry Osipenko <dmitry.osipenko@...labora.com> wrote:
>>>>>     
>>>>>> Panfrost IRQ handler may stuck for a long time, for example this happens
>>>>>> when there is a bad HDMI connection and HDMI handler takes a long time to
>>>>>> finish processing, holding Panfrost. Make Panfrost's job timeout handler
>>>>>> to sync IRQ before checking fence signal status in order to prevent
>>>>>> spurious job timeouts due to a slow IRQ processing.    
>>>>>
>>>>> Feels like the problem should be fixed in the HDMI encoder driver
>>>>> instead, so it doesn't stall the whole system when processing its
>>>>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the
>>>>> job timeout path to flush IRQs is a good strategy.    
>>>>
>>>> The syncing is necessary to have for correctness regardless of whether
>>>> it's HDMI problem or something else, there could be other reasons for
>>>> CPU to delay IRQ processing. It's wrong to say that hw is hung, while
>>>> it's not.  
>>>
>>> Well, hardware is effectively hung, if not indefinitely, at least
>>> temporarily. All you do here is block in the timeout handler path
>>> waiting for the GPU interrupt handlers to finish, handler that's
>>> probably waiting in the queue, because the raw HDMI handler is blocking
>>> it somehow. So, in the end, you might just be delaying the time of HWR a
>>> bit more. I know it's not GPU's fault in that case, and the job could
>>> have finished in time if the HDMI encoder hadn't stall the interrupt
>>> handling pipeline, but I'm not sure we should care for that specific
>>> situation. And more importantly, if it took more than 500ms to get a
>>> frame rendered (or, in that case, to get the event that a frame is
>>> rendered), you already lost, so I'm not sure correctness matters:
>>> rendering didn't make it in time, and the watchdog kicked in to try and
>>> unblock the situation. Feels like we're just papering over an HDMI
>>> encoder driver bug here, really.  
>>
>> TLDR; I don't see any major downsides and it stops the GPU getting the 
>> blame for something that isn't its fault.
> 
> True, but doing that will also give the impression that things run fine,
> but very slowly, which would put the blame on the userspace driver :P.

Maybe I'm tainted by years of the kernel driver getting the blame
because it was the one that printed the message ;p

>>
>> I guess the question is whether panfrost should work on a system which 
>> has terrible IRQ latency. At the moment we have a synchronize_irq() call 
>> in panfrost_reset() which effectively does the same thing, but with all 
>> the overhead/spew of resetting the GPU.
> 
> Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is
> mostly here to make sure there's no race between the interrupt
> handler and the reset logic (we mask interrupts, and then synchronize,
> guaranteeing that the interrupt handler won't be running after that
> point), and it happens after we've printed the error message, so the
> user knows something was blocked at least.

Yes the synchronize_irq() in panfrost_reset() is there to avoid a real
race - but it has the side effect of flushing out the IRQ and therefore
the job gets completed successfully. And in the high IRQ latency
situation makes the actual reset redundant.

>>
>> Of course in the case Dmitry is actually talking about - it does seem 
>> like the HDMI encoder has a bug which needs fixing. There are plenty of 
>> other things that will break if IRQ latency gets that bad.
> 
> Yes, that's my point. The GPU driver is the only one to complain right
> now, but the HDMI encoder behavior could be impacting other parts of
> the system. Silently ignoring those weirdnesses sounds like a terrible
> idea.

Agreed - but making it look like a GPU driver bug isn't good either.

>>
>> I do wonder if it makes sense to only synchronize when it's needed, 
>> e.g.:
>>
>> ----8<---
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index dbc597ab46fb..d96266b74e5c 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>>  	if (dma_fence_is_signaled(job->done_fence))
>>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>>  
>> +	/* Synchronize with the IRQ handler in case the IRQ latency is bad */
>> +	synchronize_irq(pfdev->js->irq);
>> +	/* Recheck if the job is now complete */
>> +	if (dma_fence_is_signaled(job->done_fence))
>> +		return DRM_GPU_SCHED_STAT_NOMINAL;
>> +
>>  	dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>>  		js,
>>  		job_read(pfdev, JS_CONFIG(js)),
>> ----8<---
>>
>> I don't have any data as to how often we hit the case where the DRM 
>> scheduler calls the timeout but we've already signalled - so the extra 
>> check might be overkill.
> 
> Right, it's not so much about the overhead of the synchronize_irq()
> call (even though my first reply complained about that :-)), but more
> about silently ignoring system misbehaviors. So I guess I'd be fine with
> a version printing a dev_warn("Unexpectedly high interrupt latency")
> when synchronize_irq() unblocks the situation, which means you'd still
> have to do it in two steps.

I like this idea - it still warns so it's obvious there's something
wrong with the system, and it makes it clear it's not the GPU's fault.

Steve