linux-kernel - Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error

Message-ID: <73a1dc37-f862-f908-4c9f-64e256283857@arm.com>
Date:   Wed, 20 May 2020 15:02:15 +0100
From:   Steven Price <steven.price@....com>
To:     Dinghao Liu <dinghao.liu@....edu.cn>, kjlu@....edu
Cc:     Rob Herring <robh@...nel.org>,
        Tomeu Vizoso <tomeu.vizoso@...labora.com>,
        Alyssa Rosenzweig <alyssa.rosenzweig@...labora.com>,
        David Airlie <airlied@...ux.ie>,
        Daniel Vetter <daniel@...ll.ch>,
        dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error

On 20/05/2020 12:05, Dinghao Liu wrote:
> pm_runtime_get_sync() increments the runtime PM usage counter even
> the call returns an error code. Thus a pairing decrement is needed
> on the error handling path to keep the counter balanced.
> 
> Signed-off-by: Dinghao Liu <dinghao.liu@....edu.cn>

Actually I think we have the opposite problem. To be honest we don't 
handle this situation very well. By the time panfrost_job_hw_submit() is 
called the job has already been added to the pfdev->jobs array, so it's 
considered submitted even if it never actually lands on the hardware. So 
in the case of this function bailing out early we will then (eventually) 
hit a timeout and trigger a GPU reset.

panfrost_job_timedout() iterates through the pfdev->jobs array and calls
pm_runtime_put_noidle() for each job it finds. So there's no imbalance
here that I can see.
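
To make that argument concrete, here is a rough user-space model of the
accounting. It is only a sketch: struct pm_model and the model_*()
helpers are hypothetical stand-ins, not the real panfrost or runtime PM
code, and the usage counter is reduced to a plain int.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_JOB_SLOTS 3

/* Hypothetical stand-in for the device state (not the real pfdev). */
struct pm_model {
	int pm_usage;              /* models the runtime PM usage counter */
	bool jobs[NUM_JOB_SLOTS];  /* models pfdev->jobs[]: true = occupied */
};

/* Models pm_runtime_get_sync(): the counter is bumped even on failure. */
static int model_get_sync(struct pm_model *d, bool fail)
{
	d->pm_usage++;
	return fail ? -1 : 0;
}

/* Models submission: the job is in the array before the hw submit runs. */
static void model_submit(struct pm_model *d, int js, bool pm_fails)
{
	d->jobs[js] = true;
	if (model_get_sync(d, pm_fails) < 0)
		return;  /* early bail-out: job never lands on the hardware */
	/* ... the real driver would program the job slot here ... */
}

/* Models the timeout handler: one put per job still in the array. */
static void model_timedout(struct pm_model *d)
{
	for (int i = 0; i < NUM_JOB_SLOTS; i++) {
		if (d->jobs[i]) {
			d->pm_usage--;  /* models pm_runtime_put_noidle() */
			d->jobs[i] = false;
		}
	}
}
```

In this model a submit whose get fails leaves pm_usage at 1, and the
later timeout brings it back to 0, so an extra put on the error path
would take the count one too low.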

Have you actually observed the situation where pm_runtime_get_sync() 
returns a failure?

HOWEVER, it appears that by bailing out early the call to
panfrost_devfreq_record_busy() is never made, which as far as I can see
means that there may be an extra call to panfrost_devfreq_record_idle()
when the jobs have timed out, which could underflow the counter.

But equally looking at panfrost_job_timedout(), we only call 
panfrost_devfreq_record_idle() *once* even though multiple jobs might be 
processed.
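
The two devfreq problems can be illustrated with a similar toy model.
Again this is just a sketch under assumptions: struct freq_model,
submit() and timedout() are made up for illustration, the busy/idle
bookkeeping is reduced to a single int, and the "patched" flag selects
the ordering proposed in this mail rather than anything that has been
tested on hardware.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_JOB_SLOTS 3

/* Hypothetical stand-in for the device state (not the real pfdev). */
struct freq_model {
	int busy_count;            /* models the devfreq busy counter */
	bool jobs[NUM_JOB_SLOTS];  /* models pfdev->jobs[]: true = occupied */
};

/* Models submission; pm_fails models pm_runtime_get_sync() failing. */
static void submit(struct freq_model *d, int js, bool pm_fails, bool patched)
{
	d->jobs[js] = true;
	if (patched)
		d->busy_count++;  /* patched: record busy before the PM call */
	if (pm_fails)
		return;           /* early bail-out */
	if (!patched)
		d->busy_count++;  /* original: record busy only on success */
}

/* Models the timeout handler's idle accounting. */
static void timedout(struct freq_model *d, bool patched)
{
	for (int i = 0; i < NUM_JOB_SLOTS; i++) {
		if (d->jobs[i]) {
			if (patched)
				d->busy_count--;  /* patched: idle per job */
			d->jobs[i] = false;
		}
	}
	if (!patched)
		d->busy_count--;  /* original: idle recorded exactly once */
}
```

With the original ordering, a single job whose PM get fails ends at -1
after the timeout, while two successfully submitted jobs that time out
together end at 1. With the patched ordering both cases end at 0.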

There's a completely untested patch below which in theory should fix that...

Steve

----8<---
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 7914b1570841..f9519afca29d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
  	u64 jc_head = job->jc;
  	int ret;

+	panfrost_devfreq_record_busy(pfdev);
+
  	ret = pm_runtime_get_sync(pfdev->dev);
  	if (ret < 0)
  		return;
@@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
  	}

  	cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
-	panfrost_devfreq_record_busy(pfdev);

  	job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
  	job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
@@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
  		if (pfdev->jobs[i]) {
  			pm_runtime_put_noidle(pfdev->dev);
+			panfrost_devfreq_record_idle(pfdev);
  			pfdev->jobs[i] = NULL;
  		}
  	}
  	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);

-	panfrost_devfreq_record_idle(pfdev);
  	panfrost_device_reset(pfdev);

  	for (i = 0; i < NUM_JOB_SLOTS; i++)