Message-ID: <2365360.ElGaqSPkdT@workhorse>
Date: Tue, 29 Apr 2025 12:05:28 +0200
From: Nicolas Frattaroli <nicolas.frattaroli@...labora.com>
To: Rob Herring <robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>,
 Conor Dooley <conor+dt@...nel.org>, Heiko Stuebner <heiko@...ech.de>,
 Oded Gabbay <ogabbay@...nel.org>, Jonathan Corbet <corbet@....net>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Sumit Semwal <sumit.semwal@...aro.org>,
 Christian König <christian.koenig@....com>,
 Sebastian Reichel <sebastian.reichel@...labora.com>,
 Jeffrey Hugo <quic_jhugo@...cinc.com>, linux-rockchip@...ts.infradead.org,
 Tomeu Vizoso <tomeu@...euvizoso.net>
Cc: devicetree@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
 linux-rockchip@...ts.infradead.org, linux-kernel@...r.kernel.org,
 dri-devel@...ts.freedesktop.org, linux-doc@...r.kernel.org,
 linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
 Tomeu Vizoso <tomeu@...euvizoso.net>
Subject: Re: [PATCH v2 6/7] accel/rocket: Add job submission IOCTL

On Tuesday, 25 February 2025 08:55:52 Central European Summer Time Tomeu Vizoso wrote:
> Use the DRM GPU scheduler infrastructure, with a scheduler for each
> core.
> 
> Userspace can decide to have a series of tasks executed sequentially
> on the same core, so SRAM locality can be taken advantage of.
> 
> The job submission code was initially based on Panfrost.
> 
> v2:
> - Remove hardcoded number of cores
> - Misc. style fixes (Jeffrey Hugo)
> - Repack IOCTL struct (Jeffrey Hugo)
> 
> Signed-off-by: Tomeu Vizoso <tomeu@...euvizoso.net>
> ---
>  drivers/accel/rocket/Makefile        |   3 +-
>  drivers/accel/rocket/rocket_core.c   |   6 +
>  drivers/accel/rocket/rocket_core.h   |  14 +
>  drivers/accel/rocket/rocket_device.c |   2 +
>  drivers/accel/rocket/rocket_device.h |   2 +
>  drivers/accel/rocket/rocket_drv.c    |  15 +
>  drivers/accel/rocket/rocket_drv.h    |   4 +
>  drivers/accel/rocket/rocket_job.c    | 710 +++++++++++++++++++++++++++++++++++
>  drivers/accel/rocket/rocket_job.h    |  50 +++
>  include/uapi/drm/rocket_accel.h      |  55 +++
>  10 files changed, 860 insertions(+), 1 deletion(-)

Hi Tomeu,

some more power management things I've noticed here.

>
> [...]
>
> diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..25b31f28e932aaee86173b9a0962932c9c640c03
> --- /dev/null
> +++ b/drivers/accel/rocket/rocket_job.c
>
> [...]
>
> +static void
> +rocket_reset(struct rocket_core *core, struct drm_sched_job *bad)
> +{
> +	bool cookie;
> +
> +	if (!atomic_read(&core->reset.pending))
> +		return;
> +
> +	/*
> +	 * Stop the scheduler.
> +	 *
> +	 * FIXME: We temporarily get out of the dma_fence_signalling section
> +	 * because the cleanup path generates lockdep splats when taking locks
> +	 * to release job resources. We should rework the code to follow this
> +	 * pattern:
> +	 *
> +	 *	try_lock
> +	 *	if (locked)
> +	 *		release
> +	 *	else
> +	 *		schedule_work_to_release_later
> +	 */
> +	drm_sched_stop(&core->sched, bad);
> +
> +	cookie = dma_fence_begin_signalling();
> +
> +	if (bad)
> +		drm_sched_increase_karma(bad);
> +
> +	/*
> +	 * Mask job interrupts and synchronize to make sure we won't be
> +	 * interrupted during our reset.
> +	 */
> +	rocket_write(core, REG_PC_INTERRUPT_MASK, 0x0);
> +	synchronize_irq(core->irq);
> +
> +	/* Handle the remaining interrupts before we reset. */
> +	rocket_job_handle_irq(core);
> +
> +	/*
> +	 * Remaining interrupts have been handled, but we might still have
> +	 * stuck jobs. Let's make sure the PM counters stay balanced by
> +	 * manually calling pm_runtime_put_noidle() and
> +	 * rocket_devfreq_record_idle() for each stuck job.
> +	 * Let's also make sure the cycle counting register's refcnt is
> +	 * kept balanced to prevent it from running forever.
> +	 */
> +	spin_lock(&core->job_lock);
> +	if (core->in_flight_job)
> +		pm_runtime_put_noidle(core->dev);

This particular line of code caused me issues when I was experimenting with the
driver on RK3576. My current situation is that every job that gets submitted
times out because some IRQs never arrive, which is beside the point, but it
did expose a power management bug here that I suspect may affect RK3588 as
well. After about three timeouts, we reset again and hang in the interrupt
mask write just a few lines above. This is because we somehow managed to get
into a situation where this function is called with pclk disabled, and this
manual balancing act may be part of it. Not doing the manual balancing here at
all doesn't fix it; I had to explicitly wrap the reset function in
pm_runtime_get_sync() and pm_runtime_put_noidle() to avoid having this
function execute with disabled clocks. That seems like the "wrong solution"
though, because it means something already powered off our hardware too
eagerly before we got the reset done.
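
For reference, here is roughly the hack I ended up with, as a sketch against
the function quoted above (illustration only, not a proposed fix):

	@@ rocket_reset() @@
	 	if (!atomic_read(&core->reset.pending))
	 		return;
	 
	+	/* Hold a runtime PM reference for the whole reset, so pclk
	+	 * stays enabled across the REG_PC_INTERRUPT_MASK write and
	+	 * everything after it.
	+	 */
	+	pm_runtime_get_sync(core->dev);
	+
	 	drm_sched_stop(&core->sched, bad);
	[...]
	 	dma_fence_end_signalling(cookie);
	+
	+	/* Balance the get above without re-idling; the forced
	+	 * suspend/resume has already power cycled the device.
	+	 */
	+	pm_runtime_put_noidle(core->dev);
	 }

It papers over the hang for me, but as said, it probably just hides whoever
is dropping the last PM reference too early.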

My gut instinct tells me that there's a bug here with multiple timed-out jobs
being in flight, but I'm not 100% sure. Maybe you'll be able to reproduce it
on an RK3588 by artificially forcing all your jobs to time out, either with a
very low timeout value or by masking the interrupts; see the snippet below.
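
Something like this, abusing the interrupt mask register from this very
patch, should do it if I'm reading the code right (untested; placement
assumed to be wherever the driver normally unmasks job interrupts):

	/* repro hack: keep job interrupts masked so no job ever
	 * completes and every submission runs into the scheduler
	 * timeout */
	rocket_write(core, REG_PC_INTERRUPT_MASK, 0x0);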

> +
> +	core->in_flight_job = NULL;
> +	spin_unlock(&core->job_lock);
> +
> +	/* Proceed with reset now. */
> +	pm_runtime_force_suspend(core->dev);
> +	pm_runtime_force_resume(core->dev);

Do we need to guarantee some sort of exclusivity here so that nothing tries to
concurrently use the device while we're forcing it off and back on again? I'm
unfamiliar with the drm fence stuff, so I may be overthinking this.
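
To make the question more concrete, I naively would have expected something
like this (lock name made up, and quite possibly redundant with guarantees
the scheduler or the fence signalling sections already give us):

	/* hypothetical: a lock also taken by the submit and IRQ paths,
	 * so nothing touches the core while it is being power cycled */
	mutex_lock(&core->reset_lock);
	pm_runtime_force_suspend(core->dev);
	pm_runtime_force_resume(core->dev);
	mutex_unlock(&core->reset_lock);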

> +
> +	/* GPU has been reset, we can clear the reset pending bit. */
> +	atomic_set(&core->reset.pending, 0);

Nitpick: should probably be "NPU" ;)

> +
> +	/*
> +	 * Now resubmit jobs that were previously queued but didn't have a
> +	 * chance to finish.
> +	 * FIXME: We temporarily get out of the DMA fence signalling section
> +	 * while resubmitting jobs because the job submission logic will
> +	 * allocate memory with the GFP_KERNEL flag which can trigger memory
> +	 * reclaim and expose a lock ordering issue.
> +	 */
> +	dma_fence_end_signalling(cookie);
> +	drm_sched_resubmit_jobs(&core->sched);
> +	cookie = dma_fence_begin_signalling();
> +
> +	/* Restart the scheduler */
> +	drm_sched_start(&core->sched, 0);
> +
> +	dma_fence_end_signalling(cookie);
> +}
>
> [...]
>

Kind regards,
Nicolas Frattaroli


