linux-kernel - Re: [RFC PATCH] drm/sched: Add new KUnit test suite for concurrent job submission

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <3cdc7fda0ee4c777d7e20504503522844f0124ea.camel@mailbox.org>
Date: Wed, 28 Jan 2026 17:17:02 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Marco Pagani <marco.pagani@...ux.dev>, tursulin@...ulin.net
Cc: airlied@...il.com, ckoenig.leichtzumerken@...il.com, dakr@...nel.org, 
 dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org, 
 maarten.lankhorst@...ux.intel.com, marpagan@...hat.com,
 matthew.brost@...el.com,  mripard@...nel.org, phasta@...nel.org,
 simona@...ll.ch, tzimmermann@...e.de
Subject: Re: [RFC PATCH] drm/sched: Add new KUnit test suite for concurrent
 job submission

On Wed, 2026-01-28 at 12:39 +0100, Marco Pagani wrote:
> 
> On 22/01/2026 10:51, Tvrtko Ursulin wrote:
> > 
> > On 20/01/2026 20:52, Marco Pagani wrote:
> > > Add a new test suite to simulate concurrent job submissions from userspace.
> > > The suite includes a basic test case where each worker submits a single
> > > job, and a more advanced case involving the submission of multiple jobs.
> > 
> > New test coverage is welcome!
> 
> Hi Tvrtko, Philip, and thank you.
> 
> > But as Philipp has said some more context would be beneficial. Like are 
> > you trying to hit a bug, or extend later with something which will hit a 
> > bug and then you will propose improvements? Or simply improving the
> > coverage?
> 
> Sure, I'll extend the commit message to be more descriptive in the next
> version.
>  
> > If it is about some race I would maybe consider putting this into a new 
> > tests_races.c. I actually have this file locally and some unfinished 
> > test cases already, although it is unclear when I will be happy with 
> > them to post. But if the test is simply about adding coverage it is fine 
> > to live in tests_basic.c.
> 
> The general idea is to extend the suite with some initial tests that stress
> concurrency to spot race conditions. Having these initial tests grouped together
> with future ones in a new tests_races.c file makes perfect sense to me.


I am not so convinced of a separate file.

All the drm_sched tests are there to ensure the soundness and
robustness of the scheduler. How is a race special? What would
tests_basic.c be for – checking for bugs that are not races? And
races.c would be full of tests that are threaded?
(questions more directed at Tvrtko)

>From my POV having this little test in tests_basic.c is perfectly fine.
Having multiple entities per scheduler, associated with multiple task,
*is* basic functionality of the scheduler.


> 
> > > Signed-off-by: Marco Pagani <marpagan@...hat.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/tests/tests_basic.c | 175 ++++++++++++++++++
> > >   1 file changed, 175 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/tests/tests_basic.c b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> > > index 82a41a456b0a..7c25bcbbe7c9 100644
> > > --- a/drivers/gpu/drm/scheduler/tests/tests_basic.c
> > > +++ b/drivers/gpu/drm/scheduler/tests/tests_basic.c
> > > @@ -2,6 +2,7 @@
> > >   /* Copyright (c) 2025 Valve Corporation */
> > >   
> > >   #include <linux/delay.h>
> > > +#include <linux/completion.h>
> > >   
> > >   #include "sched_tests.h"
> > >   
> > > @@ -235,6 +236,179 @@ static void drm_sched_basic_cancel(struct kunit *test)
> > >   	KUNIT_ASSERT_EQ(test, job->hw_fence.error, -ECANCELED);
> > >   }
> > >   
> > > +struct sched_concurrent_test_context {
> > > +	struct drm_mock_scheduler *sched;
> > > +	struct workqueue_struct *sub_wq;
> > > +	struct completion wait_go;
> > > +};
> > > +
> > > +KUNIT_DEFINE_ACTION_WRAPPER(destroy_workqueue_wrap, destroy_workqueue,
> > > +			    struct workqueue_struct *);
> > > +
> > > +KUNIT_DEFINE_ACTION_WRAPPER(drm_mock_sched_fini_wrap, drm_mock_sched_fini,
> > > +			    struct drm_mock_scheduler *);
> > > +
> > > +KUNIT_DEFINE_ACTION_WRAPPER(drm_mock_sched_entity_free_wrap, drm_mock_sched_entity_free,
> > > +			    struct drm_mock_sched_entity *);
> > > +
> > > +static int drm_sched_concurrent_init(struct kunit *test)
> > > +{
> > > +	struct sched_concurrent_test_context *ctx;
> > > +	int ret;
> > > +
> > > +	ctx = kunit_kzalloc(test, sizeof(*ctx), GFP_KERNEL);
> > > +
> > > +	init_completion(&ctx->wait_go);
> > > +
> > > +	ctx->sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
> > > +
> > > +	ret = kunit_add_action_or_reset(test, drm_mock_sched_fini_wrap, ctx->sched);
> > > +	KUNIT_ASSERT_EQ(test, ret, 0);
> > > +
> > > +	/* Use an unbounded workqueue to maximize job submission concurrency */
> > > +	ctx->sub_wq = alloc_workqueue("drm-sched-submitters-wq", WQ_UNBOUND,
> > > +				      WQ_UNBOUND_MAX_ACTIVE);
> > > +	KUNIT_ASSERT_NOT_NULL(test, ctx->sub_wq);
> > > +
> > > +	ret = kunit_add_action_or_reset(test, destroy_workqueue_wrap, ctx->sub_wq);
> > > +	KUNIT_ASSERT_EQ(test, ret, 0);
> > > +
> > > +	test->priv = ctx;
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +struct drm_sched_concurrent_params {
> > > +	const char *description;
> > > +	unsigned int job_base_us;
> > > +	unsigned int num_jobs;
> > > +	unsigned int num_subs;
> > > +};
> > > +
> > > +static const struct drm_sched_concurrent_params drm_sched_concurrent_cases[] = {
> > > +	{
> > > +		.description = "Concurrently submit a single job in a single entity",
> > > +		.job_base_us = 1000,
> > > +		.num_jobs = 1,
> > > +		.num_subs = 32,
> > > +	},
> > 
> > Why is submission from a single thread interesting if it is already covered?
> 
> These two initial parameter sets cover only concurrent submission:
> multiple submitters, single job / multiple submitters, multiple jobs.
> 
> > > +	{
> > > +		.description = "Concurrently submit multiple jobs in a single entity",
> > > +		.job_base_us = 1000,
> > > +		.num_jobs = 10,
> > > +		.num_subs = 64,
> > > +	},
> > > +};
> > > +
> > > +static void
> > > +drm_sched_concurrent_desc(const struct drm_sched_concurrent_params *params, char *desc)
> > > +{
> > > +	strscpy(desc, params->description, KUNIT_PARAM_DESC_SIZE);
> > > +}
> > > +
> > > +KUNIT_ARRAY_PARAM(drm_sched_concurrent, drm_sched_concurrent_cases, drm_sched_concurrent_desc);
> > > +
> > > +struct submitter_data {
> > > +	struct work_struct work;
> > > +	struct sched_concurrent_test_context *ctx;
> > > +	struct drm_mock_sched_entity *entity;
> > > +	struct drm_mock_sched_job **jobs;
> > > +	struct kunit *test;
> > > +	unsigned int id;
> > > +	bool timedout;
> > > +};
> > > +
> > > +static void drm_sched_submitter_worker(struct work_struct *work)
> > > +{
> > > +	const struct drm_sched_concurrent_params *params;
> > > +	struct sched_concurrent_test_context *ctx;
> > > +	struct submitter_data *sub_data;
> > > +	unsigned int i, duration_us;
> > > +	unsigned long timeout_jiffies;
> > > +	bool done;
> > > +
> > > +	sub_data = container_of(work, struct submitter_data, work);
> > > +	ctx = sub_data->ctx;
> > > +	params = sub_data->test->param_value;
> > > +
> > > +	wait_for_completion(&ctx->wait_go);
> > > +
> > > +	for (i = 0; i < params->num_jobs; i++) {
> > > +		duration_us = params->job_base_us + (sub_data->id * 10);
> > 
> > Why is job duration dependent by the submitter id?
> 
> Just a simple way to have a deterministic distribution of durations.
> I can change it if it doesn't fit.
> 
> > Would it be feasiable to not use auto-completing jobs and instead 
> > advance the timeline manually? Given how the premise of the test seems 
> > to be about concurrent submission it sounds plausible that what happens 
> > after submission maybe isn't very relevant.
> 
> Good idea! I'll run some experiments and see if it works.
> 
> > > +		drm_mock_sched_job_set_duration_us(sub_data->jobs[i], duration_us);
> > > +		drm_mock_sched_job_submit(sub_data->jobs[i]);
> > 
> > On a related note, one interesting thing to add coverage for later is 
> > multi-threaded submit of multiple jobs against a single entity. But it 
> > is not an immediate need. Just mentioning it as something interesting.
> 
> Currently, the test configures each submitter to submit multiple jobs
> against its own dedicated entity. I considered adding a test case for
> submitting multiple jobs against multiple entities, but I decided to
> leave it for the future.
> 
> > It would mean open coding drm_mock_sched_job_submit() as 
> > drm_sched_job_arm() and drm_sched_entity_push_job() and sticking some 
> > delay in between so two threads have the chance to interleave. Mock
> > scheduler does not handle it today, neither does the scheduler itself 
> > who punts responsibility to callers. So adding a test and making the 
> > mock scheduler handle that properly would serve as an example on how 
> > scheduler must be used. Or what can go bad if it isn't.
> 
> Do you mean having multiple (k)threads submitting against the same entity?
> Would that be used to model a multithread application that uses multiple queues?

Submitters to entities must always ensure their own synchronization
before pushing to it. And afterwards it's the scheduler's parallelism
that counts.
I don't see how multiple threads on an entity are worthwhile testing.
It's beyond our scope.


Thx.
P.


> 
> > > +	}
> > > +
> > > +	timeout_jiffies = usecs_to_jiffies(params->job_base_us * params->num_subs *
> > > +					   params->num_jobs * 10);
> > 
> > The timeout calculation could use a comment. You are using num_subs * 10 
> > to match the duratiot_us above being id * 10? With logic of calculating 
> > a pessimistic timeout?
> > 
> > Have you tried it with qemu to check if it is pessimistic enough?
> 
> I'll double check on that.
>  
> > > +	for (i = 0; i < params->num_jobs; i++) {
> > > +		done = drm_mock_sched_job_wait_finished(sub_data->jobs[i],
> > > +							timeout_jiffies);
> > > +		if (!done)
> > > +			sub_data->timedout = true;
> > > +	}
> > 
> > Technically you only need to wait on the last job but it is passable 
> > like this too.
> > 
> > Also, is it important for the worker to wait for completion or the main 
> > thread could simply wait for everything? Maybe that would simplify things.
> 
> I would say they serve different purposes. The completion is used to pause
> all worker threads until they are all created to ensure they start submitting
> jobs together to maximize concurrency.
>  
> > Manual timeline advance and this combined would mean the workers only 
> > submit jobs, while the main thread simply does 
> > drm_mock_sched_advance(sched, num_subs * num_jobs) and waits for last 
> > job from each submitter to finish.
> > 
> > Again, auto-completion and timeout reporting is something I do not 
> > immediate see is relevant for multi-threaded submission testing.
> > 
> > Maybe if you want to test the mock scheduler itself it could be, but 
> > then I would add it as separate entry in drm_sched_concurrent_cases[]. 
> > Like maybe have a flag/boolean "auto-complete jobs". So one without and 
> > one with that set.
> 
> I think it's a good idea and I'll experiment to see if it works.
>  
> > Other than that it looks tidy and was easy to follow. Only thing which 
> > slightly got me was the term "subs" since I don't intuitively associate 
> > it with a submitter but, well, a sub entity of some kind. Might be worth 
> > renaming that to submitter(s), or even dropping the prefix in some cases 
> > might be feasible (like sub(s)_data).
> 
> Agreed. I'll rename "subs" for better clarity.
> 
> > Regards,
> > 
> > Tvrtko
> > 
> 
> Cheers,
> Marco