Message-ID: <aOQaK--p1-6TEun5@gpd4>
Date: Mon, 6 Oct 2025 21:36:11 +0200
From: Andrea Righi <arighi@...dia.com>
To: Ryan Newton <rrnewton@...il.com>
Cc: linux-kernel@...r.kernel.org, sched-ext@...ts.linux.dev, tj@...nel.org,
	newton@...a.com
Subject: Re: [PATCH v3 2/2] sched_ext: Add a selftest for scx_bpf_dsq_peek

Hi Ryan,

On Mon, Oct 06, 2025 at 01:04:03PM -0400, Ryan Newton wrote:
> From: Ryan Newton <newton@...a.com>
> 
> Perform the most basic unit test: make sure an empty queue peeks as
> empty, and when we put one element in the queue, make sure peek returns
> that element.
> 
> However, even this simple test is a little complicated by the different
> behavior of scx_bpf_dsq_insert in different calling contexts:
>  - insert is for direct dispatch in enqueue
>  - insert is delayed when called from select_cpu
> 
> In this case we split the insert and the peek that verifies the
> result between enqueue/dispatch. As a second phase, we stress test by
> performing many peeks on an array of user DSQs.
> 
> Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
> empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
> the buffered insert. Unfortunately, this is not viable within the
> enqueue path, as it attempts a voluntary context switch within an RCU
> read-side critical section.

This test is a bit difficult to follow and it's completely hammering the VM
where I'm running it. I think we should avoid adding selftests that are too
heavy, as they might trigger false positives and lead distro maintainers to
ignore or disable them in their CI/CD.

Maybe we should go with something simpler to test the basic functionality
of this new API and validate the expected behavior.

For instance, most BPF schedulers using this API will likely implement
something like the following in their ops.dispatch():

	u64 min_vtime = ULLONG_MAX;
	u64 dsq_id, target_dsq = 0;

	bpf_for(dsq_id, 0, MAX_DSQ) {
		struct task_struct *p = __COMPAT_scx_bpf_dsq_peek(dsq_id);

		if (p && bpf_cpumask_test_cpu(cpu, p->cpus_ptr) &&
		    p->scx.dsq_vtime < min_vtime) {
			min_vtime = p->scx.dsq_vtime;
			target_dsq = dsq_id;
		}
	}

	if (min_vtime < ULLONG_MAX)
		scx_bpf_dsq_move_to_local(target_dsq);

I think this is the specific use case we want to make sure doesn't break in
the future.

A way to validate this could be to use a global counter (vtime_enq): every
time a task is enqueued, increment vtime_enq, use its value with
scx_bpf_dsq_insert_vtime() and insert the task into a DSQ from a pool of
DSQs (up to MAX_DSQ).

Then in ops.dispatch() we can always consume the task with the minimum
vtime, using the logic above, and verify that min_vtime is always
incremented by one, or something similar.
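
Something along these lines, just to sketch the idea (vtime_enq, vtime_deq,
order_violation and MAX_DSQ are placeholders, the DSQs 0..MAX_DSQ-1 are
assumed to be created in ops.init(), and a real test would still need to
deal with concurrent enqueues/dispatches):

	/* Monotonic counters used to validate dispatch ordering. */
	u64 vtime_enq = 1;	/* incremented on every enqueue */
	u64 vtime_deq;		/* last vtime consumed in ops.dispatch() */
	bool order_violation;	/* set if tasks are consumed out of order */

	void BPF_STRUCT_OPS(peek_dsq_enqueue, struct task_struct *p, u64 enq_flags)
	{
		u64 vtime = __sync_fetch_and_add(&vtime_enq, 1);

		/* Spread tasks across the DSQ pool, tagging each with its vtime. */
		scx_bpf_dsq_insert_vtime(p, vtime % MAX_DSQ, SCX_SLICE_DFL,
					 vtime, enq_flags);
	}

	void BPF_STRUCT_OPS(peek_dsq_dispatch, s32 cpu, struct task_struct *prev)
	{
		u64 min_vtime = ULLONG_MAX, target_dsq = 0, dsq_id;

		/* Peek every DSQ in the pool and remember the smallest vtime. */
		bpf_for(dsq_id, 0, MAX_DSQ) {
			struct task_struct *p = __COMPAT_scx_bpf_dsq_peek(dsq_id);

			if (p && p->scx.dsq_vtime < min_vtime) {
				min_vtime = p->scx.dsq_vtime;
				target_dsq = dsq_id;
			}
		}

		if (min_vtime == ULLONG_MAX)
			return;

		/*
		 * With a single dispatcher the minimum vtime across the pool
		 * should advance by exactly one at every dispatch.
		 */
		if (vtime_deq && min_vtime != vtime_deq + 1)
			order_violation = true;
		vtime_deq = min_vtime;

		scx_bpf_dsq_move_to_local(target_dsq);
	}

Userspace would then just check order_violation (or we could turn the check
into a scx_bpf_error()) after running some load.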

That said, we can go with your approach for now and improve the selftest
later, you don't have to completely rewrite the test. But I think we should
at least make it a bit more lightweight, maybe by reducing the number of
workers or similar.

Also, a few minor comments below.

> 
> Signed-off-by: Ryan Newton <newton@...a.com>
...
> diff --git a/tools/testing/selftests/sched_ext/peek_dsq.bpf.c b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
> new file mode 100644
> index 000000000000..8d179d4c7efb
> --- /dev/null
> +++ b/tools/testing/selftests/sched_ext/peek_dsq.bpf.c
> @@ -0,0 +1,265 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * A BPF program for testing DSQ operations including create, destroy,
> + * and peek operations. Uses a hybrid approach:
> + * - Syscall program for DSQ lifecycle (create/destroy)
> + * - Struct ops scheduler for task insertion/dequeue testing
> + *
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Ryan Newton <ryan.newton@...m.mit.edu>
> + */
> +
> +#include <scx/common.bpf.h>
> +#include <scx/compat.bpf.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define MAX_SAMPLES 100
> +#define MAX_CPUS 512
> +#define DSQ_POOL_SIZE 8
> +int max_samples = MAX_SAMPLES;
> +int max_cpus = MAX_CPUS;
> +int dsq_pool_size = DSQ_POOL_SIZE;
> +
> +/* Global variables to store test results */
> +int dsq_create_result = -1;
> +int dsq_destroy_result = -1;
> +int dsq_peek_result1 = -1;
> +long dsq_inserted_pid = -1;
> +int insert_test_cpu = -1; /* Set to the cpu that performs the test */
> +long dsq_peek_result2 = -1;
> +long dsq_peek_result2_pid = -1;
> +long dsq_peek_result2_expected = -1;
> +int test_dsq_id = 1234; /* Use a simple ID like create_dsq example */
> +int real_dsq_id = 1235; /* DSQ for normal operation */
> +int enqueue_count = -1;
> +int dispatch_count = -1;
> +int debug_ksym_exists = -1;

Maybe use a bool here.

> +
> +/* DSQ pool for stress testing */
> +int dsq_pool_base_id = 2000;
> +int phase1_complete = -1;
> +int total_peek_attempts = -1;
> +int successful_peeks = -1;
> +
> +/* BPF map for sharing peek results with userspace */
> +struct {
> +	__uint(type, BPF_MAP_TYPE_ARRAY);
> +	__uint(max_entries, MAX_SAMPLES);
> +	__type(key, u32);
> +	__type(value, long);
> +} peek_results SEC(".maps");
> +
> +/* Test if we're actually using the native or compat version */
> +int check_dsq_insert_ksym(void)
> +{
> +	return bpf_ksym_exists(scx_bpf_dsq_insert) ? 1 : 0;
> +}

This helper is unused, we can drop it.

> +
> +int check_dsq_peek_ksym(void)
> +{
> +	return bpf_ksym_exists(scx_bpf_dsq_peek) ? 1 : 0;
> +}

Do we need this helper? I think we can just do:

  debug_ksym_exists = bpf_ksym_exists(scx_bpf_dsq_peek);

And have debug_ksym_exists as bool.

> +
> +static inline int get_random_dsq_id(void)
> +{
> +	u64 time = bpf_ktime_get_ns();
> +
> +	return dsq_pool_base_id + (time % DSQ_POOL_SIZE);
> +}
> +
> +static inline void record_peek_result(long pid)

Does it really need to be inline? It's quite a big function.

> +{
> +	u32 slot_key;
> +	long *slot_pid_ptr;
> +	int ix;
> +
> +	if (pid <= 0)
> +		return;
> +
> +	/* Find an empty slot or one with the same PID */
> +	bpf_for(ix, 0, 10) {
> +		slot_key = (pid + ix) % MAX_SAMPLES;
> +		slot_pid_ptr = bpf_map_lookup_elem(&peek_results, &slot_key);
> +		if (!slot_pid_ptr)
> +			continue;
> +
> +		if (*slot_pid_ptr == -1 || *slot_pid_ptr == pid) {
> +			*slot_pid_ptr = pid;
> +			break;
> +		}
> +	}
> +}
> +
> +/* Scan all DSQs in the pool and try to move a task to local */
> +static inline int scan_dsq_pool(void)

This one is also quite big; we can drop the inline and let the compiler
decide.

> +{
> +	struct task_struct *task;
> +	int moved = 0;
> +	int i;
> +
> +	bpf_for(i, 0, DSQ_POOL_SIZE) {
> +		int dsq_id = dsq_pool_base_id + i;
> +
> +		total_peek_attempts++;
> +
> +		task = __COMPAT_scx_bpf_dsq_peek(dsq_id);
> +		if (task) {
> +			successful_peeks++;
> +			record_peek_result(task->pid);
> +
> +			/* Try to move this task to local */
> +			if (!moved && scx_bpf_dsq_move_to_local(dsq_id) == 0) {
> +				moved = 1;
> +				break;
> +			}
> +		}
> +	}
> +	return moved;
> +}
> +
> +/* Struct_ops scheduler for testing DSQ peek operations */
> +void BPF_STRUCT_OPS(peek_dsq_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_struct *peek_result;
> +	int last_insert_test_cpu, cpu;
> +
> +	enqueue_count++;
> +	cpu = bpf_get_smp_processor_id();
> +	last_insert_test_cpu = __sync_val_compare_and_swap(
> +		&insert_test_cpu, -1, cpu);

This can be in a single line.
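
i.e.:

	last_insert_test_cpu = __sync_val_compare_and_swap(&insert_test_cpu, -1, cpu);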

> +
> +	/* Phase 1: Simple insert-then-peek test (only on first task) */
> +	if (last_insert_test_cpu == -1) {
> +		bpf_printk("peek_dsq_enqueue beginning phase 1 peek test on cpu %d\n", cpu);

No need for the "\n" when you use bpf_printk().

> +
> +		/* Test 1: Peek empty DSQ - should return NULL */
> +		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
> +		dsq_peek_result1 = (long)peek_result; /* Should be 0 (NULL) */
> +
> +		/* Test 2: Insert task into test DSQ for testing in dispatch callback */
> +		dsq_inserted_pid = p->pid;
> +		scx_bpf_dsq_insert(p, test_dsq_id, 0, enq_flags);
> +		dsq_peek_result2_expected = (long)p; /* Expected the task we just inserted */
> +	} else if (!phase1_complete) {
> +		/* Still in phase 1, use real DSQ */
> +		scx_bpf_dsq_insert(p, real_dsq_id, 0, enq_flags);
> +	} else {
> +		/* Phase 2: Random DSQ insertion for stress testing */
> +		int random_dsq_id = get_random_dsq_id();
> +
> +		scx_bpf_dsq_insert(p, random_dsq_id, 0, enq_flags);
> +	}
> +}
> +
> +void BPF_STRUCT_OPS(peek_dsq_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	dispatch_count++;
> +
> +	/* Phase 1: Complete the simple peek test if we inserted a task but
> +	 * haven't tested peek yet
> +	 */
> +	if (insert_test_cpu == cpu && dsq_peek_result2 == -1) {
> +		struct task_struct *peek_result;
> +
> +		bpf_printk("peek_dsq_dispatch completing phase 1 peek test on cpu %d\n", cpu);

Ditto about "\n".

> +
> +		/* Test 3: Peek DSQ after insert - should return the task we inserted */
> +		peek_result = __COMPAT_scx_bpf_dsq_peek(test_dsq_id);
> +		/* Store the PID of the peeked task for comparison */
> +		dsq_peek_result2 = (long)peek_result;
> +		dsq_peek_result2_pid = peek_result ? peek_result->pid : -1;
> +
> +		/* Now consume the task since we've peeked at it */
> +		scx_bpf_dsq_move_to_local(test_dsq_id);
> +
> +		/* Mark phase 1 as complete */
> +		phase1_complete = 1;
> +		bpf_printk("Phase 1 complete, starting phase 2 stress testing\n");

Ditto.

> +	} else if (!phase1_complete) {
> +		/* Still in phase 1, use real DSQ */
> +		scx_bpf_dsq_move_to_local(real_dsq_id);
> +	} else {
> +		/* Phase 2: Scan all DSQs in the pool and try to move a task */
> +		if (!scan_dsq_pool()) {
> +			/* No tasks found in DSQ pool, fall back to real DSQ */
> +			scx_bpf_dsq_move_to_local(real_dsq_id);
> +		}
> +	}
> +}
> +
> +s32 BPF_STRUCT_OPS_SLEEPABLE(peek_dsq_init)
> +{
> +	s32 err;
> +	int i;
> +
> +	/* Always set debug values so we can see which version we're using */
> +	debug_ksym_exists = check_dsq_peek_ksym();

As mentioned above, this one can be:

	debug_ksym_exists = bpf_ksym_exists(scx_bpf_dsq_peek);

> +
> +	/* Initialize state first */
> +	insert_test_cpu = -1;
> +	enqueue_count = 0;
> +	dispatch_count = 0;
> +	phase1_complete = 0;
> +	total_peek_attempts = 0;
> +	successful_peeks = 0;
> +	dsq_create_result = 0; /* Reset to 0 before attempting */
> +
> +	/* Create the test and real DSQs */
> +	err = scx_bpf_create_dsq(test_dsq_id, -1);
> +	if (!err)
> +		err = scx_bpf_create_dsq(real_dsq_id, -1);
> +	if (err) {
> +		dsq_create_result = err;
> +		scx_bpf_error("Failed to create primary DSQ %d: %d", test_dsq_id, err);
> +		return err;
> +	}

How about:

	err = scx_bpf_create_dsq(test_dsq_id, -1);
	if (err) {
		dsq_create_result = err;
		scx_bpf_error("Failed to create DSQ %d: %d", test_dsq_id, err);
		return err;
	}
	err = scx_bpf_create_dsq(real_dsq_id, -1);
	if (err) {
		dsq_create_result = err;
		scx_bpf_error("Failed to create DSQ %d: %d", real_dsq_id, err);
		return err;
	}

Also, do we need to save the error in dsq_create_result (and similarly for
the other error variables)?

We could just use scx_bpf_error(), have UEI_RECORD(uei, ei) in ops.exit()
and use SCX_EQ() in peek_dsq.c to catch error conditions and trigger a
failure, see for example allowed_cpus[.bpf].c.
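
For example, something like this on the BPF side (just a sketch, the
existing selftests show the exact pattern):

	UEI_DEFINE(uei);

	void BPF_STRUCT_OPS(peek_dsq_exit, struct scx_exit_info *ei)
	{
		UEI_RECORD(uei, ei);
	}

and then an SCX_EQ() check on the recorded exit kind in peek_dsq.c after the
scheduler is unloaded, so that any scx_bpf_error() triggered during the run
fails the test.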

> +
> +	/* Create the DSQ pool for stress testing */
> +	bpf_for(i, 0, DSQ_POOL_SIZE) {
> +		int dsq_id = dsq_pool_base_id + i;
> +
> +		err = scx_bpf_create_dsq(dsq_id, -1);
> +		if (err) {
> +			dsq_create_result = err;
> +			scx_bpf_error("Failed to create DSQ pool entry %d: %d", dsq_id, err);
> +			return err;
> +		}
> +	}
> +
> +	dsq_create_result = 1; /* Success */
> +
> +	/* Initialize the peek results map */
> +	bpf_for(i, 0, MAX_SAMPLES) {
> +		u32 key = i;
> +		long pid = -1;
> +
> +		bpf_map_update_elem(&peek_results, &key, &pid, BPF_ANY);
> +	}
> +
> +	return 0;
> +}
> +
> +void BPF_STRUCT_OPS(peek_dsq_exit, struct scx_exit_info *ei)
> +{
> +	int i;
> +
> +	scx_bpf_destroy_dsq(test_dsq_id);
> +	scx_bpf_destroy_dsq(real_dsq_id);
> +	bpf_for(i, 0, DSQ_POOL_SIZE) {
> +		int dsq_id = dsq_pool_base_id + i;
> +
> +		scx_bpf_destroy_dsq(dsq_id);
> +	}
> +
> +	dsq_destroy_result = 1;
> +}
> +
> +SEC(".struct_ops.link")
> +struct sched_ext_ops peek_dsq_ops = {
> +	.enqueue = (void *)peek_dsq_enqueue,
> +	.dispatch = (void *)peek_dsq_dispatch,
> +	.init = (void *)peek_dsq_init,
> +	.exit = (void *)peek_dsq_exit,
> +	.name = "peek_dsq",
> +};

Thanks,
-Andrea
