Message-ID: <Z2Xh3x8CCLPardgm@gpd3>
Date: Fri, 20 Dec 2024 22:30:07 +0100
From: Andrea Righi <arighi@...dia.com>
To: Changwoo Min <multics69@...il.com>
Cc: tj@...nel.org, void@...ifault.com, mingo@...hat.com,
	peterz@...radead.org, changwoo@...lia.com, kernel-dev@...lia.com,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v6 2/6] sched_ext: Implement scx_bpf_now_ns()

Hi Changwoo,

On Fri, Dec 20, 2024 at 03:20:21PM +0900, Changwoo Min wrote:
...
> +/**
> + * scx_bpf_now_ns - Returns a high-performance monotonically non-decreasing
> + * clock for the current CPU. The clock returned is in nanoseconds.
> + *
> + * It provides the following properties:
> + *
> + * 1) High performance: Many BPF schedulers call bpf_ktime_get_ns() frequently
> + *  to account for execution time and track tasks' runtime properties.
> + *  Unfortunately, in some hardware platforms, bpf_ktime_get_ns() -- which
> + *  eventually reads a hardware timestamp counter -- is neither performant nor
> + *  scalable. scx_bpf_now_ns() aims to provide a high-performance clock by
> + *  using the rq clock in the scheduler core whenever possible.
> + *
> + * 2) High enough resolution for the BPF scheduler use cases: In most BPF
> + *  scheduler use cases, the required clock resolution is lower than the most
> + *  accurate hardware clock (e.g., rdtsc in x86). scx_bpf_now_ns() basically
> + *  uses the rq clock in the scheduler core whenever it is valid. It considers
> + *  that the rq clock is valid from the time the rq clock is updated
> + *  (update_rq_clock) until the rq is unlocked (rq_unpin_lock).
> + *
> + * 3) Monotonically non-decreasing clock for the same CPU: scx_bpf_now_ns()
> + *  guarantees the clock never goes backward when compared on the same
> + *  CPU. On the other hand, when comparing clocks across different CPUs,
> + *  there is no such guarantee -- the clock can go backward. It provides a
> + *  monotonically *non-decreasing* clock so that two scx_bpf_now_ns() calls
> + *  on the same CPU return the same clock value during the period in which
> + *  the rq clock is valid.
> + */
> +__bpf_kfunc u64 scx_bpf_now_ns(void)
> +{
> +	struct rq *rq;
> +	u64 clock;
> +
> +	preempt_disable();
> +
> +	/*
> +	 * If the rq clock is valid, use the cached rq clock.
> +	 * Otherwise, return a fresh rq glock.

s/glock/clock/

> +	 *
> +	 * Note that scx_bpf_now_ns() is re-entrant between a process
> +	 * context and an interrupt context (e.g., timer interrupt).
> +	 * However, we don't need to consider the race between them
> +	 * because such race is not observable from a caller.
> +	 */
> +	rq = this_rq();
> +	clock = READ_ONCE(rq->scx.clock);
> +
> +	if (!(READ_ONCE(rq->scx.flags) & SCX_RQ_CLK_VALID)) {
> +		clock = sched_clock_cpu(cpu_of(rq));
> +
> +		/*
> +		 * The rq clock is updated outside of the rq lock.
> +		 * In this case, keep the updated rq clock invalid so the next
> +		 * kfunc call outside the rq lock gets a fresh rq clock.
> +		 */
> +		scx_rq_clock_update(rq, clock, false);
> +	}
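
For context, a typical use in a BPF scheduler would be runtime accounting from
the ops.running/ops.stopping callbacks, which both run on the same CPU for a
given slice, so the per-CPU monotonicity above is all that is needed. A rough,
hypothetical sketch (not from this patch; the map and callback names are made
up):

#include <scx/common.bpf.h>

struct task_ctx {
	u64	running_at;	/* scx_bpf_now_ns() when the task went on-CPU */
	u64	runtime;	/* accumulated on-CPU time in ns */
};

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, struct task_ctx);
} task_ctxs SEC(".maps");

void BPF_STRUCT_OPS(example_running, struct task_struct *p)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctxs, p, NULL,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (tctx)
		tctx->running_at = scx_bpf_now_ns();
}

void BPF_STRUCT_OPS(example_stopping, struct task_struct *p, bool runnable)
{
	struct task_ctx *tctx;

	tctx = bpf_task_storage_get(&task_ctxs, p, NULL, 0);
	if (tctx)
		tctx->runtime += scx_bpf_now_ns() - tctx->running_at;
}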

I was wondering if we could use a special value for clock (like ~0ULL or
similar) to mark the clock as invalid.

This way, we could get rid of the extra READ_ONCE(rq->scx.flags) logic for
checking the clock validity. And if the actual clock happens to match the
special value, we'd simply re-read the TSC, which shouldn't be a big issue
in theory.
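
Something along these lines (untested, SCX_RQ_CLK_INVALID is a made-up name,
just to illustrate the idea):

#define SCX_RQ_CLK_INVALID	(~0ULL)	/* hypothetical sentinel */

__bpf_kfunc u64 scx_bpf_now_ns(void)
{
	struct rq *rq;
	u64 clock;

	preempt_disable();

	rq = this_rq();
	clock = READ_ONCE(rq->scx.clock);
	if (clock == SCX_RQ_CLK_INVALID) {
		/*
		 * rq_unpin_lock() would store the sentinel instead of
		 * clearing SCX_RQ_CLK_VALID; if a real timestamp ever
		 * collides with it we just take this slow path again.
		 */
		clock = sched_clock_cpu(cpu_of(rq));
	}

	preempt_enable();

	return clock;
}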

That said, I'm not sure this would yield any real performance benefit, so
the current approach is probably fine as it is; feel free to ignore this.

-Andrea
