linux-kernel - Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <9a2bfe0d-3a1d-4917-bf10-33259c79a3bb@igalia.com>
Date: Tue, 19 Nov 2024 10:19:44 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: tj@...nel.org, void@...ifault.com, mingo@...hat.com,
 kernel-dev@...lia.com, linux-kernel@...r.kernel.org,
 Changwoo Min <changwoo@...lia.com>
Subject: Re: [PATCH 2/5] sched_ext: Manage the validity of scx_rq_clock

Hello,

Thank you for the prompt feedback. I hope the following answers
can clarify most of your doubts.

On 24. 11. 18. 18:41, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 12:46:32AM +0900, Changwoo Min wrote:
> 
>> The main reason to keep the second copy (rq->scx.clock) is that
>> a BPF scheduler can call scx_bpf_clock_get_ns() at almost any
>> time in any context, including any of sched_ext operations, BPF
>> timer callbacks, BPF syscalls, kprobes, and so on.
> 
> If it's going to be a BPF wide thing, why is it presented as part of
> sched_ext ? That makes no sense.

There is a confusion here. scx_bpf_clock_get_ns() is for BPF
schedulers, not for random BPF programs. In almost all cases, it
will be used in the shced_ext operations, such as ops.running()
and ops.stopping(), to implement scheduling policies. However, if
BPF schedulers use other BPF features, such as BPF timer,
scx_bpf_clock_get_ns() also can be used. For example, scx_lavd uses
a BPF timer for periodic background processing; scx_lavd and
scx_flash use kprobe to trace futex system calls. Also, since
scx_bpf_clock_get_ns() relies on rq lock, it is not meaningful
outside of the BPF schedulers. Hence, it should be a part of
sched_ext.

>> Another approach would be to extend struct sched_clock_data (in
>> kernel/sched/clock.c) to store the update flag
>> (SCX_RQ_CLK_UPDATED). This would be the best regarding the number
>> of cache line accesses. However, that would be an overkill since
>> now sched_clock_data stores the sched_ext-specific data.
>> I thought it would be better to keep sched_ext specific data in
>> one place, struct scx_rq, for managibility.
> 
> What's the purpose of that flag? Why can't BPF use sched_clock_local()
> and call it a day?

Let's suppose the following timeline:

   T1. rq_lock(rq)
   T2. update_rq_clock(rq)
   T3. a sched_ext BPF operation
   T4. rq_unlock(rq)
   T5. a sched_ext BPF operation
   T6. rq_lock(rq)
   T7. update_rq_clock(rq)

For [T2, T4), we consider that rq clock is valid
(SCX_RQ_CLK_UPDATED is set), so scx_bpf_clock_get_ns calls during
[T2, T4) (including T3) will return the rq clock updated at T2.
Let's think about what we should do for the duration [T4, T7)
when a BPF scheduler can still call scx_bpf_clock_get_ns (T5).
During that duration, we consider the rq clock is invalid
(SCX_RQ_CLK_UPDATED is unset). So when calling
scx_bpf_clock_get_ns at T5, we call sched_clock() to get the
fresh clock.

I think the term `UPDATED` was misleading. I will change it to
`VALID` in the next version.

> Growing sched_clock_data shouldn't be a problem, it's only 24 bytes, so
> we have plenty free bytes there.

Alright. I will change the current implementation and extend
`struct sched_clock_data` to store the `VALID` flag in the next
version.