Message-ID: <16a9e631-22c0-454a-b56f-c5206ec3d1dc@igalia.com>
Date: Fri, 17 Jan 2025 16:08:45 +0900
From: Changwoo Min <changwoo@...lia.com>
To: Tejun Heo <tj@...nel.org>
Cc: void@...ifault.com, arighi@...dia.com, kernel-dev@...lia.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/7] sched_ext: Implement event counter infrastructure and
add an event
Hello,
Thank you for the review!
On 25. 1. 17. 10:33, Tejun Heo wrote:
> Hello,
>
> On Fri, Jan 17, 2025 at 12:15:37AM +0900, Changwoo Min wrote:
>> +/*
>> + * Collection of event counters.
>> + */
>> +struct scx_event_stat {
>
> If we keep this struct, maybe name it scx_event_stats?
Sure, I will change it as suggested. Also, I will change
scx_bpf_event_stat() to scx_bpf_event_stats() for consistency.
>
>> + /*
>> + * If ops.select_cpu() returns a CPU which can't be used by the task,
>> + * the core scheduler code silently picks a fallback CPU.
>> + */
>> + u64 INVAL_SELECT_CPU;
>
> Let's do $COMPONENT_$EVENT, so SELECT_CPU_INVALID (or maybe something more
> specific than invalid, like disallowed or fallback?).
$COMPONENT_$EVENT sounds more systematic. Thanks for the
suggestion. What about SELECT_CPU_FALLBACK?
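Applied to the current field, that would look like (untested):

	/*
	 * If ops.select_cpu() returns a CPU which can't be used by the
	 * task, the core scheduler code silently picks a fallback CPU.
	 */
	u64		SELECT_CPU_FALLBACK;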
>> +#define SCX_EVENT_IDX(e) (offsetof(struct scx_event_stat, e)/sizeof(u64))
>> +#define SCX_EVENT_END_IDX() (sizeof(struct scx_event_stat)/sizeof(u64))
>> +#define SCX_EVENT_DEFINE(e) SCX_EVENT_##e = SCX_EVENT_IDX(e)
>> +
>> +/*
>> + * Types of event counters.
>> + */
>> +enum scx_event_kind {
>> + SCX_EVENT_BEGIN = 0,
>
> I think 0 BEGIN value can be implicit.
Yup, I will remove it in the next version.
>> + SCX_EVENT_DEFINE(INVAL_SELECT_CPU),
>> + SCX_EVENT_END = SCX_EVENT_END_IDX(),
>
> SCX_NR_EVENTS? Also, we don't really need to macro to count the enums.
For SCX_EVENT_BEGIN/END, I followed the style of
SCX_OPI_BEGIN/END, but you are right; I will remove
SCX_EVENT_END_IDX() in the next version.
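Roughly like this, assuming the struct is renamed to
scx_event_stats and the end marker to SCX_NR_EVENTS (untested
sketch):

enum scx_event_kind {
	SCX_EVENT_DEFINE(SELECT_CPU_FALLBACK),
	SCX_NR_EVENTS = sizeof(struct scx_event_stats) / sizeof(u64),
};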
> This feels a bit odd to me. Why does it need both the struct fields and
> indices? Can we do either one of those?
Before submitting the patch set, I tested two approaches: one
using pure enums with the actual counters stored in an array, and
another (the current one) using struct fields. I rejected the
first approach because the BPF verifier does not allow changing
the array size, which breaks BPF binary compatibility. The second
approach preserves BPF binary compatibility through CO-RE.
However, iterating over all the events with struct fields alone
is cumbersome, so I added the enums. I know the enums are only
usable in kernel code due to the binary compatibility concern,
but I thought adding them would be better than bloating the
scx_dump_state() code. Does that make sense to you?
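For example, with the offsetof()-based indices, scx_dump_state()
can walk the counters generically by treating the struct as an
array of u64. An untested sketch, assuming the renames above and
scx_dump_state()'s existing dump_line() helper:

	struct scx_event_stats events = {};
	u64 *counters = (u64 *)&events;
	int kind;

	/* ... aggregate the per-CPU counters into 'events' here ... */

	for (kind = 0; kind < SCX_NR_EVENTS; kind++)
		dump_line(&s, "%s: %llu", scx_event_stat_str[kind],
			  counters[kind]);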
>
>> +static const char *scx_event_stat_str[] = {
>> + [SCX_EVENT_INVAL_SELECT_CPU] = "invalid_select_cpu",
>> +};
>
> and hopefully derive this through # macro arg expansion?
That's a good idea. I will change it.
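With a hypothetical SCX_EVENT_STR() helper, for example
(untested):

#define SCX_EVENT_STR(e)	[SCX_EVENT_##e] = #e

static const char *scx_event_stat_str[] = {
	SCX_EVENT_STR(SELECT_CPU_FALLBACK),
};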
>
>> +/*
>> + * The event counter is organized by a per-CPU variable to minimize the
>> + * accounting overhead without synchronization. A system-wide view on the
>> + * event counter is constructed when requested by scx_bpf_get_event_stat().
>> + */
>> +static DEFINE_PER_CPU(struct scx_event_stat, event_stats);
>
> May be better to include "cpu" in the variable name.
I will add a "_cpu" suffix, renaming it to "event_stats_cpu".
>> +/**
>> + * scx_add_event - Increase an event counter for 'name' by 'cnt'
>> + * @name: an event name defined in struct scx_event_stat
>> + * @cnt: the number of the event occured
>> + */
>> +#define scx_add_event(name, cnt) ({ \
>> + struct scx_event_stat *__e; \
>> + __e = get_cpu_ptr(&event_stats); \
>> + WRITE_ONCE(__e->name, __e->name+ (cnt)); \
>> + put_cpu_ptr(&event_stats); \
>
> this_cpu_add()?
That's handier. I will change it.
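That also removes the explicit get_cpu_ptr()/put_cpu_ptr() pair,
since this_cpu_add() is preemption-safe by itself. Untested, with
the per-CPU variable renamed as discussed above:

#define scx_add_event(name, cnt) do {					\
	this_cpu_add(event_stats_cpu.name, (cnt));			\
} while (0)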
>
>> @@ -3607,8 +3667,10 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
>> *ddsp_taskp = NULL;
>> if (ops_cpu_valid(cpu, "from ops.select_cpu()"))
>> return cpu;
>> - else
>> + else {
>> + scx_add_event(INVAL_SELECT_CPU, 1);
>> return prev_cpu;
>> + }
>
> formatting:
>
> if () {
> } else {
> }
My bad. Will fix it.
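i.e. (with the event renamed; though, per your next point, the
counting will likely move out of this path entirely):

	if (ops_cpu_valid(cpu, "from ops.select_cpu()")) {
		return cpu;
	} else {
		scx_add_event(SELECT_CPU_FALLBACK, 1);
		return prev_cpu;
	}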
> Also, I'm not sure this is a useful event to count. False ops_cpu_valid()
> indicates that the returned CPU is not even possible and the scheduler is
> ejected right away. What's more interesting is
> kernel/sched/core.c::select_task_rq() tripping on !is_cpu_allowed() and
> falling back using select_fallback_rq().
>
> We can either hook into core.c or just compare the ops.select_cpu() picked
> CPU against the CPU the task ends up on in enqueue_task_scx().
Modifying core.c will be more direct and straightforward. Also, I
think it would be better to split this commit into two: one for
the infrastructure and another for the SELECT_CPU_FALLBACK event,
which touches core.c. I will move the necessary infrastructure
code into kernel/sched/sched.h so that scx_add_event() can be
used from core.c.
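Roughly like the following untested sketch for select_task_rq()
in kernel/sched/core.c, where task_on_scx() is only meant to
restrict the accounting to sched_ext tasks:

	if (unlikely(!is_cpu_allowed(p, cpu))) {
		cpu = select_fallback_rq(task_cpu(p), p);
		if (task_on_scx(p))
			scx_add_event(SELECT_CPU_FALLBACK, 1);
	}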
>
>> @@ -5053,6 +5115,15 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
>> scx_rq_clock_invalidate(rq);
>> }
>>
>> + /*
>> + * Clear event counters so the next scx scheduler always gets
>> + * fresh event counter values.
>> + */
>> + for_each_possible_cpu(cpu) {
>> + struct scx_event_stat *e = per_cpu_ptr(&event_stats, cpu);
>> + memset(e, 0, sizeof(*e));
>> + }
>> +
>
> Wouldn't we want to keep these counters intact on ejection so that the
> scheduler's ejection path can capture the counters if necessary? Resetting
> on load probably is better.
That makes sense. I will move the resetting code to the load
path.
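i.e., moving the loop from scx_ops_disable_workfn() to the enable
path (untested):

	/* on load: start the new scheduler with fresh counters */
	for_each_possible_cpu(cpu) {
		struct scx_event_stats *e = per_cpu_ptr(&event_stats_cpu, cpu);

		memset(e, 0, sizeof(*e));
	}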
Regards,
Changwoo Min