linux-kernel - Re: [PATCH 10/11] sched_ext: Implement scx_bpf_dispatch[_vtime]_from

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZtOH1YlEgyP45UkU@gpd3>
Date: Sat, 31 Aug 2024 23:15:01 +0200
From: Andrea Righi <andrea.righi@...ux.dev>
To: Tejun Heo <tj@...nel.org>
Cc: void@...ifault.com, kernel-team@...a.com, linux-kernel@...r.kernel.org,
	Daniel Hodges <hodges.daniel.scott@...il.com>,
	Changwoo Min <multics69@...il.com>,
	Dan Schatzberg <schatzberg.dan@...il.com>
Subject: Re: [PATCH 10/11] sched_ext: Implement
 scx_bpf_dispatch[_vtime]_from_dsq()

On Sat, Aug 31, 2024 at 06:20:58AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Sat, Aug 31, 2024 at 04:30:57PM +0200, Andrea Righi wrote:
> ...
> > > @@ -5511,7 +5516,7 @@ __bpf_kfunc void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
> > >   * scx_bpf_dispatch_vtime - Dispatch a task into the vtime priority queue of a DSQ
> > >   * @p: task_struct to dispatch
> > >   * @dsq_id: DSQ to dispatch to
> > > - * @slice: duration @p can run for in nsecs
> > > + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> > >   * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
> > 
> > Maybe allow to keep the current vtime if 0 is passed, similar to slice?
> 
> It's tricky as 0 is a valid vtime. It's unlikely but depending on how vtime
> is defined, it may wrap in a practical amount of time. More on this below.

Ok.

> 
> ...
> > > +	/*
> > > +	 * Can be called from either ops.dispatch() locking this_rq() or any
> > > +	 * context where no rq lock is held. If latter, lock @p's task_rq which
> > > +	 * we'll likely need anyway.
> > > +	 */
> > 
> > About locking, I was wondering if we could provide a similar API
> > (scx_bpf_dispatch_lock()?) to use scx_bpf_dispatch() from any context
> > and not necessarily from ops.select_cpu() / ops.enqueue() or
> > ops.dispatch().
> > 
> > This would be really useful for user-space schedulers, since we could
> > use scx_bpf_dispatch() directly and get rid of the
> > BPF_MAP_TYPE_RINGBUFFER complexity.
> 
> One difference between scx_bpf_dispatch() and scx_bpf_dispatch_from_dsq() is
> that the former is designed to be safe to call from any context under any
> locks by doing the actual dispatches asynchronously. This is primarily to
> allow scx_bpf_dispatch() to be called under BPF locks as they are used to
> transfer the ownership of tasks from the BPF side to the kernel side. This
> makes it more difficult to make scx_bpf_dispatch() more flexible. The way
> BPF locks are currently developing, we might not have to worry about killing
> the system through deadlocks but it'd still be very prone to soft deadlocks
> that kill the BPF scheduler if implemented synchronously. Maybe the solution
> here is bouncing to an irq_work or something. I'll think more on it.

Got it. Well, the idea was to reduce complexity in the user-space
schedulers, but if we need to increase complexity in the kernel to do
so, probably it's not a good idea.

Moreover, using the BPF_MAP_TYPE_RINGBUFFER is really fast now, the
overhead is pretty close to zero, so maybe we can keep this as a low
priority todo.

> 
> ...
> > > +__bpf_kfunc bool scx_bpf_dispatch_from_dsq(struct bpf_iter_scx_dsq *it__iter,
> > > +					   struct task_struct *p, u64 dsq_id,
> > > +					   u64 slice, u64 enq_flags)
> > > +{
> > > +	return scx_dispatch_from_dsq((struct bpf_iter_scx_dsq_kern *)it__iter,
> > > +				     p, dsq_id, slice, 0, enq_flags);
> > > +}
> > > +
> > > +/**
> > > + * scx_bpf_dispatch_vtime_from_dsq - Move a task from DSQ iteration to a PRIQ DSQ
> > > + * @it__iter: DSQ iterator in progress
> > > + * @p: task to transfer
> > > + * @dsq_id: DSQ to move @p to
> > > + * @slice: duration @p can run for in nsecs, 0 to keep the current value
> > > + * @vtime: @p's ordering inside the vtime-sorted queue of the target DSQ
> > > + * @enq_flags: SCX_ENQ_*
> > 
> > Hm... can we pass 6 arguments to a kfunc? I think we're limited to 5,
> > unless I'm missing something here.
> 
> Hah, I actually don't know and didn't test the vtime variant. Maybe I should
> just drop the @slice and @vtime. They can be set by the caller explicitly
> before calling these kfuncs anyway although there are some concerns around
> ownership (ie. the caller can't be sure that the task has already been
> dispatched by someone else before scx_bpf_dispatch_from_dsq() commits). Or
> maybe I should pack the optional arguments into a struct. I'll think more
> about it.

IMHO we can simply drop them, introducing a separate struct makes the
API a bit inconsistent with scx_bpf_dispatch() (and I don't think we
want to change also scx_bpf_dispatch() for that).

About the ownership, true... maybe we can accept a bit of fuzziness
in this case, also considering that this race can happen only when using
scx_bpf_dispatch_from_dsq().

Thanks,
-Andrea