Date: Tue, 4 Jun 2024 12:33:13 -0700
From: John Stultz <jstultz@...gle.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: LKML <linux-kernel@...r.kernel.org>, Peter Zijlstra <peterz@...radead.org>, 
	Joel Fernandes <joelaf@...gle.com>, Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
	Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Valentin Schneider <vschneid@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Zimuzo Ezeozue <zezeozue@...gle.com>, 
	Youssef Esmat <youssefesmat@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Daniel Bristot de Oliveira <bristot@...hat.com>, Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, 
	Boqun Feng <boqun.feng@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>, 
	Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak <kprateek.nayak@....com>, 
	Metin Kaya <Metin.Kaya@....com>, Thomas Gleixner <tglx@...utronix.de>, kernel-team@...roid.com, 
	"Connor O'Brien" <connoro@...gle.com>
Subject: Re: [PATCH v10 7/7] sched: Split scheduler and execution contexts

On Tue, Jun 4, 2024 at 7:11 AM Qais Yousef <qyousef@...alina.io> wrote:
>
> On 05/06/24 21:54, John Stultz wrote:
> > From: Peter Zijlstra <peterz@...radead.org>
> >
> > Let's define the scheduling context as all the scheduler state
> > in task_struct for the task selected to run, and the execution
> > context as all state required to actually run the task.
> >
> > Currently both are intertwined in task_struct. We want to
> > logically split these such that we can use the scheduling
> > context of the task selected to be scheduled, but use the
> > execution context of a different task to actually be run.
> >
> > To this purpose, introduce rq_selected() macro to point to the
> > task_struct selected from the runqueue by the scheduler, and
> > will be used for scheduler state, and preserve rq->curr to
> > indicate the execution context of the task that will actually be
> > run.
>
> This is a subjective opinion of course. But I am not a fan of rq_selected().
> I think we are better off with the more explicit
>
>         rq->currs       /* current sched context */
>         rq->currx       /* current execution context */

Yeah. While I've got my own opinions, I'm very flexible on names/style
details if maintainers want it any specific way.

But just personally, this strikes me as easier to misread, and not as
descriptive as the "selected" vs the "executing" task.
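For reference, the split the patch describes might be sketched roughly like this. This is a hypothetical, heavily simplified userspace sketch: the real struct rq has many more fields, and the config symbol and field names here just mirror the discussion rather than the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real task_struct, reduced to one field. */
struct task_struct {
    int prio;
};

/* Stand-in for the real struct rq. */
struct rq {
    struct task_struct *curr;          /* execution context: the task actually run */
#ifdef CONFIG_SCHED_PROXY_EXEC
    struct task_struct *curr_selected; /* scheduling context: the task picked by the scheduler */
#endif
};

#ifdef CONFIG_SCHED_PROXY_EXEC
#define rq_selected(rq) ((rq)->curr_selected)
#else
/* Without proxy execution, the two contexts collapse into one,
 * so rq_selected() compiles down to a plain rq->curr access. */
#define rq_selected(rq) ((rq)->curr)
#endif
```

The macro form is what lets the !SCHED_PROXY_EXEC case collapse away entirely, as mentioned below.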

> and the helpers
>
>         task_curr_sctx()
>         task_curr_xctx()

Though would rq_curr_sctx() and rq_curr_xctx() make more sense, as
it's a property of the runqueue (not the task)?

> This will ensure all current users of rq->curr generate a build error and
> can be better audited for which context they are supposed to use. And I think
> the code is more readable this way.

So earlier I had changed it to: rq_curr() and rq_selected(), but Peter
complained about the rq_curr() macro making things indirect, so I
dropped it (though I kept the rq_selected() macro because it allowed
things to collapse down better if SCHED_PROXY_EXEC was not
configured).  I do agree the macro (along with renaming the field)
helps ensure each use is considered properly, but I also want to align
with the feedback Peter gave previously.

> Another way to do it is to split task_struct into sched and exec contexts
> (rather than controlling the reference to curr as done here). Then curr would
> always be the same, but the reference to its sched context would change with
> inheritance.
>
> ie:
>
>         struct task_sctx {
>                 ...
>         };
>
>         struct task_struct {
>                 struct task_sctx *sctx;
>                 /* exec context is embedded as-is */
>                 ...
>         };
>
> Which means we would just have to update all accessors of the sched context
> to do an extra dereference. I am not sure whether that extra level of
> indirection is problematic.
>
> I see the latter approach as more intuitive and explicit about which sched
> context is supposed to be inherited and which is not.

Hrm. So I think I see what you're saying. Particularly with the logic
where we sometimes care about the execution context and sometimes care
about the scheduler context, I can understand wanting to better define
that separation with structures.  One thing to note about your example
above: for unblocked tasks, the scheduler and execution contexts are
the same. So you'd likely want the task_struct to have not just a
reference to the scheduler context, but its own embedded scheduler
context structure as well.
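Concretely, Qais's proposal extended with that note might look something like the sketch below. All names here (task_sctx, own_sctx, sctx) are illustrative only, following the quoted example rather than any actual patch.

```c
#include <assert.h>

/* Hypothetical container for the sched attributes a donor can lend. */
struct task_sctx {
    int prio;                    /* attributes subject to inheritance */
};

struct task_struct {
    struct task_sctx own_sctx;   /* this task's native sched context */
    struct task_sctx *sctx;      /* normally &own_sctx; a donor's under inheritance */
    /* execution context fields embedded as-is */
};
```

All accessors would then read sched attributes via p->sctx->prio, so redirecting the single pointer is enough to switch a task onto a donor's context.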

That said, I'm not sure if adding the current scheduler context
reference into the task struct as you propose above is the right
approach. Particularly because while this is a generalization of
priority inheritance, it isn't actually implemented in the same way as
priority inheritance.  We don't raise and lower the lock owner task's
priority.  Instead we pick a task from the rq, and if it's blocked, we
traverse the blocked_on chain to find the runnable owner.  So the two
contexts are really properties of the rq and not really part of the
task. And remember there can be multiple blocked waiters on a mutex,
and any of them can be used to run the lock owner. So one would have
to set the owner's scheduler context reference on each selection,
which seems a bit noisy. (I also will have to think about the
serialization, as you may need to hold the rq lock to keep the donor's
sched context in place (as you're crossing from task to task), which
might be an unintuitive/unexpected dependency.)
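The selection step described above might be sketched like this. Again a heavily simplified, hypothetical sketch: the struct definitions are minimal stand-ins, and real code would need the proper rq/mutex locking that this omits entirely. The field names blocked_on and owner mirror the discussion.

```c
#include <assert.h>
#include <stddef.h>

struct task_struct;

/* Minimal stand-in for a mutex: just its current owner. */
struct mutex {
    struct task_struct *owner;
};

struct task_struct {
    struct mutex *blocked_on;   /* NULL when the task is runnable */
};

/* Given the task selected by the scheduler (the scheduling context),
 * walk the blocked_on chain to find a runnable owner to actually
 * execute (the execution context). No locking shown. */
static struct task_struct *find_exec_ctx(struct task_struct *selected)
{
    struct task_struct *p = selected;

    while (p->blocked_on)
        p = p->blocked_on->owner;
    return p;
}
```

The point of the sketch is that the donor/owner pairing is discovered at pick time from the rq, rather than being stored in any one task.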

Instead, as the two contexts are really attributes of the runqueue, it
would seem one could have what is currently rq_selected() just return
the struct task_sctx from the selected task (instead of the entire
task). But that might also cause some annoyance for things like
debugging (having both the selected and current tasks' comm and pid
available together is very useful for understanding what's going on).

> I generally think breaking down the task structure would be good. Hopefully
> the new perf data can help us make a better decision on which attributes to
> group together. And generally this might be a good exercise to rethink its
> contents, which have grown organically over the years. Maybe we want to
> create similar smaller wrapper structs for other fields too.
>
> I **think** with this approach we would just need to go fix compilation
> errors to do p->sctx->{attribute}.
>
> Proxy exec would later only update the reference to this struct to enforce
> inheritance, pointing rq->curr->sctx to the donor's context.

I'll need to think about this a bit more (and hopefully others can
chime in on whether they see it as worthwhile). It would be a *lot* of
churn to tweak all the users to have the extra structure indirection,
so I'm hesitant to go running after this without wider agreement. But I do
see there would be some benefit to avoiding some of the confusion that
can come from having to deal with two full separate tasks in much of
the logic.

thanks
-john
