linux-kernel - Re: [PATCH v10 7/7] sched: Split scheduler and execution contexts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20240610195106.lsswyuutossuo45q@airbuntu>
Date: Mon, 10 Jun 2024 20:51:06 +0100
From: Qais Yousef <qyousef@...alina.io>
To: John Stultz <jstultz@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Joel Fernandes <joelaf@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Valentin Schneider <vschneid@...hat.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>,
	Zimuzo Ezeozue <zezeozue@...gle.com>,
	Youssef Esmat <youssefesmat@...gle.com>,
	Mel Gorman <mgorman@...e.de>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>,
	Boqun Feng <boqun.feng@...il.com>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Xuewen Yan <xuewen.yan94@...il.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Metin Kaya <Metin.Kaya@....com>,
	Thomas Gleixner <tglx@...utronix.de>, kernel-team@...roid.com,
	Connor O'Brien <connoro@...gle.com>
Subject: Re: [PATCH v10 7/7] sched: Split scheduler and execution contexts

On 06/04/24 12:33, John Stultz wrote:
> On Tue, Jun 4, 2024 at 7:11 AM Qais Yousef <qyousef@...alina.io> wrote:
> >
> > On 05/06/24 21:54, John Stultz wrote:
> > > From: Peter Zijlstra <peterz@...radead.org>
> > >
> > > Let's define the scheduling context as all the scheduler state
> > > in task_struct for the task selected to run, and the execution
> > > context as all state required to actually run the task.
> > >
> > > Currently both are intertwined in task_struct. We want to
> > > logically split these such that we can use the scheduling
> > > context of the task selected to be scheduled, but use the
> > > execution context of a different task to actually be run.
> > >
> > > To this purpose, introduce rq_selected() macro to point to the
> > > task_struct selected from the runqueue by the scheduler, and
> > > will be used for scheduler state, and preserve rq->curr to
> > > indicate the execution context of the task that will actually be
> > > run.
> >
> > This is subjective opinion of course. But I am not a fan of rq_selected().
> > I think we are better with more explicit
> >
> >         rq->currs       /* current sched context */
> >         rq->currx       /* current execution context */
> 
> Yeah. While I've got my own opinions, I'm very flexible on names/style
> details if maintainers want it any specific way.
> 
> But just personally, this strikes me as easier to misread, and not as
> descriptive as the "selected" vs the "executing" task.

rq_selected() makes me think we're returning an rq, not a task.

> 
> > and the helpers
> >
> >         task_curr_sctx()
> >         task_curr_xctx()
> 
> Though would rq_curr_sctx() and rq_curr_xctx() make more sense, as
> it's a property of the runqueue (not the task)?

Hmm maybe they could be viewed as property of the runqueue. But I think they
belong to the task. With the current suggestion we are keeping two task
references, one to get the execution context, and the other to get the
scheduling context. But in reality we have one current task, not two. I think
the fact that we look at current task should be a sign that its a property of
the task, not the runqueue otherwise the info should be stored in struct rq
directly.

IOW, rq keeps track of current task, but current task might have its scheduling
properties upgraded due to inheritance. I don't see the latter is a property of
the rq. But rq can help keep track of the two tasks to make accessing the two
properties (scheduling vs execution) easier.

> 
> > This will ensure all current users of rq->curr will generate a build error and
> > be better audited what they are supposed to be. And I think the code is more
> > readable this way.
> 
> So earlier I had changed it to: rq_curr() and rq_selected(), but Peter
> complained about the rq_curr() macro making things indirect, so I
> dropped it (though I kept the rq_selected() macro because it allowed
> things to collapse down better if SCHED_PROXY_EXEC was not
> configured).  I do agree the macro (along with renaming the field)
> helps ensure each use is considered properly, but I also want to align
> with the feedback Peter gave previously.

Yes please, let's wait for the maintainers. Don't want to waste your time with
changes that might not be agreeable for sure :-)

> 
> > Another way to do it is to split task_struct to sched and exec context (rather
> > than control the reference to curr as done here). Then curr would be always the
> > same, but reference to its sched context would change with inheritance.
> >
> > ie:
> >
> >         struct task_sctx {
> >                 ...
> >         };
> >
> >         struct task_struct {
> >                 struct task_sctx *sctx;
> >                 /* exec context is embedded as-is */
> >                 ...
> >         };
> >
> > Which means would just have to update all accessors to sched context to do
> > extra dereference. Which I am not sure if a problematic extra level of
> > indirection.
> >
> > I see the latter approach  more intuitive and explicit about what is a sched
> > context that is supposed to be inherited and what is not.
> 
> Hrm. So I think I see what you're saying. Particularly with the logic
> where we sometimes care about the execution context and sometimes care
> about the scheduler context, I can understand wanting to better define
> that separation with structures.  One thing to note from your example
> above for unblocked tasks, the scheduler and execution contexts can be
> the same. So you'd likely want the task_struct to not just have a
> reference to the scheduler context, but its own internal scheduler
> context structure as well.
> 
> That said, I'm not sure if adding the current scheduler context
> reference into the task struct as you propose above is the right
> approach. Particularly because while this is a generalization of
> priority inheritance, it isn't actually implemented in the same way as
> priority inheritance.  We don't raise and lower the lock owner task's
> priority.  Instead we pick a task from the rq, and if its blocked, we
> traverse the blocked_on chain to find the runnable owner.  So the two
> are really properties of the rq and not really part of the task. And
> remember there can be multiple blocked waiters on a mutex, and they
> can both be used to run the lock owner. So one would have to be

Hmm I would have thought this should still work the same. But we'd end up with
maintaining current's scheduling context. In my head what should different is
what is being updated, rq->currs/rq->currx or rq->curr->sctx.

But you've been looking at this for long enough now and I could be missing
some (many) things for sure.

> setting the owner's scheduler context reference on each selection,
> which seems a bit noisy (I also will have to think about the
> serialization as you may need to hold the rq lock to keep the donor's
> sched context in place (as your crossing task to task), which might be
> an unintuitive/unexpected dependency).

It could be oversimplified in my head..

> 
> Instead as the two contexts are really attributes of the runqueue, it

I guess I am struggling to see the contexts are attributes of the runqueue. The
current task, yes. But what scheduling context the task has, is its own
property the way I see it.

> would seem like one could have what is currently rq_selected() just
> return the struct task_sctx from the selected task (instead of an
> entire task), but that might also cause some annoyance for things like
> debugging (as having both selected and current task's comm and pid
> available together are very useful for understanding what's going on).
> 
> > I generally think breaking down task structure would be good. Hopefully the new
> > data perf can help us make better decision on what attributes to group
> > together. And generally this might be a good exercise to rethink its content
> > which grown organicaly over the year. Maybe we want to create similar such
> > smaller wrapper structs for other fields too.
> >
> > I **think** with this approach we would just need go fix compilation errors to
> > do p->sctx->{attribute}.
> >
> > Proxy exec later would only update the reference to this struct to enforce
> > inheritance for rq->curr->sctx to point to the donor's context.
> 
> I'll need to think about this a bit more (and hopefully others can
> chime in on if they see it as worth it). It would be a *lot* of churn
> to tweak all the users to have the extra structure indirection, so I'm
> hesitant to go running after this without wider agreement. But I do
> see there would be some benefit to avoiding some of the confusion that
> can come from having to deal with two full separate tasks in much of
> the logic.

Don't change anything yet please :-) I could be missing a lot of details. But
I think for now it's good to know why one alternative could potentially be
better than the other without any specific action. It's the type of detail that
can be changed at anytime later - but for me I see these as properties of the
tasks and that's what promoted me to enquire why not done that way instead.

The proposed way does seem the fastest path to get something working, so I am
not against it. But I had to ask if it is the best thing longer term.


Thanks!

--
Qais Yousef