linux-kernel - Re: [PATCH v7 08/23] sched: Split scheduler and execution contexts

Message-ID: <a76adfd2-a17d-4342-af7e-5d17cf10dab7@arm.com>
Date: Thu, 21 Dec 2023 10:43:54 +0000
From: Metin Kaya <metin.kaya@....com>
To: John Stultz <jstultz@...gle.com>, LKML <linux-kernel@...r.kernel.org>
Cc: Peter Zijlstra <peterz@...radead.org>, Joel Fernandes
 <joelaf@...gle.com>, Qais Yousef <qyousef@...gle.com>,
 Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Valentin Schneider <vschneid@...hat.com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Zimuzo Ezeozue <zezeozue@...gle.com>, Youssef Esmat
 <youssefesmat@...gle.com>, Mel Gorman <mgorman@...e.de>,
 Daniel Bristot de Oliveira <bristot@...hat.com>,
 Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>,
 Boqun Feng <boqun.feng@...il.com>, "Paul E. McKenney" <paulmck@...nel.org>,
 Xuewen Yan <xuewen.yan94@...il.com>, K Prateek Nayak
 <kprateek.nayak@....com>, Thomas Gleixner <tglx@...utronix.de>,
 kernel-team@...roid.com, Connor O'Brien <connoro@...gle.com>
Subject: Re: [PATCH v7 08/23] sched: Split scheduler and execution contexts

On 20/12/2023 12:18 am, John Stultz wrote:
> From: Peter Zijlstra <peterz@...radead.org>
> 
> Let's define the scheduling context as all the scheduler state
> in task_struct for the task selected to run, and the execution
> context as all state required to actually run the task.
> 
> Currently both are intertwined in task_struct. We want to
> logically split these such that we can use the scheduling
> context of the task selected to be scheduled, but use the
> execution context of a different task to actually be run.

Should we update Documentation/kernel-hacking/hacking.rst (line 348, 
:c:macro:`current`) or another appropriate document to describe the 
separation of scheduling and execution contexts?
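
To make the split concrete for documentation purposes, here is a minimal
user-space sketch in C. It is not kernel code: toy_task, toy_rq, toy_tick()
and the two predicates are hypothetical names that only mirror the roles of
rq->curr, rq->curr_selected, task_current() and task_current_selected() in
the quoted patch. It also assumes a proxy-execution scenario in which the
scheduler selects one task while a different task actually runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
struct toy_task {
	const char *name;
	unsigned long long runtime_ns;	/* scheduler accounting state */
};

struct toy_rq {
	struct toy_task *curr;		/* execution context: what runs on the CPU */
	struct toy_task *curr_selected;	/* scheduling context: what the scheduler picked */
};

/*
 * Charge elapsed time to the scheduling context, in the same spirit as
 * scheduler_tick() passing rq_selected(rq) to task_tick() in the patch.
 */
static void toy_tick(struct toy_rq *rq, unsigned long long delta_ns)
{
	assert(rq->curr && rq->curr_selected);
	rq->curr_selected->runtime_ns += delta_ns;
}

/* Mirrors task_current(): is p the current execution context? */
static int toy_task_current(const struct toy_rq *rq, const struct toy_task *p)
{
	return rq->curr == p;
}

/* Mirrors task_current_selected(): is p the current scheduling context? */
static int toy_task_current_selected(const struct toy_rq *rq,
				     const struct toy_task *p)
{
	return rq->curr_selected == p;
}
```

With curr pointing at one task and curr_selected at another, a call such as
toy_tick(&rq, 1000) charges the time only to the selected task, while
toy_task_current() is true only for the task that is actually running. When
both pointers refer to the same task, the two predicates agree.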

> 
> To this purpose, introduce the rq_selected() macro, which points
> to the task_struct selected from the runqueue by the scheduler
> and will be used for scheduler state, and preserve rq->curr to
> indicate the execution context of the task that will actually be
> run.
> 
> NOTE: Peter previously mentioned he didn't like the name
> "rq_selected()", but I've not come up with a better alternative.
> I'm very open to other name proposals.
> 
> Question for Peter: Dietmar suggested you'd prefer I drop the
> conditionalization of the scheduler context pointer on the rq
> (so rq_selected() would be open coded as rq->curr_selected or
> whatever we agree on for a name), but I'd think in the
> !CONFIG_SCHED_PROXY_EXEC case we'd want to avoid the wasted
> pointer and its use (since curr_selected would always be == curr)?
> If I'm wrong I'm fine switching this, but would appreciate
> clarification.
> 
> Cc: Joel Fernandes <joelaf@...gle.com>
> Cc: Qais Yousef <qyousef@...gle.com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Juri Lelli <juri.lelli@...hat.com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Zimuzo Ezeozue <zezeozue@...gle.com>
> Cc: Youssef Esmat <youssefesmat@...gle.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Will Deacon <will@...nel.org>
> Cc: Waiman Long <longman@...hat.com>
> Cc: Boqun Feng <boqun.feng@...il.com>
> Cc: "Paul E. McKenney" <paulmck@...nel.org>
> Cc: Xuewen Yan <xuewen.yan94@...il.com>
> Cc: K Prateek Nayak <kprateek.nayak@....com>
> Cc: Metin Kaya <Metin.Kaya@....com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: kernel-team@...roid.com
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Signed-off-by: Juri Lelli <juri.lelli@...hat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com
> [add additional comments and update more sched_class code to use
>   rq::proxy]
> Signed-off-by: Connor O'Brien <connoro@...gle.com>
> [jstultz: Rebased and resolved minor collisions, reworked to use
>   accessors, tweaked update_curr_common to use rq_proxy fixing rt
>   scheduling issues]
> Signed-off-by: John Stultz <jstultz@...gle.com>
> ---
> v2:
> * Reworked to use accessors
> * Fixed update_curr_common to use proxy instead of curr
> v3:
> * Tweaked wrapper names
> * Swapped proxy for selected for clarity
> v4:
> * Minor variable name tweaks for readability
> * Use a macro instead of an inline function and drop
>    other helper functions as suggested by Peter.
> * Remove verbose comments/questions to avoid review
>    distractions, as suggested by Dietmar
> v5:
> * Add CONFIG_PROXY_EXEC option to this patch so the
>    new logic can be tested with this change
> * Minor fix to grab rq_selected when holding the rq lock
> v7:
> * Minor spelling fix and unused argument fixes suggested by
>    Metin Kaya
> * Switch to curr_selected for consistency, and minor rewording
>    of commit message for clarity
> * Rename variables selected instead of curr when we're using
>    rq_selected()
> * Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
>    as suggested by Metin Kaya
> ---
>   kernel/sched/core.c     | 46 ++++++++++++++++++++++++++---------------
>   kernel/sched/deadline.c | 35 ++++++++++++++++---------------
>   kernel/sched/fair.c     | 18 ++++++++--------
>   kernel/sched/rt.c       | 40 +++++++++++++++++------------------
>   kernel/sched/sched.h    | 35 +++++++++++++++++++++++++++++--
>   5 files changed, 109 insertions(+), 65 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e06558fb08aa..0ce34f5c0e0c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -822,7 +822,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
>   
>   	rq_lock(rq, &rf);
>   	update_rq_clock(rq);
> -	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
> +	rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
>   	rq_unlock(rq, &rf);
>   
>   	return HRTIMER_NORESTART;
> @@ -2242,16 +2242,18 @@ static inline void check_class_changed(struct rq *rq, struct task_struct *p,
>   
>   void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
>   {
> -	if (p->sched_class == rq->curr->sched_class)
> -		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
> -	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
> +	struct task_struct *selected = rq_selected(rq);
> +
> +	if (p->sched_class == selected->sched_class)
> +		selected->sched_class->wakeup_preempt(rq, p, flags);
> +	else if (sched_class_above(p->sched_class, selected->sched_class))
>   		resched_curr(rq);
>   
>   	/*
>   	 * A queue event has occurred, and we're going to schedule.  In
>   	 * this case, we can save a useless back to back clock update.
>   	 */
> -	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
> +	if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
>   		rq_clock_skip_update(rq);
>   }
>   
> @@ -2780,7 +2782,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
>   		lockdep_assert_held(&p->pi_lock);
>   
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   
>   	if (queued) {
>   		/*
> @@ -5600,7 +5602,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
>   	 * project cycles that may never be accounted to this
>   	 * thread, breaking clock_gettime().
>   	 */
> -	if (task_current(rq, p) && task_on_rq_queued(p)) {
> +	if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
>   		prefetch_curr_exec_start(p);
>   		update_rq_clock(rq);
>   		p->sched_class->update_curr(rq);
> @@ -5668,7 +5670,8 @@ void scheduler_tick(void)
>   {
>   	int cpu = smp_processor_id();
>   	struct rq *rq = cpu_rq(cpu);
> -	struct task_struct *curr = rq->curr;
> +	/* accounting goes to the selected task */
> +	struct task_struct *selected;
>   	struct rq_flags rf;
>   	unsigned long thermal_pressure;
>   	u64 resched_latency;
> @@ -5679,16 +5682,17 @@ void scheduler_tick(void)
>   	sched_clock_tick();
>   
>   	rq_lock(rq, &rf);
> +	selected = rq_selected(rq);
>   
>   	update_rq_clock(rq);
>   	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>   	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
> -	curr->sched_class->task_tick(rq, curr, 0);
> +	selected->sched_class->task_tick(rq, selected, 0);
>   	if (sched_feat(LATENCY_WARN))
>   		resched_latency = cpu_resched_latency(rq);
>   	calc_global_load_tick(rq);
>   	sched_core_tick(rq);
> -	task_tick_mm_cid(rq, curr);
> +	task_tick_mm_cid(rq, selected);
>   
>   	rq_unlock(rq, &rf);
>   
> @@ -5697,8 +5701,8 @@ void scheduler_tick(void)
>   
>   	perf_event_task_tick();
>   
> -	if (curr->flags & PF_WQ_WORKER)
> -		wq_worker_tick(curr);
> +	if (selected->flags & PF_WQ_WORKER)
> +		wq_worker_tick(selected);
>   
>   #ifdef CONFIG_SMP
>   	rq->idle_balance = idle_cpu(cpu);
> @@ -5763,6 +5767,12 @@ static void sched_tick_remote(struct work_struct *work)
>   		struct task_struct *curr = rq->curr;
>   
>   		if (cpu_online(cpu)) {
> +			/*
> +			 * Since this is a remote tick for full dynticks mode,
> +			 * we are always sure that there is no proxy (only a
> +			 * single task is running).
> +			 */
> +			SCHED_WARN_ON(rq->curr != rq_selected(rq));
>   			update_rq_clock(rq);
>   
>   			if (!is_idle_task(curr)) {
> @@ -6685,6 +6695,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>   	}
>   
>   	next = pick_next_task(rq, prev, &rf);
> +	rq_set_selected(rq, next);
>   	clear_tsk_need_resched(prev);
>   	clear_preempt_need_resched();
>   #ifdef CONFIG_SCHED_DEBUG
> @@ -7185,7 +7196,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
>   
>   	prev_class = p->sched_class;
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   	if (queued)
>   		dequeue_task(rq, p, queue_flag);
>   	if (running)
> @@ -7275,7 +7286,7 @@ void set_user_nice(struct task_struct *p, long nice)
>   	}
>   
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   	if (queued)
>   		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
>   	if (running)
> @@ -7868,7 +7879,7 @@ static int __sched_setscheduler(struct task_struct *p,
>   	}
>   
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   	if (queued)
>   		dequeue_task(rq, p, queue_flags);
>   	if (running)
> @@ -9295,6 +9306,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
>   	rcu_read_unlock();
>   
>   	rq->idle = idle;
> +	rq_set_selected(rq, idle);
>   	rcu_assign_pointer(rq->curr, idle);
>   	idle->on_rq = TASK_ON_RQ_QUEUED;
>   #ifdef CONFIG_SMP
> @@ -9384,7 +9396,7 @@ void sched_setnuma(struct task_struct *p, int nid)
>   
>   	rq = task_rq_lock(p, &rf);
>   	queued = task_on_rq_queued(p);
> -	running = task_current(rq, p);
> +	running = task_current_selected(rq, p);
>   
>   	if (queued)
>   		dequeue_task(rq, p, DEQUEUE_SAVE);
> @@ -10489,7 +10501,7 @@ void sched_move_task(struct task_struct *tsk)
>   
>   	update_rq_clock(rq);
>   
> -	running = task_current(rq, tsk);
> +	running = task_current_selected(rq, tsk);
>   	queued = task_on_rq_queued(tsk);
>   
>   	if (queued)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 6140f1f51da1..9cf20f4ac5f9 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1150,7 +1150,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
>   #endif
>   
>   	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
> -	if (dl_task(rq->curr))
> +	if (dl_task(rq_selected(rq)))
>   		wakeup_preempt_dl(rq, p, 0);
>   	else
>   		resched_curr(rq);
> @@ -1273,7 +1273,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>    */
>   static void update_curr_dl(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	struct sched_dl_entity *dl_se = &curr->dl;
>   	s64 delta_exec, scaled_delta_exec;
>   	int cpu = cpu_of(rq);
> @@ -1784,7 +1784,7 @@ static int find_later_rq(struct task_struct *task);
>   static int
>   select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>   {
> -	struct task_struct *curr;
> +	struct task_struct *curr, *selected;
>   	bool select_rq;
>   	struct rq *rq;
>   
> @@ -1795,6 +1795,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>   
>   	rcu_read_lock();
>   	curr = READ_ONCE(rq->curr); /* unlocked access */
> +	selected = READ_ONCE(rq_selected(rq));
>   
>   	/*
>   	 * If we are dealing with a -deadline task, we must
> @@ -1805,9 +1806,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
>   	 * other hand, if it has a shorter deadline, we
>   	 * try to make it stay here, it might be important.
>   	 */
> -	select_rq = unlikely(dl_task(curr)) &&
> +	select_rq = unlikely(dl_task(selected)) &&
>   		    (curr->nr_cpus_allowed < 2 ||
> -		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
> +		     !dl_entity_preempt(&p->dl, &selected->dl)) &&
>   		    p->nr_cpus_allowed > 1;
>   
>   	/*
> @@ -1870,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
>   	 * let's hope p can move out.
>   	 */
>   	if (rq->curr->nr_cpus_allowed == 1 ||
> -	    !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
> +	    !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
>   		return;
>   
>   	/*
> @@ -1909,7 +1910,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>   static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
>   				  int flags)
>   {
> -	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
> +	if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
>   		resched_curr(rq);
>   		return;
>   	}
> @@ -1919,7 +1920,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
>   	 * In the unlikely case current and p have the same deadline
>   	 * let us try to decide what's the best thing to do...
>   	 */
> -	if ((p->dl.deadline == rq->curr->dl.deadline) &&
> +	if ((p->dl.deadline == rq_selected(rq)->dl.deadline) &&
>   	    !test_tsk_need_resched(rq->curr))
>   		check_preempt_equal_dl(rq, p);
>   #endif /* CONFIG_SMP */
> @@ -1954,7 +1955,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
>   	if (hrtick_enabled_dl(rq))
>   		start_hrtick_dl(rq, p);
>   
> -	if (rq->curr->sched_class != &dl_sched_class)
> +	if (rq_selected(rq)->sched_class != &dl_sched_class)
>   		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>   
>   	deadline_queue_push_tasks(rq);
> @@ -2268,8 +2269,8 @@ static int push_dl_task(struct rq *rq)
>   	 * can move away, it makes sense to just reschedule
>   	 * without going further in pushing next_task.
>   	 */
> -	if (dl_task(rq->curr) &&
> -	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
> +	if (dl_task(rq_selected(rq)) &&
> +	    dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) &&
>   	    rq->curr->nr_cpus_allowed > 1) {
>   		resched_curr(rq);
>   		return 0;
> @@ -2394,7 +2395,7 @@ static void pull_dl_task(struct rq *this_rq)
>   			 * deadline than the current task of its runqueue.
>   			 */
>   			if (dl_time_before(p->dl.deadline,
> -					   src_rq->curr->dl.deadline))
> +					   rq_selected(src_rq)->dl.deadline))
>   				goto skip;
>   
>   			if (is_migration_disabled(p)) {
> @@ -2435,9 +2436,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
>   	if (!task_on_cpu(rq, p) &&
>   	    !test_tsk_need_resched(rq->curr) &&
>   	    p->nr_cpus_allowed > 1 &&
> -	    dl_task(rq->curr) &&
> +	    dl_task(rq_selected(rq)) &&
>   	    (rq->curr->nr_cpus_allowed < 2 ||
> -	     !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
> +	     !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
>   		push_dl_tasks(rq);
>   	}
>   }
> @@ -2612,12 +2613,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
>   		return;
>   	}
>   
> -	if (rq->curr != p) {
> +	if (rq_selected(rq) != p) {
>   #ifdef CONFIG_SMP
>   		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
>   			deadline_queue_push_tasks(rq);
>   #endif
> -		if (dl_task(rq->curr))
> +		if (dl_task(rq_selected(rq)))
>   			wakeup_preempt_dl(rq, p, 0);
>   		else
>   			resched_curr(rq);
> @@ -2646,7 +2647,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
>   	if (!rq->dl.overloaded)
>   		deadline_queue_pull_task(rq);
>   
> -	if (task_current(rq, p)) {
> +	if (task_current_selected(rq, p)) {
>   		/*
>   		 * If we now have a earlier deadline task than p,
>   		 * then reschedule, provided p is still on this
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1251fd01a555..07216ea3ed53 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1157,7 +1157,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
>    */
>   s64 update_curr_common(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	s64 delta_exec;
>   
>   	delta_exec = update_curr_se(rq, &curr->se);
> @@ -1203,7 +1203,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>   
>   static void update_curr_fair(struct rq *rq)
>   {
> -	update_curr(cfs_rq_of(&rq->curr->se));
> +	update_curr(cfs_rq_of(&rq_selected(rq)->se));
>   }
>   
>   static inline void
> @@ -6611,7 +6611,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>   		s64 delta = slice - ran;
>   
>   		if (delta < 0) {
> -			if (task_current(rq, p))
> +			if (task_current_selected(rq, p))
>   				resched_curr(rq);
>   			return;
>   		}
> @@ -6626,7 +6626,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
>    */
>   static void hrtick_update(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   
>   	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
>   		return;
> @@ -8235,7 +8235,7 @@ static void set_next_buddy(struct sched_entity *se)
>    */
>   static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	struct sched_entity *se = &curr->se, *pse = &p->se;
>   	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>   	int next_buddy_marked = 0;
> @@ -8268,7 +8268,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>   	 * prevents us from potentially nominating it as a false LAST_BUDDY
>   	 * below.
>   	 */
> -	if (test_tsk_need_resched(curr))
> +	if (test_tsk_need_resched(rq->curr))
>   		return;
>   
>   	/* Idle tasks are by definition preempted by non-idle tasks. */
> @@ -9252,7 +9252,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>   	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
>   	 * DL and IRQ signals have been updated before updating CFS.
>   	 */
> -	curr_class = rq->curr->sched_class;
> +	curr_class = rq_selected(rq)->sched_class;
>   
>   	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
>   
> @@ -12640,7 +12640,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
>   	 * our priority decreased, or if we are not currently running on
>   	 * this runqueue and our priority is higher than the current's
>   	 */
> -	if (task_current(rq, p)) {
> +	if (task_current_selected(rq, p)) {
>   		if (p->prio > oldprio)
>   			resched_curr(rq);
>   	} else
> @@ -12743,7 +12743,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
>   		 * kick off the schedule if running, otherwise just see
>   		 * if we can still preempt the current task.
>   		 */
> -		if (task_current(rq, p))
> +		if (task_current_selected(rq, p))
>   			resched_curr(rq);
>   		else
>   			wakeup_preempt(rq, p, 0);
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 9cdea3ea47da..2682cec45aaa 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
>   
>   static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
>   {
> -	struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
> +	struct task_struct *curr = rq_selected(rq_of_rt_rq(rt_rq));
>   	struct rq *rq = rq_of_rt_rq(rt_rq);
>   	struct sched_rt_entity *rt_se;
>   
> @@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
>    */
>   static void update_curr_rt(struct rq *rq)
>   {
> -	struct task_struct *curr = rq->curr;
> +	struct task_struct *curr = rq_selected(rq);
>   	struct sched_rt_entity *rt_se = &curr->rt;
>   	s64 delta_exec;
>   
> @@ -1545,7 +1545,7 @@ static int find_lowest_rq(struct task_struct *task);
>   static int
>   select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   {
> -	struct task_struct *curr;
> +	struct task_struct *curr, *selected;
>   	struct rq *rq;
>   	bool test;
>   
> @@ -1557,6 +1557,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   
>   	rcu_read_lock();
>   	curr = READ_ONCE(rq->curr); /* unlocked access */
> +	selected = READ_ONCE(rq_selected(rq));
>   
>   	/*
>   	 * If the current task on @p's runqueue is an RT task, then
> @@ -1585,8 +1586,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   	 * systems like big.LITTLE.
>   	 */
>   	test = curr &&
> -	       unlikely(rt_task(curr)) &&
> -	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> +	       unlikely(rt_task(selected)) &&
> +	       (curr->nr_cpus_allowed < 2 || selected->prio <= p->prio);
>   
>   	if (test || !rt_task_fits_capacity(p, cpu)) {
>   		int target = find_lowest_rq(p);
> @@ -1616,12 +1617,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
>   
>   static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
>   {
> -	/*
> -	 * Current can't be migrated, useless to reschedule,
> -	 * let's hope p can move out.
> -	 */
>   	if (rq->curr->nr_cpus_allowed == 1 ||
> -	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
> +	    !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
>   		return;
>   
>   	/*
> @@ -1664,7 +1661,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
>    */
>   static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
>   {
> -	if (p->prio < rq->curr->prio) {
> +	struct task_struct *curr = rq_selected(rq);
> +
> +	if (p->prio < curr->prio) {
>   		resched_curr(rq);
>   		return;
>   	}
> @@ -1682,7 +1681,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
>   	 * to move current somewhere else, making room for our non-migratable
>   	 * task.
>   	 */
> -	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
> +	if (p->prio == curr->prio && !test_tsk_need_resched(rq->curr))
>   		check_preempt_equal_prio(rq, p);
>   #endif
>   }
> @@ -1707,7 +1706,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
>   	 * utilization. We only care of the case where we start to schedule a
>   	 * rt task
>   	 */
> -	if (rq->curr->sched_class != &rt_sched_class)
> +	if (rq_selected(rq)->sched_class != &rt_sched_class)
>   		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
>   
>   	rt_queue_push_tasks(rq);
> @@ -1988,6 +1987,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
>   
>   	BUG_ON(rq->cpu != task_cpu(p));
>   	BUG_ON(task_current(rq, p));
> +	BUG_ON(task_current_selected(rq, p));
>   	BUG_ON(p->nr_cpus_allowed <= 1);
>   
>   	BUG_ON(!task_on_rq_queued(p));
> @@ -2020,7 +2020,7 @@ static int push_rt_task(struct rq *rq, bool pull)
>   	 * higher priority than current. If that's the case
>   	 * just reschedule current.
>   	 */
> -	if (unlikely(next_task->prio < rq->curr->prio)) {
> +	if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
>   		resched_curr(rq);
>   		return 0;
>   	}
> @@ -2375,7 +2375,7 @@ static void pull_rt_task(struct rq *this_rq)
>   			 * p if it is lower in priority than the
>   			 * current task on the run queue
>   			 */
> -			if (p->prio < src_rq->curr->prio)
> +			if (p->prio < rq_selected(src_rq)->prio)
>   				goto skip;
>   
>   			if (is_migration_disabled(p)) {
> @@ -2419,9 +2419,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
>   	bool need_to_push = !task_on_cpu(rq, p) &&
>   			    !test_tsk_need_resched(rq->curr) &&
>   			    p->nr_cpus_allowed > 1 &&
> -			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
> +			    (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
>   			    (rq->curr->nr_cpus_allowed < 2 ||
> -			     rq->curr->prio <= p->prio);
> +			     rq_selected(rq)->prio <= p->prio);
>   
>   	if (need_to_push)
>   		push_rt_tasks(rq);
> @@ -2505,7 +2505,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
>   		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
>   			rt_queue_push_tasks(rq);
>   #endif /* CONFIG_SMP */
> -		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
> +		if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
>   			resched_curr(rq);
>   	}
>   }
> @@ -2520,7 +2520,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
>   	if (!task_on_rq_queued(p))
>   		return;
>   
> -	if (task_current(rq, p)) {
> +	if (task_current_selected(rq, p)) {
>   #ifdef CONFIG_SMP
>   		/*
>   		 * If our priority decreases while running, we
> @@ -2546,7 +2546,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
>   		 * greater than the current running task
>   		 * then reschedule.
>   		 */
> -		if (p->prio < rq->curr->prio)
> +		if (p->prio < rq_selected(rq)->prio)
>   			resched_curr(rq);
>   	}
>   }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3e0e4fc8734b..6ea1dfbe502a 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -994,7 +994,10 @@ struct rq {
>   	 */
>   	unsigned int		nr_uninterruptible;
>   
> -	struct task_struct __rcu	*curr;
> +	struct task_struct __rcu	*curr;       /* Execution context */
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +	struct task_struct __rcu	*curr_selected; /* Scheduling context (policy) */
> +#endif
>   	struct task_struct	*idle;
>   	struct task_struct	*stop;
>   	unsigned long		next_balance;
> @@ -1189,6 +1192,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>   #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
>   #define raw_rq()		raw_cpu_ptr(&runqueues)
>   
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +#define rq_selected(rq)		((rq)->curr_selected)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> +	rcu_assign_pointer(rq->curr_selected, t);
> +}
> +#else
> +#define rq_selected(rq)		((rq)->curr)
> +static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
> +{
> +	/* Do nothing */
> +}
> +#endif
> +
>   struct sched_group;
>   #ifdef CONFIG_SCHED_CORE
>   static inline struct cpumask *sched_group_span(struct sched_group *sg);
> @@ -2112,11 +2129,25 @@ static inline u64 global_rt_runtime(void)
>   	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
>   }
>   
> +/*
> + * Is p the current execution context?
> + */
>   static inline int task_current(struct rq *rq, struct task_struct *p)
>   {
>   	return rq->curr == p;
>   }
>   
> +/*
> + * Is p the current scheduling context?
> + *
> + * Note that it might be the current execution context at the same time if
> + * rq->curr == rq_selected() == p.
> + */
> +static inline int task_current_selected(struct rq *rq, struct task_struct *p)
> +{
> +	return rq_selected(rq) == p;
> +}
> +
>   static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
>   {
>   #ifdef CONFIG_SMP
> @@ -2280,7 +2311,7 @@ struct sched_class {
>   
>   static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
>   {
> -	WARN_ON_ONCE(rq->curr != prev);
> +	WARN_ON_ONCE(rq_selected(rq) != prev);
>   	prev->sched_class->put_prev_task(rq, prev);
>   }
>
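
On the open question about keeping the pointer conditional: the following
user-space sketch shows how the accessor pattern from sched.h behaves when
the option is off. All demo_* names and the DEMO_PROXY_EXEC toggle are
hypothetical stand-ins (for CONFIG_SCHED_PROXY_EXEC, rq_selected() and
rq_set_selected()); RCU annotations and locking are deliberately left out,
so this only illustrates the compile-time shape, not the kernel's behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Build with -DDEMO_PROXY_EXEC to model the option being enabled. */

struct demo_task {
	int prio;
};

struct demo_rq {
	struct demo_task *curr;			/* execution context */
#ifdef DEMO_PROXY_EXEC
	struct demo_task *curr_selected;	/* scheduling context */
#endif
};

#ifdef DEMO_PROXY_EXEC
#define demo_rq_selected(rq)	((rq)->curr_selected)
static inline void demo_rq_set_selected(struct demo_rq *rq, struct demo_task *t)
{
	rq->curr_selected = t;
}
#else
/* Without the option, the accessor aliases curr and the setter is a no-op. */
#define demo_rq_selected(rq)	((rq)->curr)
static inline void demo_rq_set_selected(struct demo_rq *rq, struct demo_task *t)
{
	(void)rq;
	(void)t;
}
#endif

/* Set both contexts to the same task, as the patch does in init_idle(). */
static inline void demo_switch_to(struct demo_rq *rq, struct demo_task *next)
{
	assert(next != NULL);
	demo_rq_set_selected(rq, next);
	rq->curr = next;
}
```

With the toggle undefined, struct demo_rq holds a single pointer and
demo_rq_selected(rq) is just another spelling of rq->curr, so callers can
use the accessor unconditionally without paying for an extra field.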