Message-ID: <CADHxFxQ8GqndiKGT2z2aUFU5qQQSU1QxQR1CrHsaa8ShrJ6D+Q@mail.gmail.com>
Date: Thu, 5 Jun 2025 22:50:15 +0800
From: hupu <hupu.gm@...il.com>
To: John Stultz <jstultz@...gle.com>
Cc: peterz@...radead.org, linux-kernel@...r.kernel.org, juri.lelli@...hat.com, 
	vschneid@...hat.com, mingo@...hat.com, vincent.guittot@...aro.org, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, hupu@...nssion.com
Subject: Re: [RFC 1/1] sched: Skip redundant operations for proxy tasks
 needing return migration

Hi John,

>
> > However, there is a potential issue. Whether it is the original code
> > logic or the v2.0 patch mentioned above, directly removing a task from
> > the runqueue in proxy_needs_return undermines the fairness enforced by
> > EEVDF. In fact, clearing the task's `p->se.sched_delayed` variable
> > first (via enqueue_task(ENQUEUE_DELAYED) -> enqueue_task_fair ->
> > requeue_delayed_entity) and then removing task from the runqueue in
> > proxy_needs_return is essentially no different from directly removing
> > the task while retaining its lag. This is because the non-zero lag
> > will eventually be discarded. Therefore, regardless of the method
> > used, the fairness maintained by EEVDF will be compromised.
>
> Hrm. Can you walk me through the specific case you're thinking about here?
>
> Is the idea something like:  a mutex blocked task (not sched_delayed)
> gets migrated to a rq, where it acts as a donor so that a lock holder
> can be run.
> If the lock holder sleeps, it might be set as sched_delayed, but the
> donor will be dequeued from the rq and enqueued onto the sched_delayed
> sleeping owner.
>
> And the concern is that in doing this, the donor's lag from the rq it
> was migrated to won't be preserved (since it isn't set as
> sched_delayed)?
>
> I'll need to think on this a bit, as I don't quite have my head around
> how mutex blocked tasks might also end up sched_delayed.
>

There may have been an issue with my earlier analysis; I apologize
for the oversight.

While debugging earlier, I noticed that the sched_delayed handling in
the ttwu_runnable function sometimes reset the donor's vlag to zero.
This worried me, because directly resetting the vlag might compromise
the fairness ensured by EEVDF.

However, upon closer inspection of the code, I realized that this is
actually a normal adjustment of the vlag: it checks whether the
sched_delayed task has repaid its past "debt" and whether its
position in the red-black tree needs to be updated. The corresponding
call stack is as follows:

ttwu_runnable
  |-- if (p->se.sched_delayed) enqueue_task(flags=ENQUEUE_DELAYED)
      |-- enqueue_task_fair(flags=ENQUEUE_DELAYED)
          |-- requeue_delayed_entity(se)
              |-- update_entity_lag()
                  |-- se->vlag = entity_lag(avg_vruntime(cfs_rq), se);

Therefore, the logic in ttwu_runnable does not compromise the fairness of EEVDF.
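
For reference, the helpers at the bottom of that call stack look
roughly like this in recent mainline kernels (paraphrased from
kernel/sched/fair.c; the exact clamp limit is version-dependent):

/* Paraphrased from kernel/sched/fair.c; details vary by version. */
static s64 entity_lag(u64 avruntime, struct sched_entity *se)
{
        s64 vlag, limit;

        vlag = avruntime - se->vruntime;
        /* Bound the "debt"/"credit" a task may carry to roughly two
         * slices' worth of weighted virtual time. */
        limit = calc_delta_fair(max_t(u64, 2 * se->slice, TICK_NSEC), se);

        return clamp(vlag, -limit, limit);
}

static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        WARN_ON_ONCE(!se->on_rq);
        se->vlag = entity_lag(avg_vruntime(cfs_rq), se);
}

So a vlag of zero simply means the task's vruntime has caught up with
avg_vruntime(), i.e. the past "debt" has been repaid.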


However, the above call stack gives me another idea: executing the
sched_delayed logic seems redundant when a donor task needs a return
migration. The two main pieces of work in the sched_delayed path,
"updating the vlag" and "adjusting the task's position in the
red-black tree", can be deferred to later stages, as shown in the
call stack below:

a) STEP#1: Update the donor's vlag.
b) STEP#2: Update the donor's vruntime and deadline.
c) STEP#3: Update the donor's position in the red-black tree.

try_to_wake_up
  |-- ttwu_runnable
  |   |-- proxy_needs_return
  |       |-- deactivate_task
  |           |-- dequeue_task
  |               |-- dequeue_task_fair
  |                   |-- dequeue_entities
  |                       |-- dequeue_entity
  |                           |-- update_entity_lag         (STEP #1)
  |                               |-- se->vlag = entity_lag()
  |
  |-- new_cpu = select_task_rq  // Re-select a new CPU for the donor task
  |-- ttwu_queue(p, new_cpu)
      |-- ttwu_do_activate
          |-- activate_task(flags=ENQUEUE_WAKEUP)
              |-- __activate_task(flags=ENQUEUE_WAKEUP)
                  |-- enqueue_task(flags=ENQUEUE_WAKEUP)
                      |-- enqueue_task_fair(flags=ENQUEUE_WAKEUP)
                          |-- enqueue_entity(flags=ENQUEUE_WAKEUP)
                              |-- place_entity              (STEP #2)
                              |-- __enqueue_entity          (STEP #3)
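
The reason these three steps are sufficient is that place_entity()
consumes the vlag saved by update_entity_lag() at dequeue time when
it recomputes the vruntime and deadline. A trimmed sketch follows
(paraphrased from kernel/sched/fair.c; the load-based rescaling of
the lag and other details are elided, and the nr_queued field is
named nr_running in older kernels):

/* Trimmed sketch of place_entity(); mainline additionally rescales
 * the lag by queue load, elided here for brevity. */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        u64 vslice, vruntime = avg_vruntime(cfs_rq);
        s64 lag = 0;

        vslice = calc_delta_fair(se->slice, se);

        /* STEP #2: recover the lag recorded at dequeue time. */
        if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag)
                lag = se->vlag;

        /* Placing the entity at avg_vruntime - lag re-applies its
         * earned credit/debt, so fairness survives the
         * dequeue/enqueue pair even across a CPU migration. */
        se->vruntime = vruntime - lag;
        se->deadline = se->vruntime + vslice;
}

Since the donor's vlag survives deactivate_task() and is re-applied
here on the new CPU, skipping the intermediate
requeue_delayed_entity() pass should lose no fairness information.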


In other words, if the donor task needs to be migrated back, the
sched_delayed logic can be skipped entirely, which leads to the final
patch below. It moves the proxy_needs_return() check ahead of the
sched_delayed handling and forcibly clears p->se.sched_delayed.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06e9924d3f77..f585e28b19eb
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4160,6 +4160,18 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
        rq = __task_rq_lock(p, &rf);
        if (task_on_rq_queued(p)) {
                update_rq_clock(rq);
+               if (proxy_needs_return(rq, p)) {
+                       /*
+                        * Force clear, as the work of updating the donor's vlag
+                        * and its position in the red-black tree can be deferred
+                        * to later stages.
+                        */
+                       p->se.sched_delayed = 0;
+
+                       _trace_sched_pe_return_migration(p);
+                       goto out;
+               }
+
                if (p->se.sched_delayed) {
                        proxy_remove_from_sleeping_owner(p);
                        enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
@@ -4171,10 +4183,6 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
                         */
                        wakeup_preempt(rq, p, wake_flags);
                }
-               if (proxy_needs_return(rq, p)) {
-                       _trace_sched_pe_return_migration(p);
-                       goto out;
-               }
                ttwu_do_wakeup(p);
                ret = 1;
        }


A kernel with this patch boots successfully. Furthermore, I exercised
the mutex-related test cases by running the following command, and it
has been going for approximately 20 hours without any issue:
# while true; do echo 1 > /sys/kernel/test_ww_mutex/run_tests; sleep 10; done


Looking forward to your reply. Thanks.
hupu
