Message-ID: <CANDhNCq+xE4dpecHio2x6TJXMXxhQjrDk1oCon=NR2b+k0Y9yQ@mail.gmail.com>
Date: Tue, 27 May 2025 23:24:36 -0700
From: John Stultz <jstultz@...gle.com>
To: hupu <hupu.gm@...il.com>
Cc: peterz@...radead.org, linux-kernel@...r.kernel.org, juri.lelli@...hat.com,
vschneid@...hat.com, mingo@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, hupu@...nssion.com
Subject: Re: [RFC 1/1] sched: Skip redundant operations for proxy tasks
needing return migration
On Sun, May 18, 2025 at 7:51 AM hupu <hupu.gm@...il.com> wrote:
>
> Hi John,
> Sorry for the delayed response.
>
> > Usually I boot qemu with 64 cores, and have found stress testing
> > running the following separate programs frequently uncovers issues:
> >
> > #cyclictest from the rt-test suite to add some rt load
> > cyclictest -t -p90
> >
> > # rerunning the test_ww_mutex logic (enabled by a patch in my full
> > proxy-exec series)
> > while true; do echo 1 > /sys/kernel/test_ww_mutex/run_tests ; sleep 5; done
> >
> > # Along with my prio-inversion-demo (which ensures we do the proxying paths)
> > # https://github.com/johnstultz-work/priority-inversion-demo
> > while true; do ./run.sh; sleep 10 ; done
> >
> > # Sometimes I also throw in the ltp sched_football test:
> > # https://github.com/linux-test-project/ltp/tree/master/testcases/realtime/func/sched_football
> > while true; do ./sched_football -m | grep Result ; sleep 1; done
> >
>
>
> I followed your suggested approach and successfully reproduced the
> issue you mentioned. The system experienced a hang during bootup and
> triggered a crash due to a watchdog timeout. This issue is related to
> the logic of sched_delayed. I will provide a detailed description of
> the debugging steps below. I apologize in advance if the explanation
> becomes somewhat lengthy.
>
Yeah. To be honest, this is a little tough to follow, particularly
trying to distinguish between the debug changes and the actual change
you are proposing, especially as the patches are whitespace-damaged.
[snip]
> The above patch was solely intended to verify the correctness of the
> analysis; directly clearing `p->se.sched_delayed` is unreasonable,
> since it would cause tasks with non-zero lag to be dequeued and
> thereby undermine the fairness enforced by EEVDF. Moving the
> proxy_needs_return check after the sched_delayed handling looks like
> the better approach: for a donor that is about to be dequeued, the
> subsequent wakeup_preempt operation is redundant. I therefore propose
> the following patch as version 2.0.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> old mode 100644
> new mode 100755
> index 06e9924d3f77..f804c681b861
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> #ifdef CONFIG_SCHED_PROXY_EXEC
> @@ -4164,6 +4172,10 @@ static int ttwu_runnable(struct task_struct *p,
> int wake_flags)
> proxy_remove_from_sleeping_owner(p);
> enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> }
> + if (proxy_needs_return(rq, p)) {
> + _trace_sched_pe_return_migration(p);
> + goto out;
> + }
> if (!task_on_cpu(rq, p)) {
> /*
> * When on_rq && !on_cpu the task is preempted, see if
> @@ -4171,10 +4183,6 @@ static int ttwu_runnable(struct task_struct *p,
> int wake_flags)
> */
> wakeup_preempt(rq, p, wake_flags);
> }
> - if (proxy_needs_return(rq, p)) {
> - _trace_sched_pe_return_migration(p);
> - goto out;
> - }
> ttwu_do_wakeup(p);
> ret = 1;
> }
Ok, so this matches what I had tested before modifying your earlier patch.
> However, there is a potential issue. Whether in the original code or
> in the v2.0 patch above, removing a task from the runqueue directly
> in proxy_needs_return undermines the fairness enforced by EEVDF.
> Clearing the task's `p->se.sched_delayed` flag first (via
> enqueue_task(ENQUEUE_DELAYED) -> enqueue_task_fair ->
> requeue_delayed_entity) and then removing the task from the runqueue
> in proxy_needs_return is essentially no different from removing the
> task directly while it still holds non-zero lag: the non-zero lag is
> eventually discarded either way. So regardless of the method used,
> the fairness maintained by EEVDF is compromised.
Hrm. Can you walk me through the specific case you're thinking about here?
Is the idea something like: a mutex blocked task (not sched_delayed)
gets migrated to a rq, where it acts as a donor so that a lock holder
can be run.
If the lock holder sleeps, it might be set as sched_delayed, but the
donor will be dequeued from the rq and enqueued onto the sched_delayed
sleeping owner.
And the concern is that in doing this, the donor's lag from the rq it
was migrated to won't be preserved (since it isn't set as
sched_delayed)?
I'll need to think on this a bit, as I don't quite have my head around
how mutex blocked tasks might also end up sched_delayed.
> I have not yet determined the best approach to address the issue of
> EEVDF fairness being compromised. Perhaps we could preserve the lag
> during the dequeue process and handle it during the next enqueue. I
> look forward to further discussing this issue with you.
>
> Finally, I have decided to adopt the following patch as version 2.0.
> This patch is specifically designed to avoid unnecessary
> wakeup_preempt checks. Regarding the fairness issue discussed earlier,
> I plan to submit a new patch to address it.
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> old mode 100644
> new mode 100755
> index 06e9924d3f77..f804c681b861
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> #ifdef CONFIG_SCHED_PROXY_EXEC
> @@ -4164,6 +4172,10 @@ static int ttwu_runnable(struct task_struct *p,
> int wake_flags)
> proxy_remove_from_sleeping_owner(p);
> enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> }
> + if (proxy_needs_return(rq, p)) {
> + _trace_sched_pe_return_migration(p);
> + goto out;
> + }
> if (!task_on_cpu(rq, p)) {
> /*
> * When on_rq && !on_cpu the task is preempted, see if
> @@ -4171,10 +4183,6 @@ static int ttwu_runnable(struct task_struct *p,
> int wake_flags)
> */
> wakeup_preempt(rq, p, wake_flags);
> }
> - if (proxy_needs_return(rq, p)) {
> - _trace_sched_pe_return_migration(p);
> - goto out;
> - }
> ttwu_do_wakeup(p);
> ret = 1;
> }
This looks identical to the version above, or am I missing something?
thanks
-john