Message-ID: <CANaguZDOb+rVcDPMS+SR1DKc73fnctkBK0EbfBrf90dztr8t=Q@mail.gmail.com>
Date:   Wed, 2 Oct 2019 16:48:14 -0400
From:   Vineeth Remanan Pillai <vpillai@...italocean.com>
To:     Aaron Lu <aaron.lu@...ux.alibaba.com>
Cc:     Tim Chen <tim.c.chen@...ux.intel.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Dario Faggioli <dfaggioli@...e.com>,
        "Li, Aubrey" <aubrey.li@...ux.intel.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Frédéric Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On Mon, Sep 30, 2019 at 7:53 AM Vineeth Remanan Pillai
<vpillai@...italocean.com> wrote:
>
> >
> Sorry, I misunderstood the fix and did not initially see the core-wide
> min_vruntime that you maintain in rq->core. This approach seems
> reasonable. I think we can fix the potential starvation that you
> mentioned in the comment by adjusting for the difference in all the
> child cfs_rqs when we set the min_vruntime in rq->core. Since we take
> the lock for both queues, it should be doable, and I am trying to see
> how we can best do that.
>
Attaching herewith the two patches I was working on in preparation for v4.

Patch 1 is an improvement on Aaron's patch 2, where I propagate the
vruntime changes to the whole tree.
Patch 2 is an improvement on Aaron's patch 3, where we call resched_curr
only when the sibling is forced idle.

Micro benchmarks look good. I will be doing a larger set of tests and
hopefully posting v4 by the end of the week. Please let me know what you
think of these patches (patch 1 goes on top of Aaron's patch 2; patch 2
replaces Aaron's patch 3).

Thanks,
Vineeth

[PATCH 1/2] sched/fair: propagate the min_vruntime change to the whole rq tree

When we adjust the min_vruntime of rq->core, we need to propagate
that change down the tree so as not to cause starvation of existing
tasks based on their previous vruntime.
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59cb01a1563b..e8dd78a8c54d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -476,6 +476,23 @@ static inline u64 cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
                return cfs_rq->min_vruntime;
 }

+static void coresched_adjust_vruntime(struct cfs_rq *cfs_rq, u64 delta)
+{
+       struct sched_entity *se, *next;
+
+       if (!cfs_rq)
+               return;
+
+       cfs_rq->min_vruntime -= delta;
+       rbtree_postorder_for_each_entry_safe(se, next,
+                       &cfs_rq->tasks_timeline.rb_root, run_node) {
+               if (se->vruntime > delta)
+                       se->vruntime -= delta;
+               if (se->my_q)
+                       coresched_adjust_vruntime(se->my_q, delta);
+       }
+}
+
 static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
 {
        struct cfs_rq *cfs_rq_core;
@@ -487,8 +504,11 @@ static void update_core_cfs_rq_min_vruntime(struct cfs_rq *cfs_rq)
                return;

        cfs_rq_core = core_cfs_rq(cfs_rq);
-       cfs_rq_core->min_vruntime = max(cfs_rq_core->min_vruntime,
-                                       cfs_rq->min_vruntime);
+       if (cfs_rq_core != cfs_rq &&
+           cfs_rq->min_vruntime < cfs_rq_core->min_vruntime) {
+               u64 delta = cfs_rq_core->min_vruntime - cfs_rq->min_vruntime;
+               coresched_adjust_vruntime(cfs_rq_core, delta);
+       }
 }

 bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
--
2.17.1
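
(For readers new to the vruntime bookkeeping, here is a minimal user-space
sketch, not part of the patch and with made-up numbers, of the arithmetic
coresched_adjust_vruntime() performs: when a sibling's cfs_rq min_vruntime
has fallen behind the core-wide value, the core rq's tree is shifted down
by the difference, preserving the relative order of its entities while
bringing the core-wide min_vruntime back to the sibling's level so the
sibling's tasks do not starve.)

#include <stdio.h>

typedef unsigned long long u64;

int main(void)
{
	/* Hypothetical values: the core rq's cfs_rq is at 1000, a sibling's
	 * cfs_rq is at 400, so delta = 600 as computed in the patch. */
	u64 core_min = 1000, sibling_min = 400;
	u64 delta = core_min - sibling_min;
	/* Entities queued on the core rq's tree. */
	u64 vruntime[] = { 1050, 1300, 1550 };

	core_min -= delta;
	for (int i = 0; i < 3; i++)
		if (vruntime[i] > delta)	/* same underflow guard as the patch */
			vruntime[i] -= delta;

	/* Prints: min_vruntime=400 entities: 450 700 950 */
	printf("min_vruntime=%llu entities: %llu %llu %llu\n",
	       core_min, vruntime[0], vruntime[1], vruntime[2]);
	return 0;
}

(The recursion into se->my_q in the real function applies the same shift to
every group cfs_rq below the core rq, which is what the commit message means
by propagating the change to the whole tree.)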

[PATCH 2/2] sched/fair: Wake up forced idle siblings if needed

If a CPU has only one runnable task and that task has used up its
timeslice, then we should try to wake up the sibling to give the forced
idle thread a chance.
We do that by triggering a reschedule, which will IPI the sibling if
the task on the sibling wins the priority check.
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8dd78a8c54d..ba4d929abae6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4165,6 +4165,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
                update_min_vruntime(cfs_rq);
 }

+static inline bool
+__entity_slice_used(struct sched_entity *se)
+{
+       return (se->sum_exec_runtime - se->prev_sum_exec_runtime) >
+               sched_slice(cfs_rq_of(se), se);
+}
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -10052,6 +10059,39 @@ static void rq_offline_fair(struct rq *rq)

 #endif /* CONFIG_SMP */

+#ifdef CONFIG_SCHED_CORE
+/*
+ * If runqueue has only one task which used up its slice and
+ * if the sibling is forced idle, then trigger schedule
+ * to give forced idle task a chance.
+ */
+static void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+       int cpu = cpu_of(rq), sibling_cpu;
+       if (rq->cfs.nr_running > 1 || !__entity_slice_used(se))
+               return;
+
+       for_each_cpu(sibling_cpu, cpu_smt_mask(cpu)) {
+               struct rq *sibling_rq;
+               if (sibling_cpu == cpu)
+                       continue;
+               if (cpu_is_offline(sibling_cpu))
+                       continue;
+
+               sibling_rq = cpu_rq(sibling_cpu);
+               if (sibling_rq->core_forceidle) {
+                       resched_curr(rq);
+                       break;
+               }
+       }
+}
+#else
+static inline void resched_forceidle(struct rq *rq, struct sched_entity *se)
+{
+}
+#endif
+
+
 /*
  * scheduler tick hitting a task of our scheduling class.
  *
@@ -10075,6 +10115,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)

        update_misfit_status(curr, rq);
        update_overutilized_status(task_rq(curr));
+
+       if (sched_core_enabled(rq))
+               resched_forceidle(rq, &curr->se);
 }

 /*
--
2.17.1
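
(Similarly, a minimal user-space sketch, not part of the patch, of the
condition checked by __entity_slice_used() and acted on in
resched_forceidle(): once the runtime the current task has accumulated
since it was last picked exceeds its slice, and an SMT sibling is sitting
forced idle, we resched the current CPU so the sibling gets a chance to
run. The 4ms slice and the sample runtimes below are assumptions for
illustration, not what sched_slice() would actually return.)

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long u64;

/* Mirrors __entity_slice_used(): true once the runtime accumulated
 * since the entity was last scheduled in exceeds its slice. */
static bool entity_slice_used(u64 sum_exec, u64 prev_sum_exec, u64 slice)
{
	return (sum_exec - prev_sum_exec) > slice;
}

int main(void)
{
	u64 slice = 4000000ULL;			/* assumed 4ms slice, in ns   */
	u64 prev_sum_exec = 100000000ULL;	/* runtime when last picked   */
	u64 sum_exec = 105000000ULL;		/* current sum_exec_runtime   */
	bool sibling_forced_idle = true;	/* assumed SMT sibling state  */
	int nr_running = 1;			/* only one task on this rq   */

	if (nr_running == 1 &&
	    entity_slice_used(sum_exec, prev_sum_exec, slice) &&
	    sibling_forced_idle)
		printf("would resched_curr() so the forced-idle sibling can run\n");
	return 0;
}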
