Message-ID: <Ybvcu5RIwV+Vko09@google.com>
Date:   Thu, 16 Dec 2021 19:41:31 -0500
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Josh Don <joshdon@...gle.com>,
        Vineeth Pillai <vineethrp@...gle.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        "Chen, Tim C" <tim.c.chen@...el.com>,
        "Brown, Len" <len.brown@...el.com>,
        LKML <linux-kernel@...r.kernel.org>,
        "AubreyLi@...gle.com" <aubrey.intel@...il.com>,
        aubrey.li@...ux.intel.com, Aaron Lu <aaron.lwe@...il.com>,
        "Hyser,Chris" <chris.hyser@...cle.com>,
        Don Hiatt <dhiatt@...italocean.com>, ricardo.neri@...el.com,
        vincent.guittot@...aro.org
Cc:     joelaf@...gle.com
Subject: [RFC] High latency with core scheduling

Hello,
On ChromeOS, we see really high scheduling latency when a heavy
workload runs both outside and inside a cgroup. The load inside the
cgroup is tagged for core scheduling and happens to be vCPU threads.
Because of this, various folks are complaining.

One of the issues we see is that the core rbtree is static when nothing in
the tree goes to sleep or wakes up. This can cause the same task in the core
rbtree to be repeatedly picked in pick_task().
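
For context, here is roughly what the cookie lookup called from
pick_task() looks like in the core scheduling series (paraphrased, not
verbatim; details differ between postings). On an equal cookie the
walk keeps descending left, so if the tree never changes, every pick
returns the same leftmost task:

/*
 * Rough sketch of sched_core_find() from the core scheduling series
 * (not verbatim). The core rbtree is keyed by cookie; on an equal
 * cookie we keep going left, so the walk always returns the leftmost
 * matching task. If nothing sleeps or wakes, the tree is static and
 * the same task is returned on every pick.
 */
static struct task_struct *sched_core_find(struct rq *rq, unsigned long cookie)
{
	struct rb_node *node = rq->core_tree.rb_node;
	struct task_struct *node_task, *match;

	/* The idle task matches any cookie. */
	match = idle_sched_class.pick_task(rq);

	while (node) {
		node_task = container_of(node, struct task_struct, core_node);

		if (cookie < node_task->core_cookie) {
			node = node->rb_left;
		} else if (cookie > node_task->core_cookie) {
			node = node->rb_right;
		} else {
			match = node_task;
			node = node->rb_left;
		}
	}

	return match;
}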

The below diff seems to improve the situation; could you please take a
look? If it seems sane, we can go ahead and make it a formal patch to
at least fix one of the known issues.

The patch is simple: just remove the currently running task from the
core rbtree, as its vruntime is not really static. Add it back on
preemption.
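
To see why it is not static: CFS advances the running task's vruntime
on every tick, so a key computed at enqueue time goes stale while the
task runs. Simplified from update_curr() in kernel/sched/fair.c (not
verbatim; stats and group accounting trimmed):

/*
 * Simplified from update_curr() in kernel/sched/fair.c. The running
 * task's vruntime keeps advancing, so any rbtree position keyed on
 * it at enqueue time goes stale.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;
	curr->vruntime += calc_delta_fair(delta_exec, curr);
	update_min_vruntime(cfs_rq);
}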

note: This is against a 5.4 kernel, but the code is about the same and it's an RFC.
note: The issue does not seem to happen without cgroups involved, so
      perhaps something is still wonky in cfs_prio_less(). Peter?
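
For reference, the rough shape of cfs_prio_less() in the series is
below (simplified, not verbatim; the real version also snapshots
min_vruntime for force-idle accounting). The cgroup-specific part is
the hierarchy walk that finds sibling entities under a common
task_group before comparing vruntimes, which is where I would expect
a cgroup-related bug to hide:

/*
 * Rough sketch of cfs_prio_less() from the core scheduling series
 * (simplified, not verbatim). vruntimes of tasks on different
 * cfs_rq's are not directly comparable, so walk both tasks up the
 * cgroup hierarchy to sibling entities in the same task_group first,
 * then compare vruntimes normalized by each cfs_rq's min_vruntime.
 */
static bool cfs_prio_less(struct task_struct *a, struct task_struct *b)
{
	struct sched_entity *sea = &a->se;
	struct sched_entity *seb = &b->se;
	struct cfs_rq *cfs_rqa, *cfs_rqb;
	s64 delta;

	/* Find immediate sibling entities under a common task_group. */
	while (sea->cfs_rq->tg != seb->cfs_rq->tg) {
		int sea_depth = sea->depth;
		int seb_depth = seb->depth;

		if (sea_depth >= seb_depth)
			sea = parent_entity(sea);
		if (sea_depth <= seb_depth)
			seb = parent_entity(seb);
	}

	cfs_rqa = sea->cfs_rq;
	cfs_rqb = seb->cfs_rq;

	delta = (s64)(sea->vruntime - seb->vruntime) +
		(s64)(cfs_rqb->min_vruntime - cfs_rqa->min_vruntime);

	return delta > 0;
}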

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c023a9a0c4ae..3c588ad05ab6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -200,7 +200,7 @@ static inline void dump_scrb(struct rb_node *root, int lvl, char *buf, int size)
 	dump_scrb(root->rb_right, lvl+1, buf, size);
 }
 
-static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
+void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct rb_node *parent, **node;
 	struct task_struct *node_task;
@@ -212,6 +212,9 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	if (!p->core_cookie)
 		return;
 
+	if (sched_core_enqueued(p))
+		return;
+
 	node = &rq->core_tree.rb_node;
 	parent = *node;
 
@@ -232,7 +235,7 @@ static void sched_core_enqueue(struct rq *rq, struct task_struct *p)
 	rb_insert_color(&p->core_node, &rq->core_tree);
 }
 
-static void sched_core_dequeue(struct rq *rq, struct task_struct *p)
+void sched_core_dequeue(struct rq *rq, struct task_struct *p)
 {
 	rq->core->core_task_seq++;
 
@@ -4745,6 +4748,18 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma
 		return class_pick;
 
 	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * The currently running task might not be in the core rbtree if it
+	 * is fair class, since we now dequeue it while it runs. If it has
+	 * the same cookie as cookie_pick and higher priority, select it.
+	 */
+	if (rq != this_rq() && !is_task_rq_idle(cookie_pick) && !is_task_rq_idle(rq->curr) &&
+		cookie_pick->core_cookie == rq->curr->core_cookie &&
+		prio_less(cookie_pick, rq->curr, in_fi)) {
+		cookie_pick = rq->curr;
+	}
+
 	/*
 	 * If class > max && class > cookie, it is the highest priority task on
 	 * the core (so far) and it must be selected, otherwise we must go with
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86cc67dd38e9..820c5cf4ecc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1936,15 +1936,33 @@ struct sched_class {
 #endif
 };
 
+void sched_core_enqueue(struct rq *rq, struct task_struct *p);
+void sched_core_dequeue(struct rq *rq, struct task_struct *p);
+
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
+#ifdef CONFIG_SCHED_CORE
+	if (sched_core_enabled(rq) && READ_ONCE(prev->state) != TASK_DEAD && prev->core_cookie && prev->on_rq) {
+		sched_core_enqueue(rq, prev);
+	}
+#endif
 }
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
 	next->sched_class->set_next_task(rq, next, false);
+#ifdef CONFIG_SCHED_CORE
+	/*
+	 * This task is going to run next and its vruntime will change.
+	 * Remove it from the core rbtree so that its changing vruntime
+	 * does not confuse the ordering in the tree.
+	 */
+	if (sched_core_enabled(rq) && next->core_cookie && next->on_rq) {
+		sched_core_dequeue(rq, next);
+	}
+#endif
 }
 
 #ifdef CONFIG_SMP
