linux-kernel - Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <1554835135-11814-1-git-send-email-jdesfossez@digitalocean.com>
Date:   Tue,  9 Apr 2019 14:38:55 -0400
From:   Julien Desfossez <jdesfossez@...italocean.com>
To:     Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, tim.c.chen@...ux.intel.com,
        torvalds@...ux-foundation.org
Cc:     Julien Desfossez <jdesfossez@...italocean.com>,
        linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Vineeth Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>
Subject: Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

We found the source of the major performance regression we discussed
previously. It turns out there was a pattern where a task (a kworker in this
case) could be woken up, but the core could still end up idle before that
task had a chance to run.

Example sequence, cpu0 and cpu1 and siblings on the same core, task1 and
task2 are in the same cgroup with the tag enabled (each following line
happens in the increasing order of time):
- task1 running on cpu0, task2 running on cpu1
- sched_waking(kworker/0, target_cpu=cpu0)
- task1 scheduled out of cpu0
- kworker/0 cannot run on cpu0 because of task2 is still running on cpu1
  cpu0 is idle
- task2 scheduled out of cpu1
- cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
  the task selection if core_cookie is NULL for currently selected process
  and the cpu1’s runqueue.
- cpu1 is idle
--> both siblings are idle but kworker/0 is still in the run queue of cpu0.
    Cpu0 may stay idle for longer if it goes deep idle.

With the fix below, we ensure to send an IPI to the sibling if it is idle
and has tasks waiting in its runqueue.
This fixes the performance issue we were seeing.

Now here is what we can measure with a disk write-intensive benchmark:
- no performance impact with enabling core scheduling without any tagged
  task,
- 5% overhead if one tagged task is competing with an untagged task,
- 10% overhead if 2 tasks tagged with a different tag are competing
  against each other.

We are starting more scaling tests, but this is very encouraging !


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e1fa10561279..02c862a5e973 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 				trace_printk("unconstrained pick: %s/%d %lx\n",
 						next->comm, next->pid, next->core_cookie);
+				rq->core_pick = NULL;
 
+				/*
+				 * If the sibling is idling, we might want to wake it
+				 * so that it can check for any runnable but blocked tasks 
+				 * due to previous task matching.
+				 */
+				for_each_cpu(j, smt_mask) {
+					struct rq *rq_j = cpu_rq(j);
+					rq_j->core_pick = NULL;
+					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
+						resched_curr(rq_j);
+						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
+							     cpu, j, rq_j->nr_running);
+					}
+				}
 				goto done;
 			}