linux-kernel - Re: [RFC][PATCH 13/16] sched: Add core wide task selection and scheduling.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3514bb97-33bc-3a14-3f15-767682b2f035@oracle.com>
Date:   Wed, 10 Apr 2019 17:11:59 -0700
From:   Subhra Mazumdar <subhra.mazumdar@...cle.com>
To:     Julien Desfossez <jdesfossez@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, tim.c.chen@...ux.intel.com,
        torvalds@...ux-foundation.org
Cc:     linux-kernel@...r.kernel.org, fweisbec@...il.com,
        keescook@...omium.org, kerrnel@...gle.com,
        Vineeth Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>
Subject: Re: [RFC][PATCH 13/16] sched: Add core wide task selection and
 scheduling.


On 4/9/19 11:38 AM, Julien Desfossez wrote:
> We found the source of the major performance regression we discussed
> previously. It turns out there was a pattern where a task (a kworker in this
> case) could be woken up, but the core could still end up idle before that
> task had a chance to run.
>
> Example sequence, cpu0 and cpu1 and siblings on the same core, task1 and
> task2 are in the same cgroup with the tag enabled (each following line
> happens in the increasing order of time):
> - task1 running on cpu0, task2 running on cpu1
> - sched_waking(kworker/0, target_cpu=cpu0)
> - task1 scheduled out of cpu0
> - kworker/0 cannot run on cpu0 because of task2 is still running on cpu1
>    cpu0 is idle
> - task2 scheduled out of cpu1
> - cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends
>    the task selection if core_cookie is NULL for currently selected process
>    and the cpu1’s runqueue.
> - cpu1 is idle
> --> both siblings are idle but kworker/0 is still in the run queue of cpu0.
>      Cpu0 may stay idle for longer if it goes deep idle.
>
> With the fix below, we ensure to send an IPI to the sibling if it is idle
> and has tasks waiting in its runqueue.
> This fixes the performance issue we were seeing.
>
> Now here is what we can measure with a disk write-intensive benchmark:
> - no performance impact with enabling core scheduling without any tagged
>    task,
> - 5% overhead if one tagged task is competing with an untagged task,
> - 10% overhead if 2 tasks tagged with a different tag are competing
>    against each other.
>
> We are starting more scaling tests, but this is very encouraging !
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e1fa10561279..02c862a5e973 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>   
>   				trace_printk("unconstrained pick: %s/%d %lx\n",
>   						next->comm, next->pid, next->core_cookie);
> +				rq->core_pick = NULL;
>   
> +				/*
> +				 * If the sibling is idling, we might want to wake it
> +				 * so that it can check for any runnable but blocked tasks
> +				 * due to previous task matching.
> +				 */
> +				for_each_cpu(j, smt_mask) {
> +					struct rq *rq_j = cpu_rq(j);
> +					rq_j->core_pick = NULL;
> +					if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) {
> +						resched_curr(rq_j);
> +						trace_printk("IPI(%d->%d[%d]) idle preempt\n",
> +							     cpu, j, rq_j->nr_running);
> +					}
> +				}
>   				goto done;
>   			}
>   
I see similar improvement with this patch as removing the condition I
earlier mentioned. So that's not needed. I also included the patch for the
priority fix. For 2 DB instances, HT disabling stands at -22% for 32 users
(from earlier emails).


1 DB instance

users  baseline   %idle    core_sched %idle
16     1          84       -4.9% 84
24     1          76       -6.7% 75
32     1          69       -2.4% 69

2 DB instance

users  baseline   %idle    core_sched %idle
16     1          66       -19.5% 69
24     1          54       -9.8% 57
32     1          42       -27.2%        48