Date:   Tue, 1 Sep 2020 08:34:23 -0400
From:   Vineeth Pillai <viremana@...ux.microsoft.com>
To:     Joel Fernandes <joel@...lfernandes.org>, peterz@...radead.org
Cc:     Julien Desfossez <jdesfossez@...italocean.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Dhaval Giani <dhaval.giani@...cle.com>,
        Chris Hyser <chris.hyser@...cle.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        mingo@...nel.org, tglx@...utronix.de, pjt@...gle.com,
        torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>, vineeth@...byteword.org,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Agata Gruza <agata.gruza@...el.com>,
        Antonio Gomez Iglesias <antonio.gomez.iglesias@...el.com>,
        graf@...zon.com, konrad.wilk@...cle.com, dfaggioli@...e.com,
        rostedt@...dmis.org, derkling@...gle.com, benbjiang@...cent.com,
        Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and
 scheduling.

Hi Joel,

On 9/1/20 1:10 AM, Joel Fernandes wrote:
> 3. The 'Rescheduling siblings' loop of pick_next_task() is quite fragile. It
> calls various functions on rq->core_pick which could very well be NULL because:
> An online sibling might have gone offline before a task could be picked for it,
> or it might be offline but later happen to come online, but its too late and
> nothing was picked for it. Just ignore the siblings for which nothing could be
> picked. This avoids any crashes that may occur in this loop that assume
> rq->core_pick is not NULL.
>
> Signed-off-by: Joel Fernandes (Google) <joel@...lfernandes.org>
I like this idea, it's much simpler :-)

> ---
>   kernel/sched/core.c | 24 +++++++++++++++++++++---
>   1 file changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 717122a3dca1..4966e9f14f39 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4610,13 +4610,24 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>   	if (!sched_core_enabled(rq))
>   		return __pick_next_task(rq, prev, rf);
>   
> +	cpu = cpu_of(rq);
> +
> +	/* Stopper task is switching into idle, no need core-wide selection. */
I think we can also get here when the hotplug thread is scheduled during
onlining, before the mask is updated. We could probably mention that in this
comment as well (see the sketch further below).

> +	if (cpu_is_offline(cpu))
> +		return __pick_next_task(rq, prev, rf);
> +
We would need to reset core_pick here, I think. Something like:
     if (cpu_is_offline(cpu)) {
         rq->core_pick = NULL;
         return __pick_next_task(rq, prev, rf);
     }

Without this we can end up in a crash like this:
1. The sibling of this cpu picks a task (rq_i->core_pick) and this cpu goes
     offline soon after.
2. While this cpu is still offline, the sibling goes through another pick
     loop; this cpu comes back online before the sibling's IPI loop, so we
     get an IPI.
3. When this cpu then gets into schedule(), core_pick is set and
     core_pick_seq != core_sched_seq, so we enter the fast path. But
     core_pick may no longer be valid for this runqueue.

So, to protect against this, we should reset core_pick, I think. I have seen
this crash occasionally.
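
Putting the extra comment and the reset together, the offline early-exit
could look roughly like this (just an untested sketch of what I mean; the
comment wording is mine):

     cpu = cpu_of(rq);

     /*
      * Stopper task switching into idle, or the hotplug thread running
      * during onlining before the cpu mask is updated: no need for
      * core-wide selection. Also clear any stale pick so a later
      * schedule() does not take the fast path with it.
      */
     if (cpu_is_offline(cpu)) {
         rq->core_pick = NULL;
         return __pick_next_task(rq, prev, rf);
     }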

>   	/*
>   	 * If there were no {en,de}queues since we picked (IOW, the task
>   	 * pointers are all still valid), and we haven't scheduled the last
>   	 * pick yet, do so now.
> +	 *
> +	 * rq->core_pick can be NULL if no selection was made for a CPU because
> +	 * it was either offline or went offline during a sibling's core-wide
> +	 * selection. In this case, do a core-wide selection.
>   	 */
>   	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> -	    rq->core->core_pick_seq != rq->core_sched_seq) {
> +	    rq->core->core_pick_seq != rq->core_sched_seq &&
> +	    !rq->core_pick) {
Should this check be reversed? I mean, we should enter the fast path only
if rq->core_pick is set, right?
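
In other words, something like this (untested, just to show what I mean):

     if (rq->core->core_pick_seq == rq->core->core_task_seq &&
         rq->core->core_pick_seq != rq->core_sched_seq &&
         rq->core_pick) {
         /* fast path: use rq->core_pick */
     }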


Another unrelated, but related note :-)
Besides this, I think we need to retain one more change from the previous
patch: core_pick_seq should be per sibling instead of per core. Having it
per core might lead to unfairness. For example: when a cpu sees that its
sibling's core_pick is the task already running there, it will not send an
IPI, but core_pick remains set and core->core_pick_seq is incremented. Now
if the sibling is preempted by a high priority task, or its time slice
expires, it enters schedule(). It then takes the fast path and selects the
already-running task again, thereby starving the high priority task. Making
core_pick_seq per sibling would avoid this, and it might also help in some
hotplug corner cases (rough sketch below).
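
Roughly, I mean something along these lines (untested, field placement and
names just illustrative): the picker publishes the sequence into each
sibling's rq only when that sibling actually has a new pick, and the fast
path keys off the per-rq counter instead of the core-wide one:

     /* Picker: publish only for siblings that actually need to switch. */
     rq_i->core_pick = p;
     rq_i->core_pick_seq = rq->core->core_task_seq;

     /* Fast path in pick_next_task() on the sibling: */
     if (rq->core_pick_seq == rq->core->core_task_seq &&
         rq->core_pick_seq != rq->core_sched_seq &&
         rq->core_pick) {
         /* use rq->core_pick */
     }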

Thanks,
Vineeth
