linux-kernel - Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle

Message-ID: <ZQR1tovjkTZ0VcP6@BLR-5CG11610CF.amd.com>
Date:   Fri, 15 Sep 2023 20:48:14 +0530
From:   "Gautham R. Shenoy" <gautham.shenoy@....com>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     Aaron Lu <aaron.lu@...el.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in
 select_idle_cpu()

Hello Chen Yu,

On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote:
[..snip..]

> > 
> > So despite "reserving" the CPU for p1, which is likely to have its
> > data still hot in the cache, we would have scheduled p1', thus
> > defeating the whole purpose of reservation.
> > 
> 
> I see. So you mean, although we reserve the CPU for the wakee,
> the wakee might not choose its previous CPU, which is against our
> goal.


Yes, but only because some other task could have run on the previous
CPU. That other task could be something that was woken up on that CPU
due to:

1) wake-affine choosing that CPU 
2) newidle-balance pulling the other task onto that CPU
3) !wake-affine && that CPU was also the other task's previous CPU

It could also be due to this wakee task being woken up on the waker
CPU due to wake-affine.

> 
> The reason to prevent the wakee choosing its previous CPU could be:
> 1. wake_affine() chooses the waker's CPU rather than the wakee's previous CPU, or
> 2. the wakee's CPU has already been taken by someone else, via newidle_balance().
>


> For 1, I think Prateek has expressed the concern. One mitigation could be
> to penalize the wakee if it decides not to choose its previous CPU:

We would be penalizing the task for something that the scheduler
decides :-)

As you point out below, in the presence of the WF_SYNC flag,
wake_affine_idle() prefers the waker CPU over the previous CPU when
they are on different LLCs and when the waker is the only task.

This strategy makes sense for two reasons:

1) The wakee may be consuming the data produced by the waker.
2) Since the wakeup will happen on the local CPU, there is no risk of
   task-stacking, exactly what your SIS_CURRENT patchset was
   attempting.
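
To make the ordering concrete, here is a rough userspace sketch of the
decision as I understand it. This is not kernel code: struct cpu_model,
cpu_is_idle(), NO_CPU and wake_affine_idle_model() are made-up names for
illustration, and the order of the checks is my reading of
wake_affine_idle(), so treat it as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a CPU; not a kernel structure. */
struct cpu_model {
	int id;
	int llc_id;	/* CPUs with the same llc_id share a last-level cache */
	int nr_running;	/* runnable tasks, counting the waker if it runs here */
};

#define NO_CPU (-1)	/* stands in for "no preference from this check" */

static bool cpu_is_idle(const struct cpu_model *c)
{
	return c->nr_running == 0;
}

/* Sketch of the sync-wakeup preference discussed above (assumed ordering). */
static int wake_affine_idle_model(const struct cpu_model *this_cpu,
				  const struct cpu_model *prev_cpu, bool sync)
{
	assert(this_cpu && prev_cpu);

	/* Idle waker CPU sharing a cache with prev: stay on prev if it is idle. */
	if (cpu_is_idle(this_cpu) && this_cpu->llc_id == prev_cpu->llc_id)
		return cpu_is_idle(prev_cpu) ? prev_cpu->id : this_cpu->id;

	/* Sync wakeup and the waker is the only task: wake up locally. */
	if (sync && this_cpu->nr_running == 1)
		return this_cpu->id;

	/* Otherwise an idle previous CPU is the next best choice. */
	if (cpu_is_idle(prev_cpu))
		return prev_cpu->id;

	return NO_CPU;
}
```

Note that in this model the sync check comes before the idle-prev check,
so an idle previous CPU on a different LLC loses to the waker CPU
whenever the waker is alone on its runqueue.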

But this strategy would also result in increased task migration, which
both Mathieu and you have found is not so beneficial for workloads
such as hackbench. Is it only because the task's data is still hot in
the previous CPU's cache? Or is there more to it?


It would be good to confirm if this is why lower migration is better
for these kinds of workloads.

> 
> "
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> 	p->burst_sleep_avg >>= 1;
> So the duration of reservation could be shrunk.
> "
> 
> For 2, maybe inhibit the newidle balance, something like this comes to mind:
> 
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	u64 t0, t1, curr_cost = 0;
>  	struct sched_domain *sd;
>  	int pulled_task = 0;
> +	bool cache_hot = false;
>  
>  	update_misfit_status(NULL, this_rq);
>  
> @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	rcu_read_lock();
>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
> +	if (sched_feat(SIS_CACHE)) {
> +		s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
> +
> +		/*
> +		 * If a short time later, a short sleeping task will be woken up
> +		 * on this idle CPU, do not launch the newidle balance.
> +		 */
> +		if (delta > 0 && delta < this_rq->max_idle_balance_cost)
> +			cache_hot = true;
> +	}
> +
>  	if (!READ_ONCE(this_rq->rd->overload) ||
> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> +	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {

>  
>  		if (sd)
>  			update_next_balance(sd, &next_balance);

If the benefit that the workload obtains is really due to the data
being hot near its previous CPU, then this seems a sensible thing to
do.
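
For reference, the condition your hunk adds boils down to a small
predicate. Below is a userspace sketch of it, under the assumption that
cache_hot_timeout is an absolute timestamp in the same units as the
clock. struct rq_model and skip_newidle_for_cache_hot() are hypothetical
names, not anything in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two rq fields the hunk reads. */
struct rq_model {
	uint64_t cache_hot_timeout;	/* absolute time until which the CPU is reserved */
	uint64_t max_idle_balance_cost;	/* cost estimate used as the cutoff */
};

/*
 * Return true when the reservation is still pending and will expire
 * sooner than max_idle_balance_cost, i.e. when a short sleeper is
 * expected back before a newidle balance is estimated to finish.
 */
static bool skip_newidle_for_cache_hot(const struct rq_model *rq, uint64_t now)
{
	int64_t delta;

	assert(rq);

	/* Signed difference, so an expired timeout shows up as delta <= 0. */
	delta = (int64_t)(rq->cache_hot_timeout - now);

	return delta > 0 && (uint64_t)delta < rq->max_idle_balance_cost;
}
```

So a reservation that expires far in the future (delta larger than the
balance cost) does not block the newidle balance; only an imminent
wakeup does.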

It would be good to confirm this. Let me get some IBS data for
hackbench, which is the workload that benefits from a sticky wakeup.

--
Thanks and Regards
gautham.



> 
> 
> > To be honest, this isn't so bad, because we have been able to avoid a
> > migration in this case.
> > 
> > > 
> > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > > CPU but it is skipped due to cache-hot?
> > 
> > I had originally thought about this, but then as you pointed out we
> > have an opportunity to pick the previous cpu in the early checks
> > inside select_idle_sibling().
> > 
> > > 
> > > > Have you considered recording p1's identity in the
> > > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > > return the previous CPU if it is cache hot and the wakee is
> > > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > > scan.
> > > > 
> > > 
> > > Yes, this seems to be doable, and one problem would be, if there are
> > > more than 2 dequeued tasks that prefer the same (previous) CPU, which
> > > task should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> > > want to deal with multiple dequeued tasks. If we record all of them,
> > > it might be costly.
> > 
> > If there are multiple dequeued tasks, then it doesn't make sense to
> > record the identity of the tasks. However, we need the bail out to be
> > much earlier, in select_task_rq_fair(), perhaps even before the
> > want_affine() checks.
> > 
> > After all, if the previous CPU is idle, and its cache_hot_timeout
> > hasn't expired, and if the wakee's sleep duration is less than the
> > cache_hot_timeout, why don't we just pick it here and be done with it?
> > 
> 
> Yes we can return the previous CPU earlier, one concern is that, should
> we honor WF_SYNC flag or not,  because in wake_affine_idle(), WF_SYNC
> seems to have a higher priority than available_idle_cpu(prev_cpu). Say,
> if the current CPU has 1 running task, and the previous CPU is idle,
> wake_affine_idle() still prefers the current CPU.
> 
> thanks,
> Chenyu