linux-kernel - Re: [PATCH 1/1] sched: cfs_rq h_load might not update due to irq disable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191121123804.GR4097@hirez.programming.kicks-ass.net>
Date:   Thu, 21 Nov 2019 13:38:04 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     YT Chang <yt.chang@...iatek.com>
Cc:     Matthias Brugger <matthias.bgg@...il.com>,
        wsd_upstream@...iatek.com, linux-kernel@...r.kernel.org,
        linux-arm-kernel@...ts.infradead.org,
        linux-mediatek@...ts.infradead.org,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH 1/1] sched: cfs_rq h_load might not update due to irq
 disable

On Thu, Nov 21, 2019 at 04:30:09PM +0800, YT Chang wrote:
> Syndrome:
> 
> Two CPUs might do idle balance in the same time.
> One CPU does idle balance and pulls some tasks.
> However before pick next task, ALL task are pulled back to other CPU.
> That results in infinite loop in both CPUs.

Can you easily reproduce this?

> =========================================
> code flow:
> 
> in pick_next_task_fair()
> 
> again:
> 
> if nr_running == 0
> 	goto idle
> pick next task
> 	return
> 
> idle:
> 	idle_balance
>        /* pull some tasks from other CPU,
>         * However other CPU are also do idle balance,
> 	* and pull back these task */
> 
> 	go to again
> 
> =========================================
> The result to pull ALL tasks back when the task_h_load
> is incorrect and too low.

Clearly you're not running a PREEMPT kernel, otherwise the break in
detach_tasks() would've saved you, right?

> static unsigned long task_h_load(struct task_struct *p)
> {
>         struct cfs_rq *cfs_rq = task_cfs_rq(p);
> 
> 	update_cfs_rq_h_load(cfs_rq);
> 	return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
> 			cfs_rq->runnable_load_avg + 1);
> }
> 
> The cfs_rq->h_load is incorrect and might too small.
> The original idea of cfs_rq::last_h_load_update will not
> update cfs_rq::h_load more than once a jiffies.
> When the Two CPUs pull each other in the pick_next_task_fair,
> the irq disabled and result in jiffie not update.
> (Other CPUs wait for runqueue lock locked by the two CPUs.
> So, ALL CPUs are irq disabled.)

This cannot be true; because the loop drops rq->lock, so other CPUs
should have an opportunity to acquire the lock and make progress.

> Solution:
> cfs_rq h_load might not update due to irq disable
> use sched_clock instead jiffies
> 
> Signed-off-by: YT Chang <yt.chang@...iatek.com>
> ---
>  kernel/sched/fair.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e..231c53f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7578,9 +7578,11 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
>  	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
> -	unsigned long now = jiffies;
> +	u64 now = sched_clock_cpu(cpu_of(rq));
>  	unsigned long load;
>  
> +	now = now * HZ >> 30;
> +
>  	if (cfs_rq->last_h_load_update == now)
>  		return;
>  

This is disguisting and wrong. That is not the correct relation between
sched_clock() and jiffies.