Date: Mon, 20 May 2024 16:20:31 +0100
From: Luis Machado <luis.machado@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
 linux-kernel@...r.kernel.org, kprateek.nayak@....com,
 wuyun.abel@...edance.com, tglx@...utronix.de, efault@....de, nd
 <nd@....com>, John Stultz <jstultz@...gle.com>, Hongyan.Xia2@....com
Subject: Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

On 5/15/24 12:48, Peter Zijlstra wrote:
> On Wed, May 15, 2024 at 11:36:49AM +0200, Peter Zijlstra wrote:
>> On Fri, May 10, 2024 at 03:49:46PM +0100, Luis Machado wrote:
>>> Just a quick update on this. While investigating this behavior, I
>>> spotted very high loadavg values on an idle system. For instance:
>>>
>>> load average: 4733.84, 4721.24, 4680.33
>>>
>>> I wonder if someone else also spotted this.
>>
>> Hadn't spotted it, but now that you mention it, I can definitely see it.
>>
>> Let me go prod with something sharp. Thanks!
> 
> What's the point of making notes if you then don't read them... *sigh*.

Makes it look like you did read them? :-)

> 
> Does this help?
> 

It does address the load_avg issues, as Mike G pointed out. Thanks for the quick patch!

Unfortunately it didn't help with the energy regression on my end, so my
investigation continues.

Something still seems to be driving tasks off the smaller cores and onto the
bigger cores, and the load_avg accounting proved to be a red herring there.
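
For reference, the inflated load average fits the way loadavg is sampled:
each runqueue contributes nr_running + nr_uninterruptible, so if an
nr_uninterruptible++ can happen without the matching decrement ever running,
the sampled active count only grows and the reported average climbs on an
otherwise idle system. Below is a rough stand-alone sketch of that effect
(not the kernel's calc_load code, just an illustration; the fixed-point
constants mirror include/linux/sched/loadavg.h and the leak is simulated):

/*
 * Stand-alone illustration only, not kernel code: shows how a leaked
 * nr_uninterruptible increment inflates the sampled load average.
 */
#include <stdio.h>

#define FIXED_1 (1 << 11)      /* fixed-point 1.0, as in loadavg.h */
#define EXP_1   1884           /* 1-min decay factor for a 5 s sample */

static unsigned long decay(unsigned long avg, unsigned long active)
{
        /* avg = avg*exp + active*(1 - exp), everything in fixed point */
        return (avg * EXP_1 + active * (FIXED_1 - EXP_1)) / FIXED_1;
}

int main(void)
{
        unsigned long nr_running = 0;        /* nothing runnable: idle box */
        unsigned long nr_uninterruptible = 0;
        unsigned long avenrun = 0;           /* 1-minute average, fixed point */

        for (int tick = 0; tick < 1000; tick++) {
                /* simulate an increment whose matching decrement never runs */
                if (tick % 4 == 0)
                        nr_uninterruptible++;

                unsigned long active = (nr_running + nr_uninterruptible) * FIXED_1;
                avenrun = decay(avenrun, active);
        }

        printf("load average: %lu.%02lu (with zero runnable tasks)\n",
               avenrun / FIXED_1, (avenrun % FIXED_1) * 100 / FIXED_1);
        return 0;
}

The printed average just tracks the leaked count rather than any runnable
load, which is the same shape as the numbers I reported above.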

> ---
>  kernel/sched/core.c  | 23 ++++++++++++-----------
>  kernel/sched/fair.c  |  4 ++--
>  kernel/sched/sched.h |  8 ++++++++
>  3 files changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5ffd7e047393..43f061bcfe54 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2154,14 +2154,18 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
>  
>  void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> -	bool sleep = flags & DEQUEUE_SLEEP;
> +	SCHED_WARN_ON(flags & DEQUEUE_SLEEP);
>  
> -	if (dequeue_task(rq, p, flags)) {
> -		WRITE_ONCE(p->on_rq, sleep ? 0 : TASK_ON_RQ_MIGRATING);
> -		ASSERT_EXCLUSIVE_WRITER(p->on_rq);
> -	} else {
> -		SCHED_WARN_ON(!sleep); /* only sleep can fail */
> -	}
> +	dequeue_task(rq, p, flags);
> +
> +	WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
> +	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
> +}
> +
> +static void block_task(struct rq *rq, struct task_struct *p, int flags)
> +{
> +	if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
> +		__block_task(rq, p);
>  }
>  
>  static inline int __normal_prio(int policy, int rt_prio, int nice)
> @@ -6693,9 +6697,6 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  				!(prev_state & TASK_NOLOAD) &&
>  				!(prev_state & TASK_FROZEN);
>  
> -			if (prev->sched_contributes_to_load)
> -				rq->nr_uninterruptible++;
> -
>  			/*
>  			 * __schedule()			ttwu()
>  			 *   prev_state = prev->state;    if (p->on_rq && ...)
> @@ -6707,7 +6708,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
>  			 *
>  			 * After this, schedule() must not care about p->state any more.
>  			 */
> -			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> +			block_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
>  
>  			if (prev->in_iowait) {
>  				atomic_inc(&rq->nr_iowait);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 536eabcb1a71..596a5fabe490 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7032,8 +7032,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  			util_est_update(&rq->cfs, p, task_sleep);
>  			hrtick_update(rq);
>  
> -			/* Fix-up what deactivate_task() skipped. */
> -			WRITE_ONCE(p->on_rq, 0);
> +			/* Fix-up what block_task() skipped. */
> +			__block_task(rq, p);
>  		}
>  	}
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 12841d8195c5..48e5f49d9bc2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2560,6 +2560,14 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)
>  	sched_update_tick_dependency(rq);
>  }
>  
> +static inline void __block_task(struct rq *rq, struct task_struct *p)
> +{
> +	WRITE_ONCE(p->on_rq, 0);
> +	ASSERT_EXCLUSIVE_WRITER(p->on_rq);
> +	if (p->sched_contributes_to_load)
> +		rq->nr_uninterruptible++;
> +}
> +
>  extern void activate_task(struct rq *rq, struct task_struct *p, int flags);
>  extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
>  

