Message-ID: <20230831184554.GB531917@maniforge>
Date:   Thu, 31 Aug 2023 13:45:54 -0500
From:   David Vernet <void@...ifault.com>
To:     K Prateek Nayak <kprateek.nayak@....com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
        roman.gushchin@...ux.dev, gautham.shenoy@....com,
        aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ
 feature within newidle_balance

On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
> This patch takes the relevant optimizations from [1] in
> newidle_balance(). Following is the breakdown:

Thanks for working on this. I think the fix you added for skipping <=
LLC domains makes sense. The others possibly as well -- left some
comments below!

> 
> - Check "rq->rd->overload" before jumping into newidle_balance, even
>   with SHARED_RQ feat enabled.

Out of curiosity -- did you observe this making a material difference in
your tests? Having thought about it some more, I see the argument for
checking whether we're overloaded, but I still think it's better to
always check the SHARED_RUNQ. rd->overload is only set in
find_busiest_group() when we load balance, so I worry that having
SHARED_RUNQ follow rd->overload may just end up making it redundant with
normal load balancing in many cases.

So yeah, while I certainly understand the idea (and would like to better
understand what kind of difference it made in your tests), I still feel
pretty strongly that SHARED_RUNQ makes the most sense as a feature when
it ignores all of these heuristics and just tries to maximize work
conservation.

What do you think?
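
To make the concern concrete, here is a toy user-space sketch (not
kernel code). The struct and the model_* helpers are hypothetical names
made up for illustration, and it assumes the overload flag can lag
behind what is actually queued in the shared runq:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; not the kernel's root_domain or shared_runq. */
struct model_state {
	bool rd_overload;	/* stand-in for rq->rd->overload */
	int shared_runq_len;	/* tasks queued in this LLC's shared runq */
};

/* Policy A: only look at the shared runq when the overload flag is set. */
static bool model_pick_gated(const struct model_state *s)
{
	assert(s);
	return s->rd_overload && s->shared_runq_len > 0;
}

/* Policy B: always look at the shared runq, ignoring the flag. */
static bool model_pick_always(const struct model_state *s)
{
	assert(s);
	return s->shared_runq_len > 0;
}
```

With a stale flag (rd_overload == false, shared_runq_len == 2), the
gated policy goes idle while the always-check policy pulls a task. With
the flag set, the two behave the same in this model.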

> - Call update_next_balance() for all the domains till MC Domain
>   when SHARED_RQ path is taken.

I _think_ this makes sense. Though even in this case, I feel that it may
be slightly confusing and/or incorrect to push back the balance time
just because we didn't find a task in our current CCX's shared_runq.
Maybe we should avoid mucking with load balancing? Not sure, but I am
leaning towards what you're proposing here as a better approach.

> - Account cost from shared_runq_pick_next_task() and update
>   curr_cost and sd->max_newidle_lb_cost accordingly.

Yep, I think this is the correct thing to do.
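
As a rough illustration of why the accounting matters, here is a
simplified user-space model. The model_* types and helpers are
hypothetical, the periodic decay of max_newidle_lb_cost is left out,
and the numbers in the note below are made up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one sched_domain field used here. */
struct model_sd {
	uint64_t max_newidle_lb_cost;
};

/* Hypothetical stand-in for the one rq field used here. */
struct model_rq {
	uint64_t avg_idle;
};

/* Remember the largest cost seen for this domain (decay omitted). */
static void model_update_newidle_cost(struct model_sd *sd, uint64_t cost)
{
	assert(sd);
	if (cost > sd->max_newidle_lb_cost)
		sd->max_newidle_lb_cost = cost;
}

/* Charge a shared runq pick spanning [t0, t1] to curr_cost and to sd. */
static uint64_t model_account_pick(struct model_sd *sd, uint64_t curr_cost,
				   uint64_t t0, uint64_t t1)
{
	curr_cost += t1 - t0;
	model_update_newidle_cost(sd, curr_cost);
	return curr_cost;
}

/* Same shape as the bail-out check at the top of the domain loop. */
static bool model_should_stop(const struct model_rq *rq,
			      const struct model_sd *sd, uint64_t curr_cost)
{
	return rq->avg_idle < curr_cost + sd->max_newidle_lb_cost;
}
```

For example, with avg_idle = 350 and a parent domain whose
max_newidle_lb_cost is 100, a pick that took 300 time units makes the
loop bail out (350 < 300 + 100), whereas with curr_cost left at 0 it
would go on to balance that domain.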

> 
> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>   shared_runq_pick_next_task() is responsible for calling
>   rq_repin_lock() if the return value is non zero. (Needs to be verified
>   everything is right with LOCKDEP)

Still need to think more about this, but it's purely tactical and can
easily be fixed if we need to.

> 
> - Includes a fix to skip directly above the LLC domain when calling the
>   load_balance() in newidle_balance()

Big fix, thanks again for noticing it.
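
For anyone following along, the flag-based skip can be sketched in user
space roughly like this. It's a toy model only: the flag value and the
model_* names are made up, and the real loop does more per domain:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; the kernel's SD_SHARE_PKG_RESOURCES value differs. */
#define MODEL_SD_SHARE_PKG_RESOURCES 0x1u

/* Hypothetical stand-in for struct sched_domain. */
struct model_domain {
	unsigned int flags;
	struct model_domain *parent;
};

/*
 * Walk from the lowest domain upward and count the domains that would
 * reach load_balance(). With the shared runq enabled, domains at or
 * below the LLC (those carrying the flag) are skipped.
 */
static int model_count_balanced(const struct model_domain *lowest,
				int shared_runq_enabled)
{
	const struct model_domain *sd;
	int balanced = 0;

	assert(lowest != NULL);
	for (sd = lowest; sd != NULL; sd = sd->parent) {
		if (shared_runq_enabled &&
		    (sd->flags & MODEL_SD_SHARE_PKG_RESOURCES))
			continue;
		balanced++;
	}
	return balanced;
}
```

On a three-level SMT -> MC -> NUMA chain where only SMT and MC carry
the flag, just the NUMA domain is balanced when the shared runq is
enabled, and all three are balanced when it is disabled.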

> All other surgery from [1] has been removed.
> 
> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
>  kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 67 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf844ffa79c2..446ffdad49e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>  		rq_unpin_lock(rq, &src_rf);
>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>  	}
> -	rq_repin_lock(rq, rf);
>  
>  	return ret;
>  }
> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	if (!cpu_active(this_cpu))
>  		return 0;
>  
> -	if (sched_feat(SHARED_RUNQ)) {
> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> -		if (pulled_task)
> -			return pulled_task;
> -	}
> -
>  	/*
>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>  	 * measure the duration of idle_balance() as idle time.
>  	 */
>  	this_rq->idle_stamp = rq_clock(this_rq);
>  
> -	/*
> -	 * This is OK, because current is on_cpu, which avoids it being picked
> -	 * for load-balance and preemption/IRQs are still disabled avoiding
> -	 * further scheduler activity on it and we're being very careful to
> -	 * re-start the picking loop.
> -	 */
> -	rq_unpin_lock(this_rq, rf);
> -
>  	rcu_read_lock();
> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> -
> -	/*
> -	 * Skip <= LLC domains as they likely won't have any tasks if the
> -	 * shared runq is empty.
> -	 */
> -	if (sched_feat(SHARED_RUNQ)) {
> +	if (sched_feat(SHARED_RUNQ))
>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
> -		if (likely(sd))
> -			sd = sd->parent;
> -	}
> +	else
> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
>  	if (!READ_ONCE(this_rq->rd->overload) ||
> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> +	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
> +	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>  
> -		if (sd)
> +		while (sd) {
>  			update_next_balance(sd, &next_balance);
> +			sd = sd->child;
> +		}
> +
>  		rcu_read_unlock();
>  
>  		goto out;
>  	}
> +
> +	if (sched_feat(SHARED_RUNQ)) {
> +		struct sched_domain *tmp = sd;
> +
> +		t0 = sched_clock_cpu(this_cpu);
> +
> +		/* Do update_next_balance() for all domains within LLC */
> +		while (tmp) {
> +			update_next_balance(tmp, &next_balance);
> +			tmp = tmp->child;
> +		}
> +
> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
> +		if (pulled_task) {
> +			if (sd) {
> +				curr_cost = sched_clock_cpu(this_cpu) - t0;
> +				/*
> +				 * Will help bail out of scans of higer domains
> +				 * slightly earlier.
> +				 */
> +				update_newidle_cost(sd, curr_cost);
> +			}
> +
> +			rcu_read_unlock();
> +			goto out_swq;
> +		}
> +
> +		if (sd) {
> +			t1 = sched_clock_cpu(this_cpu);
> +			curr_cost += t1 - t0;
> +			update_newidle_cost(sd, curr_cost);
> +		}
> +
> +		/*
> +		 * Since shared_runq_pick_next_task() can take a while
> +		 * check if the CPU was targetted for a wakeup in the
> +		 * meantime.
> +		 */
> +		if (this_rq->ttwu_pending) {
> +			rcu_read_unlock();
> +			return 0;
> +		}

At first I was wondering whether we should do this above
update_newidle_cost(), but I think it makes sense to always call
update_newidle_cost() after we've failed to get a task from
shared_runq_pick_next_task().

> +	}
>  	rcu_read_unlock();
>  
> +	/*
> +	 * This is OK, because current is on_cpu, which avoids it being picked
> +	 * for load-balance and preemption/IRQs are still disabled avoiding
> +	 * further scheduler activity on it and we're being very careful to
> +	 * re-start the picking loop.
> +	 */
> +	rq_unpin_lock(this_rq, rf);

Don't you need to do this before you exit on the rq->ttwu_pending path?

>  	raw_spin_rq_unlock(this_rq);
>  
>  	t0 = sched_clock_cpu(this_cpu);
> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>  			break;
>  
> +		/*
> +		 * Skip <= LLC domains as they likely won't have any tasks if the
> +		 * shared runq is empty.
> +		 */
> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
> +			continue;
> +
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>  
>  			pulled_task = load_balance(this_cpu, this_rq,
> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  
>  	raw_spin_rq_lock(this_rq);
>  
> +out_swq:
>  	if (curr_cost > this_rq->max_idle_balance_cost)
>  		this_rq->max_idle_balance_cost = curr_cost;
>  


Thanks,
David
