linux-kernel - Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ feature within newidle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <eb953751-0691-c932-4d53-0e8ab7653bb2@amd.com>
Date:   Fri, 1 Sep 2023 01:17:21 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     David Vernet <void@...ifault.com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
        roman.gushchin@...ux.dev, gautham.shenoy@....com,
        aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [RFC PATCH 2/3] sched/fair: Improve integration of SHARED_RUNQ
 feature within newidle_balance

Hello David,

Thank you for taking a look at this despite being on vacation.

On 9/1/2023 12:15 AM, David Vernet wrote:
> On Thu, Aug 31, 2023 at 04:15:07PM +0530, K Prateek Nayak wrote:
>> This patch takes the relevant optimizations from [1] in
>> newidle_balance(). Following is the breakdown:
> 
> Thanks for working on this. I think the fix you added for skipping <=
> LLC domains makes sense. The others possibly as well

I too am in doubt with some of them but I left them in since I was
building on top of the cumulative diff.

> -- left some
> comments below!
> 
>>
>> - Check "rq->rd->overload" before jumping into newidle_balance, even
>>   with SHARED_RQ feat enabled.
> 
> Out of curiosity -- did you observe this making a material difference in
> your tests? After thinking about it some more, though I see the argument
> for why it would be logical to check if we're overloaded, I'm still
> thinking that it's more ideal to just always check the SHARED_RUNQ.
> rd->overload is only set in find_busiest_group() when we load balance,
> so I worry that having SHARED_RUNQ follow rd->overload may just end up
> making it redundant with normal load balancing in many cases.
> 
> So yeah, while I certainly understand the idea (and would like to better
> understand what kind of difference it made in your tests), I still feel
> pretty strongly that SHARED_RUNQ makes the most sense as a feature when
> it ignores all of these heuristics and just tries to maximize work
> conservation.
> 
> What do you think?

Actually, as it turns out, it was probably a combination of the
rq->avg_idle check + updating of cost that got the performance back
during experimenting. In Patch 3, I've give the results with this patch
alone and it makes no difference, for tbench 128-client at least. There
is the same rq lock contention I mentioned previously which is why the
per-shard "overload" flag.

Based on Anna-Maria's observation in [1], we have a short idling, spread
across the system with tbench. Now it is possible we are doing a
newidle_balance() when it would have been better off to let the CPU idle
for that short duration without and not cause a contention for the rq
lock.

[1] https://lore.kernel.org/lkml/80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de/

> 
>> - Call update_next_balance() for all the domains till MC Domain in
>>   when SHARED_RQ path is taken.
> 
> I _think_ this makes sense. Though even in this case, I feel that it may
> be slightly confusing and/or incorrect to push back the balance time
> just because we didn't find a task in our current CCX's shared_runq.
> Maybe we should avoid mucking with load balancing? Not sure, but I am
> leaning towards what you're proposing here as a better approach.

This requires a deeper look and more testing yes.

> 
>> - Account cost from shared_runq_pick_next_task() and update
>>   curr_cost and sd->max_newidle_lb_cost accordingly.
> 
> Yep, I think this is the correct thing to do.
> 
>>
>> - Move the initial rq_unpin_lock() logic around. Also, the caller of
>>   shared_runq_pick_next_task() is responsible for calling
>>   rq_repin_lock() if the return value is non zero. (Needs to be verified
>>   everything is right with LOCKDEP)
> 
> Still need to think more about this, but it's purely just tactical and
> can easily be fixed it we need.

I agree. I'll leave the full picture of this below in
[Locking code movement clarifications] since we seem to keep coming back
to this and it would be good to have more eyes on what is going on in my
mind :)

> 
>>
>> - Includes a fix to skip directly above the LLC domain when calling the
>>   load_balance() in newidle_balance()
> 
> Big fix, thanks again for noticing it.
> 
>> All other surgery from [1] has been removed.
>>
>> Link: https://lore.kernel.org/all/31aeb639-1d66-2d12-1673-c19fed0ab33a@amd.com/ [1]
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
>> ---
>>  kernel/sched/fair.c | 94 ++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 67 insertions(+), 27 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bf844ffa79c2..446ffdad49e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -337,7 +337,6 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>  		rq_unpin_lock(rq, &src_rf);
>>  		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>  	}
>> -	rq_repin_lock(rq, rf);
>>  
>>  	return ret;
>>  }
>> @@ -12276,50 +12275,83 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	if (!cpu_active(this_cpu))
>>  		return 0;
>>  
>> -	if (sched_feat(SHARED_RUNQ)) {
>> -		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> -		if (pulled_task)
>> -			return pulled_task;
>> -	}
>> -
>>  	/*
>>  	 * We must set idle_stamp _before_ calling idle_balance(), such that we
>>  	 * measure the duration of idle_balance() as idle time.
>>  	 */
>>  	this_rq->idle_stamp = rq_clock(this_rq);
>>  
>> -	/*
>> -	 * This is OK, because current is on_cpu, which avoids it being picked
>> -	 * for load-balance and preemption/IRQs are still disabled avoiding
>> -	 * further scheduler activity on it and we're being very careful to
>> -	 * re-start the picking loop.
>> -	 */
>> -	rq_unpin_lock(this_rq, rf);
>> -
>>  	rcu_read_lock();
>> -	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> -	/*
>> -	 * Skip <= LLC domains as they likely won't have any tasks if the
>> -	 * shared runq is empty.
>> -	 */
>> -	if (sched_feat(SHARED_RUNQ)) {
>> +	if (sched_feat(SHARED_RUNQ))
>>  		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> -		if (likely(sd))
>> -			sd = sd->parent;
>> -	}
>> +	else
>> +		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  
>>  	if (!READ_ONCE(this_rq->rd->overload) ||
>> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>> +	    /* Look at rq->avg_idle iff SHARED_RUNQ is disabled */
>> +	    (!sched_feat(SHARED_RUNQ) && sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>  
>> -		if (sd)
>> +		while (sd) {
>>  			update_next_balance(sd, &next_balance);
>> +			sd = sd->child;
>> +		}
>> +
>>  		rcu_read_unlock();
>>  
>>  		goto out;
>>  	}
>> +
>> +	if (sched_feat(SHARED_RUNQ)) {
>> +		struct sched_domain *tmp = sd;
>> +
>> +		t0 = sched_clock_cpu(this_cpu);
>> +
>> +		/* Do update_next_balance() for all domains within LLC */
>> +		while (tmp) {
>> +			update_next_balance(tmp, &next_balance);
>> +			tmp = tmp->child;
>> +		}
>> +
>> +		pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> +		if (pulled_task) {
>> +			if (sd) {
>> +				curr_cost = sched_clock_cpu(this_cpu) - t0;
>> +				/*
>> +				 * Will help bail out of scans of higer domains
>> +				 * slightly earlier.
>> +				 */
>> +				update_newidle_cost(sd, curr_cost);
>> +			}
>> +
>> +			rcu_read_unlock();
>> +			goto out_swq;
>> +		}
>> +
>> +		if (sd) {
>> +			t1 = sched_clock_cpu(this_cpu);
>> +			curr_cost += t1 - t0;
>> +			update_newidle_cost(sd, curr_cost);
>> +		}
>> +
>> +		/*
>> +		 * Since shared_runq_pick_next_task() can take a while
>> +		 * check if the CPU was targetted for a wakeup in the
>> +		 * meantime.
>> +		 */
>> +		if (this_rq->ttwu_pending) {
>> +			rcu_read_unlock();
>> +			return 0;
>> +		}
> 
> At first I was wondering whether we should do this above
> update_newidle_cost(), but I think it makes sense to always call
> update_newidle_cost() after we've failed to get a task from
> shared_runq_pick_next_task().

Indeed. I think the cost might be useful to be accounted for.

> 
>> +	}
>>  	rcu_read_unlock();
>>  
>> +	/*
>> +	 * This is OK, because current is on_cpu, which avoids it being picked
>> +	 * for load-balance and preemption/IRQs are still disabled avoiding
>> +	 * further scheduler activity on it and we're being very careful to
>> +	 * re-start the picking loop.
>> +	 */
>> +	rq_unpin_lock(this_rq, rf);
> 
> Don't you need to do this before you exit on the rq->ttwu_pending path?

[Locking code movement clarifications]

Okay this is where I'll put all the locking bits I have in my head:

o First, the removal of rq_repin_lock() in shared_runq_pick_next_task()

  Since this is only called from newidle_balance(), it is easy to
  isolate the changes. shared_runq_pick_next_task() can return either
  0, 1 or -1. The interpretation is same as return value of
  newidle_balance():

   0: Unsuccessful at pulling task but the rq lock was never released
      and reacquired - it was held all the time.

   1: Task was pulled successfully. The rq lock was released and
      reacquired in the process but now, with the above changes, it is
      not pinned.

  -1: Unsuccessful at pulling task but the rq lock was released and
      reacquired in the process and now, with the above changes, it is
      not pinned.

  Now the following block:

	pulled_task = shared_runq_pick_next_task(this_rq, rf);
	if (pulled_task) {
		...
		goto out_swq;
	}

  takes care of the case where return values are -1, or 1. The "out_swq"
  label is almost towards the end of newidle_balance() and just before
  returning, the newidle_balance() does:

	rq_repin_lock(this_rq, rf);

  So this path will repin the lock.

  Now for the case where shared_runq_pick_next_task() return 0.

o Which brings us to the question you asked above

  newidle_balance() is called with the rq lock held and pinned, and it
  expects the same when newidle_balance() reruns. The very first bailout
  check in newidle_balance() is:

	if (this_rq->ttwu_pending)
		return 0;

  so we return without doing any changed to the state of rq lock.

  Coming to the above changes, if we have to hit the ttwu_pending
  bailout you pointed at, shared_runq_pick_next_task() should return 0,
  signifying no modification to state of the lock or pinning. Then we
  update the cost, and come to ttwu_pending check. We still have the
  lock held, and it is pinned. Thus we do not need to unpin the lock
  since we newidle_balance() is expected to return with lock held and
  it being pinned.

Please let me know if I've missed something.

> 
>>  	raw_spin_rq_unlock(this_rq);
>>  
>>  	t0 = sched_clock_cpu(this_cpu);
>> @@ -12335,6 +12367,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>>  			break;
>>  
>> +		/*
>> +		 * Skip <= LLC domains as they likely won't have any tasks if the
>> +		 * shared runq is empty.
>> +		 */
>> +		if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> +			continue;
>> +
>>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>>  
>>  			pulled_task = load_balance(this_cpu, this_rq,
>> @@ -12361,6 +12400,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  
>>  	raw_spin_rq_lock(this_rq);
>>  
>> +out_swq:
>>  	if (curr_cost > this_rq->max_idle_balance_cost)
>>  		this_rq->max_idle_balance_cost = curr_cost;
>>  
> 
> 
> Thanks,
> David
 
--
Thanks and Regards,
Prateek