Message-ID: <1ef98b4e-80da-b524-3a01-bdf3b85f1db3@amd.com>
Date: Thu, 31 Aug 2023 09:51:01 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: David Vernet <void@...ifault.com>
Cc: "Gautham R. Shenoy" <gautham.shenoy@....com>,
linux-kernel@...r.kernel.org, peterz@...radead.org,
mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
roman.gushchin@...ux.dev, aaron.lu@...el.com,
wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [PATCH v3 0/7] sched: Implement shared runqueue in CFS
Hello David,
Thank you for taking a look at the report.
On 8/31/2023 8:02 AM, David Vernet wrote:
> On Wed, Aug 30, 2023 at 03:26:40PM +0530, K Prateek Nayak wrote:
>> Hello David,
>
> Hello Prateek,
>
>>
>> Short update based on some of my experimentation.
>>
>> Disclaimer: I've only been looking at the tbench 128-client case on a dual
>> socket 3rd Generation EPYC system (2x 64C/128T). Wider results may
>> vary, but I have some information that may help with the debugging and
>> with proceeding further.
>>
>> On 8/25/2023 4:21 AM, David Vernet wrote:
>>> On Thu, Aug 24, 2023 at 04:44:19PM +0530, Gautham R. Shenoy wrote:
>>>> Hello David,
>>>>
>>>> On Fri, Aug 18, 2023 at 02:19:03PM +0530, Gautham R. Shenoy wrote:
>>>>> Hello David,
>>>>>
>>>>> On Fri, Aug 18, 2023 at 12:03:55AM -0500, David Vernet wrote:
>>>>>> On Thu, Aug 17, 2023 at 02:12:03PM +0530, Gautham R. Shenoy wrote:
>>>>>>> Hello David,
>>>>>>
>>>>>> Hello Gautham,
>>>>>>
>>>>>> Thanks a lot as always for running some benchmarks and analyzing these
>>>>>> changes.
>>>>>>
>>>>>>> On Wed, Aug 09, 2023 at 05:12:11PM -0500, David Vernet wrote:
>>>>>>>> Changes
>>>>>>>> -------
>>>>>>>>
>>>>>>>> This is v3 of the shared runqueue patchset. This patch set is based off
>>>>>>>> of commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>>>>>>>> bandwidth in use") on the sched/core branch of tip.git.
>>>>>>>
>>>>>>>
>>>>>>> I tested the patches on Zen3 and Zen4 EPYC Servers like last time. I
>>>>>>> notice that apart from hackbench, every other benchmark is showing
>>>>>>> regressions with this patch series. Quick summary of my observations:
>>>>>>
>>>>>> Just to verify per our prior conversation [0], was this latest set of
>>>>>> benchmarks run with boost disabled?
>>>>>
>>>>> Boost is enabled by default. I will queue a run tonight with boost
>>>>> disabled.
>>>>
>>>> Apologies for the delay. I didn't see any changes with boost-disabled
>>>> and with reverting the optimization to bail out of the
>>>> newidle_balance() for SMT and MC domains when there was no task to be
>>>> pulled from the shared-runq. I reran the whole thing once again, just
>>>> to rule out any possible variance. The results came out the same.
>>>
>>> Thanks a lot for taking the time to run more benchmarks.
>>>
>>>> With the boost disabled, and the optimization reverted, the results
>>>> don't change much.
>>>
>>> Hmmm, I see. So, that was the only real substantive "change" between v2
>>> -> v3. The other changes were supporting hotplug / domain recreation,
>>> optimizing locking a bit, and fixing small bugs like the return value
>>> from shared_runq_pick_next_task(), draining the queue when the feature
>>> is disabled, and fixing the lkp errors.
>>>
>>> With all that said, it seems very possible that the regression is due to
>>> changes in sched/core between commit ebb83d84e49b ("sched/core: Avoid
>>> multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") in v2,
>>> and commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs
>>> bandwidth in use") in v3. EEVDF was merged in that window, so that could
>>> be one explanation for the context switch rate being so much higher.
>>>
>>>> It doesn't appear that the optimization is the cause for increase in
>>>> the number of load-balancing attempts at the DIE and the NUMA
>>>> domains. I have shared the counts of the newidle_balance with and
>>>> without SHARED_RUNQ below for tbench and it can be noticed that the
>>>> counts are significantly higher for the 64 clients and 128 clients. I
>>>> also captured the counts/s of find_busiest_group() using funccount.py
>>>> which tells the same story. So the drop in the performance for tbench
>>>> with your patches strongly correlates with the increase in
>>>> load-balancing attempts.
>>>>
>>>> newidle balance is undertaken only if the overload flag is set and the
>>>> expected idle duration is greater than the avg load balancing cost. It
>>>> is hard to imagine why should the shared runq cause the overload flag
>>>> to be set!
>>>
>>> Yeah, I'm not sure either about how or why shared_runq would cause this.
>>> This is purely hypothetical, but is it possible that shared_runq causes
>>> idle cores to on average _stay_ idle longer due to other cores pulling
>>> tasks that would have otherwise been load balanced to those cores?
>>>
>>> Meaning -- say CPU0 is idle, and there are tasks on other rqs which
>>> could be load balanced. Without shared_runq, CPU0 might be woken up to
>>> run a task from a periodic load balance. With shared_runq, any active
>>> core that would otherwise have gone idle could pull the task, keeping
>>> CPU0 idle.
>>>
>>> What do you think? I could be totally off here.
>>>
>>> From my perspective, I'm not too worried about this given that we're
>>> seeing gains in other areas such as kernel compile as I showed in [0],
>>> though I definitely would like to better understand it.
>>
>> Let me paste a cumulative diff containing everything I've tried, since
>> that makes it easier to explain.
>>
>> o Performance numbers for tbench 128 clients:
>>
>> tip : 1.00 (Var: 0.57%)
>> tip + vanilla v3 : 0.39 (var: 1.15%) (%diff: -60.74%)
>> tip + v3 + diff : 0.99 (var: 0.61%) (%diff: -00.24%)
>>
>> tip is at commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when
>> cfs bandwidth in use"), same as what Gautham used, so no EEVDF yet.
>>
>> o Cumulative Diff
>>
>> Should apply cleanly on top of tip at above commit + this series as is.
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>> } ____cacheline_aligned;
>>
>> /* This would likely work better as a configurable knob via debugfs */
>> -#define SHARED_RUNQ_SHARD_SZ 6
>> +#define SHARED_RUNQ_SHARD_SZ 16
>> #define SHARED_RUNQ_MAX_SHARDS \
>> ((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
>>
>> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>> }
>>
>> static struct task_struct *
>> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>> {
>> + int target = cpu_of(rq);
>> struct task_struct *p;
>>
>> if (list_empty(&shard->list))
>> return NULL;
>>
>> raw_spin_lock(&shard->lock);
>> +again:
>> p = list_first_entry_or_null(&shard->list, struct task_struct,
>> shared_runq_node);
>> - if (p && is_cpu_allowed(p, target))
>> +
>> + /* If we find a task, delete it from the list regardless */
>> + if (p) {
>> list_del_init(&p->shared_runq_node);
>> - else
>> - p = NULL;
>> +
>> + if (!task_on_rq_queued(p) ||
>> + task_on_cpu(task_rq(p), p) ||
>
> Have you observed !task_on_rq_queued() or task_on_cpu() returning true
> here? The task should have removed itself from the shard when
> __dequeue_entity() is called from set_next_entity() when it's scheduled
> in pick_next_task_fair(). The reason we have to check in
> shared_runq_pick_next_task() is that between dequeuing the task from the
> shared_runq and getting its rq lock, it could have been scheduled on its
> current rq. But if the task was scheduled first, it should have removed
> itself from the shard.
Ah! Thank you for clarifying. This is just a copy-paste of the bailout
in shared_runq_pick_next_task(). I see now that "!task_on_rq_queued()"
cannot be true here since we hold the shard lock, and a task removes
itself from the shard before it is picked. Thank you for pointing this
out.
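For my own notes, my mental model of the dequeue side is roughly the
following (a hand-written sketch from my reading of the series, not the
exact hunk; shard_of() is a made-up placeholder for however the series
maps a task to its shard):

static void shared_runq_dequeue_task(struct task_struct *p)
{
        struct shared_runq_shard *shard = shard_of(p);

        if (!list_empty(&p->shared_runq_node)) {
                raw_spin_lock(&shard->lock);
                /* Recheck under the lock in case a popper raced with us */
                if (!list_empty(&p->shared_runq_node))
                        list_del_init(&p->shared_runq_node);
                raw_spin_unlock(&shard->lock);
        }
}

Since this runs by the time __dequeue_entity() is called from
set_next_entity(), a task still sitting in the shard under shard->lock
should indeed be queued and not running.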
>
>> + !is_cpu_allowed(p, target)) {
>> + if (rq->ttwu_pending) {
>> + p = NULL;
>> + goto out;
>> + }
>
> Have you observed this as well? If the task is enqueued on the ttwu
> queue wakelist, it isn't enqueued on the waking CPU, so it shouldn't be
> added to the shared_runq right?
This is a bailout for the retry logic. Since the shared_rq could contain
many tasks, I did not want to keep retrying until the queue went empty
while this CPU had a pending wakeup. ... [1]
>
>> +
>> + goto again;
>> + }
>> + }
>> +
>> +out:
>> raw_spin_unlock(&shard->lock);
>>
>> return p;
>> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>> curr_idx = (starting_idx + i) % num_shards;
>> shard = &shared_runq->shards[curr_idx];
>>
>> - p = shared_runq_pop_task(shard, cpu_of(rq));
>> + p = shared_runq_pop_task(shard, rq);
>> if (p)
>> break;
>> +
>> + if (rq->ttwu_pending)
>> + return 0;
>
> Same here r.e. rq->ttwu_pending. This should be handled in the
>
> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p))
>
> check below, no? Note that task_on_rq_queued(p) should only return true
> if the task has made it to ttwu_do_activate(), and if it hasn't, I don't
> think it should be in the shard in the first place.
Noted! Thank you for clarifying again.
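Right, for my own reference (roughly from kernel/sched/sched.h):

static inline int task_on_rq_queued(struct task_struct *p)
{
        return p->on_rq == TASK_ON_RQ_QUEUED;
}

and p->on_rq only becomes TASK_ON_RQ_QUEUED from activate_task(), which
on the wakeup side runs via ttwu_do_activate(), so a task sitting in the
shard should have made it that far already.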
>
>> }
>> if (!p)
>> return 0;
>> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>> update_rq_clock(src_rq);
>> src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>> - ret = 1;
>> }
>>
>> if (src_rq != rq) {
>> task_rq_unlock(src_rq, p, &src_rf);
>> raw_spin_rq_lock(rq);
>> } else {
>> + ret = 1;
>> rq_unpin_lock(rq, &src_rf);
>> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>> }
>> - rq_repin_lock(rq, rf);
>
> Huh, wouldn't this cause a WARN to be issued the next time we invoke
> rq_clock() in newidle_balance() if we weren't able to find a task? Or
> was it because we moved the SHARED_RUNQ logic to below where we check
> rq_clock()? In general though, I don't think this should be removed. At
> the very least, it should be tested with lockdep.
So beyond this point ret != 0, and newidle_balance() will now jump to
the "out_swq" label, whose fallthrough path does an
rq_repin_lock(this_rq, rf) just before returning.
I'll check whether my surgery is right with lockdep enabled.
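For context, this is roughly what the tail end of newidle_balance()
looks like with my diff applied (quoting from my tree from memory, so
treat it as approximate), i.e. what the "out_swq" label falls through
to:

out_swq:
        if (curr_cost > this_rq->max_idle_balance_cost)
                this_rq->max_idle_balance_cost = curr_cost;

        /*
         * While browsing the domains, we released the rq lock, a task could
         * have been enqueued in the meantime. Since we're not going idle,
         * pretend we pulled a task.
         */
        if (this_rq->cfs.h_nr_running && !pulled_task)
                pulled_task = 1;

        /* Is there a task of a high priority class? */
        if (this_rq->nr_running != this_rq->cfs.h_nr_running)
                pulled_task = -1;

out:
        /* Move the next balance forward */
        if (time_after(this_rq->next_balance, next_balance))
                this_rq->next_balance = next_balance;

        if (pulled_task)
                this_rq->idle_stamp = 0;
        else
                nohz_newidle_balance(this_rq);

        rq_repin_lock(this_rq, rf);

        return pulled_task;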
>
>> return ret;
>> }
>> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>> if (!cpu_active(this_cpu))
>> return 0;
>>
>> - if (sched_feat(SHARED_RUNQ)) {
>> - pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> - if (pulled_task)
>> - return pulled_task;
>> - }
>> -
>> /*
>> * We must set idle_stamp _before_ calling idle_balance(), such that we
>> * measure the duration of idle_balance() as idle time.
>> */
>> this_rq->idle_stamp = rq_clock(this_rq);
>>
>> - /*
>> - * This is OK, because current is on_cpu, which avoids it being picked
>> - * for load-balance and preemption/IRQs are still disabled avoiding
>> - * further scheduler activity on it and we're being very careful to
>> - * re-start the picking loop.
>> - */
>> - rq_unpin_lock(this_rq, rf);
>> -
>> rcu_read_lock();
>> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> - /*
>> - * Skip <= LLC domains as they likely won't have any tasks if the
>> - * shared runq is empty.
>> - */
>> - if (sched_feat(SHARED_RUNQ)) {
>> + if (sched_feat(SHARED_RUNQ))
>> sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> - if (likely(sd))
>> - sd = sd->parent;
>> - }
>> + else
>> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>
>> if (!READ_ONCE(this_rq->rd->overload) ||
>> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>
>> - if (sd)
>> + while (sd) {
>> update_next_balance(sd, &next_balance);
>> + sd = sd->child;
>> + }
>> +
>> rcu_read_unlock();
>>
>> goto out;
>> }
>> rcu_read_unlock();
>>
>> + t0 = sched_clock_cpu(this_cpu);
>> + if (sched_feat(SHARED_RUNQ)) {
>> + pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> + if (pulled_task) {
>> + curr_cost = sched_clock_cpu(this_cpu) - t0;
>> + update_newidle_cost(sd, curr_cost);
>> + goto out_swq;
>> + }
>> + }
>
> Hmmm, why did you move this further down in newidle_balance()? We don't
> want to skip trying to get a task from the shared_runq if rq->avg_idle <
> sd->max_newidle_lb_cost.
I'll check if the rd->overload check alone is sufficient.
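Something along these lines is what I have in mind (untested; just to
show the shape of it, reusing the hunk from v3 with only the
rd->overload gate added):

        if (sched_feat(SHARED_RUNQ) && READ_ONCE(this_rq->rd->overload)) {
                pulled_task = shared_runq_pick_next_task(this_rq, rf);
                if (pulled_task)
                        return pulled_task;
        }

i.e. keep the shared_runq pick at the top of newidle_balance(), before
any avg_idle vs max_newidle_lb_cost comparison, and only skip it when
the root domain is not overloaded at all.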
>
>> +
>> + /* Check again for pending wakeups */
>> + if (this_rq->ttwu_pending)
>> + return 0;
>> +
>> + t1 = sched_clock_cpu(this_cpu);
>> + curr_cost += t1 - t0;
>> +
>> + if (sd)
>> + update_newidle_cost(sd, curr_cost);
>> +
>> + /*
>> + * This is OK, because current is on_cpu, which avoids it being picked
>> + * for load-balance and preemption/IRQs are still disabled avoiding
>> + * further scheduler activity on it and we're being very careful to
>> + * re-start the picking loop.
>> + */
>> + rq_unpin_lock(this_rq, rf);
>> raw_spin_rq_unlock(this_rq);
>>
>> t0 = sched_clock_cpu(this_cpu);
>> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>> update_next_balance(sd, &next_balance);
>>
>> + /*
>> + * Skip <= LLC domains as they likely won't have any tasks if the
>> + * shared runq is empty.
>> + */
>> + if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> + continue;
>
> This makes sense to me, good call.
>
>> +
>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>> break;
>>
>> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>> raw_spin_rq_lock(this_rq);
>>
>> +out_swq:
>> if (curr_cost > this_rq->max_idle_balance_cost)
>> this_rq->max_idle_balance_cost = curr_cost;
>>
>> --
>>
>> o Breakdown
>>
>> I'll proceed to annotate a copy of diff with reasoning behind the changes:
>
> Ah, ok, you provided explanations :-) I'll leave my questions above
> regardless for posterity.
And I'll refer to answers above ;)
>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -198,7 +198,7 @@ struct shared_runq_shard {
>> } ____cacheline_aligned;
>>
>> /* This would likely work better as a configurable knob via debugfs */
>> -#define SHARED_RUNQ_SHARD_SZ 6
>> +#define SHARED_RUNQ_SHARD_SZ 16
>> #define SHARED_RUNQ_MAX_SHARDS \
>> ((NR_CPUS / SHARED_RUNQ_SHARD_SZ) + (NR_CPUS % SHARED_RUNQ_SHARD_SZ != 0))
>>
>> --
>>
>> Here I'm setting SHARED_RUNQ_SHARD_SZ to the sd_llc_size of
>> my machine. I played around with this and did not see any
>> contention for the shared_rq lock while running tbench.
>
> I don't really mind making the shard bigger because it will never be one
> size fits all, but for what it's worth, I saw less contention in netperf
> with a size of 6, but everything else performed fine with a larger
> shard. This is one of those "there's no right answer" things, I'm
> afraid. I think it will be inevitable to make this configurable at some
> point, if we find that it's really causing inefficiencies.
Agreed! For tbench at least, this did not lead to any problems.
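If/when we do make it configurable, something as dumb as a debugfs u32
may be enough to experiment with (a purely hypothetical sketch, names
made up, and the value would only be consulted whenever the shards get
(re)built):

/* Hypothetical, not part of the series */
static u32 shared_runq_shard_sz = SHARED_RUNQ_SHARD_SZ;

static int __init shared_runq_debugfs_init(void)
{
        /* Shows up as /sys/kernel/debug/shared_runq_shard_sz */
        debugfs_create_u32("shared_runq_shard_sz", 0644, NULL,
                           &shared_runq_shard_sz);
        return 0;
}
late_initcall(shared_runq_debugfs_init);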
>
>> --
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -322,20 +322,36 @@ void shared_runq_toggle(bool enabling)
>> }
>>
>> static struct task_struct *
>> -shared_runq_pop_task(struct shared_runq_shard *shard, int target)
>> +shared_runq_pop_task(struct shared_runq_shard *shard, struct rq *rq)
>> {
>> + int target = cpu_of(rq);
>> struct task_struct *p;
>>
>> if (list_empty(&shard->list))
>> return NULL;
>>
>> raw_spin_lock(&shard->lock);
>> +again:
>> p = list_first_entry_or_null(&shard->list, struct task_struct,
>> shared_runq_node);
>> - if (p && is_cpu_allowed(p, target))
>> +
>> + /* If we find a task, delete it from the list regardless */
>
> As I mentioned in my other reply [0], I don't think we should always
> have to delete here. Let me know if I'm missing something.
I overlooked that condition at enqueue. I'll go back to the original
implementation here.
>
> [0]: https://lore.kernel.org/all/20230831013435.GB506447@maniforge/
>
>> + if (p) {
>> list_del_init(&p->shared_runq_node);
>> - else
>> - p = NULL;
>> +
>> + if (!task_on_rq_queued(p) ||
>> + task_on_cpu(task_rq(p), p) ||
>> + !is_cpu_allowed(p, target)) {
>> + if (rq->ttwu_pending) {
>> + p = NULL;
>> + goto out;
>> + }
>> +
>> + goto again;
>> + }
>> + }
>> +
>> +out:
>> raw_spin_unlock(&shard->lock);
>>
>> return p;
>> --
>>
>> Context: When running perf with IBS, I saw following lock
>> contention:
>>
>> - 12.17% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>> - 10.48% native_queued_spin_lock_slowpath
>> - 10.30% _raw_spin_lock
>> - 9.11% __schedule
>> schedule_idle
>> do_idle
>> + cpu_startup_entry
>> - 0.86% task_rq_lock
>> newidle_balance
>> pick_next_task_fair
>> __schedule
>> schedule_idle
>> do_idle
>> + cpu_startup_entry
>>
>> So I imagined newidle_balance() is contending with another
>> runqueue going idle when pulling a task. Hence, I moved some
>> checks from shared_runq_pick_next_task() to here.
>
> Hmm, so the idea was to avoid contending on the rq lock? As I mentioned
> above, I'm not sure these checks actually buy us anything.
Yup! I think skipping newidle_balance() when rd->overload is not set
already reduces this contention by itself.
>
>> I was not sure if the task's rq lock needs to be held for these
>> checks to give an accurate result, so I've left the original checks in
>> shared_runq_pick_next_task() as they are.
>
> Yep, we need to have the rq lock held for these functions to return
> consistent data.
Noted! Will fall back to your implementation. Also I did not see
any perf improvement for tbench with this alone.
>
>> Since retrying may be costly, I'm using "rq->ttwu_pending" as a
>> bailout condition. Maybe there are better alternatives based on
>> the lb_cost and rq->avg_idle, but this was simpler for now.
>
> Hmm, not sure I'm quite understanding this one. As I mentioned above, I
> don't _think_ this should ever be set for a task enqueued in a shard.
> Were you observing that?
Explained in [1] above. Let me get rid of that whole retry logic
because I see it leading to shared_rq lock contention at enqueue and
dequeue already.
>
>> (Realizing as I write this that this will cause more contention
>> with enqueue/dequeue in a busy system. I'll check if that is the
>> case)
>>
>> P.S. This did not affect the ~60% regression I was seeing one
>> bit, so the problem lies deeper.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -380,9 +396,12 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>> curr_idx = (starting_idx + i) % num_shards;
>> shard = &shared_runq->shards[curr_idx];
>>
>> - p = shared_runq_pop_task(shard, cpu_of(rq));
>> + p = shared_runq_pop_task(shard, rq);
>> if (p)
>> break;
>> +
>> + if (rq->ttwu_pending)
>> + return 0;
>> }
>> if (!p)
>> return 0;
>> --
>>
>> More bailout logic.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -395,17 +414,16 @@ static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>> if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>> update_rq_clock(src_rq);
>> src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>> - ret = 1;
>> }
>>
>> if (src_rq != rq) {
>> task_rq_unlock(src_rq, p, &src_rf);
>> raw_spin_rq_lock(rq);
>> } else {
>> + ret = 1;
>> rq_unpin_lock(rq, &src_rf);
>> raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>> }
>> - rq_repin_lock(rq, rf);
>>
>> return ret;
>> }
>> --
>>
>> Only return 1 if the task is actually pulled, else return -1,
>> signifying the path has released and re-acquired the lock.
>
> Not sure I'm following. If we migrate the task to the current rq, we
> want to return 1 to signify that there are new fair tasks present in the
> rq don't we? It doesn't need to have started there originally for it to
> be present after we move it.
Above move_queued_task(), I see the following comment:
"Returns (locked) new rq. Old rq's lock is released."
so wouldn't "src_rq != rq" signify a failed migration? I'm probably
missing something here.
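For reference, move_queued_task() in kernel/sched/core.c is roughly:

static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
                                   struct task_struct *p, int new_cpu)
{
        lockdep_assert_rq_held(rq);

        deactivate_task(rq, p, DEQUEUE_NOCLOCK);
        set_task_cpu(p, new_cpu);
        rq_unlock(rq, rf);

        rq = cpu_rq(new_cpu);

        rq_lock(rq, rf);
        WARN_ON_ONCE(task_cpu(p) != new_cpu);
        activate_task(rq, p, 0);
        check_preempt_curr(rq, p);

        return rq;
}

which is where I got the impression that src_rq ending up equal to rq
is the "migration happened" case, and src_rq != rq means we never got
to move_queued_task() because the queued/on_cpu check failed.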
>
>>
>> Also leave the rq_repin_lock() part to the caller, i.e.,
>> newidle_balance(), since it makes for a nicer flow (see
>> below).
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12344,50 +12362,59 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>> if (!cpu_active(this_cpu))
>> return 0;
>>
>> - if (sched_feat(SHARED_RUNQ)) {
>> - pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> - if (pulled_task)
>> - return pulled_task;
>> - }
>> -
>> /*
>> * We must set idle_stamp _before_ calling idle_balance(), such that we
>> * measure the duration of idle_balance() as idle time.
>> */
>> this_rq->idle_stamp = rq_clock(this_rq);
>>
>> - /*
>> - * This is OK, because current is on_cpu, which avoids it being picked
>> - * for load-balance and preemption/IRQs are still disabled avoiding
>> - * further scheduler activity on it and we're being very careful to
>> - * re-start the picking loop.
>> - */
>> - rq_unpin_lock(this_rq, rf);
>> -
>> rcu_read_lock();
>> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> -
>> - /*
>> - * Skip <= LLC domains as they likely won't have any tasks if the
>> - * shared runq is empty.
>> - */
>> - if (sched_feat(SHARED_RUNQ)) {
>> + if (sched_feat(SHARED_RUNQ))
>> sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>> - if (likely(sd))
>> - sd = sd->parent;
>> - }
>> + else
>> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>
>> if (!READ_ONCE(this_rq->rd->overload) ||
>> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>>
>> - if (sd)
>> + while (sd) {
>> update_next_balance(sd, &next_balance);
>> + sd = sd->child;
>> + }
>> +
>> rcu_read_unlock();
>>
>> goto out;
>> }
>> rcu_read_unlock();
>>
>> + t0 = sched_clock_cpu(this_cpu);
>> + if (sched_feat(SHARED_RUNQ)) {
>> + pulled_task = shared_runq_pick_next_task(this_rq, rf);
>> + if (pulled_task) {
>> + curr_cost = sched_clock_cpu(this_cpu) - t0;
>> + update_newidle_cost(sd, curr_cost);
>> + goto out_swq;
>> + }
>> + }
>> +
>> + /* Check again for pending wakeups */
>> + if (this_rq->ttwu_pending)
>> + return 0;
>> +
>> + t1 = sched_clock_cpu(this_cpu);
>> + curr_cost += t1 - t0;
>> +
>> + if (sd)
>> + update_newidle_cost(sd, curr_cost);
>> +
>> + /*
>> + * This is OK, because current is on_cpu, which avoids it being picked
>> + * for load-balance and preemption/IRQs are still disabled avoiding
>> + * further scheduler activity on it and we're being very careful to
>> + * re-start the picking loop.
>> + */
>> + rq_unpin_lock(this_rq, rf);
>> raw_spin_rq_unlock(this_rq);
>>
>> t0 = sched_clock_cpu(this_cpu);
>> --
>>
>> This hunk does a few things:
>>
>> 1. If a task is successfully pulled from the shared rq, or if the rq
>> lock had been released and re-acquired, jump to the
>> very end where we check a bunch of conditions and return
>> accordingly.
>>
>> 2. Move the shared rq picking to after the "rd->overload" and
>> "rq->avg_idle" checks.
>>
>> P.S. This recovered half the performance that was lost.
>
> Sorry, which performance are you referring to? I'm not thrilled about
> this part because it's another heuristic for users to have to reason
> about. _Maybe_ it makes sense to keep the rd->overload check? I don't
> think we should keep the rq->avg_idle check though unless it's
> absolutely necessary, and I'd have to think more about the rq->overload
> check.
"Performance" everywhere means tbench with 128 clients on the test
machine, since I was very focused on that alone. I understand why you
are "not thrilled" here. Let me go check which of those conditions is
more relevant.
(P.S. Anna-Maria did some experiments around avg_idle and tbench in
https://lore.kernel.org/lkml/80956e8f-761e-b74-1c7a-3966f9e8d934@linutronix.de/)
>
>> 3. Update the newidle_balance_cost via update_newidle_cost()
>> since that is also used to determine the previous bailout
>> threshold.
>
> I think this makes sense as well, but need to think about it more.
>
>> 4. A bunch of update_next_balance().
>
> I guess this should be OK, though I would expect us to have to load
> balance less with SHARED_RUNQ.
>
>> 5. Move rq_unpin_lock() below. I do not know the implications of
>> this; the kernel is not complaining so far (mind you, I'm on
>> x86 and I do not have lockdep enabled)
>
> If you move rq_unpin_lock(), you should probably run with lockdep to see
> what happens :-) There are also implications for tracking whether it's
> safe to look at the rq clock.
Yup! I traced the path in the code and it looked okay, but this is just
me doing naive experiments. Lockdep should set me straight :)
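For the record, the clock tracking bit you mention lives in the
pin/unpin helpers; from kernel/sched/sched.h (roughly):

static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
{
#ifdef CONFIG_SCHED_DEBUG
        if (rq->clock_update_flags > RQCF_ACT_SKIP)
                rf->clock_update_flags = RQCF_UPDATED;
#endif
        lockdep_unpin_lock(__rq_lockp(rq), rf->cookie);
}

static inline void rq_repin_lock(struct rq *rq, struct rq_flags *rf)
{
        lockdep_repin_lock(__rq_lockp(rq), rf->cookie);
#ifdef CONFIG_SCHED_DEBUG
        /*
         * Restore the value we stashed in @rf for this pin context.
         */
        rq->clock_update_flags |= rf->clock_update_flags;
#endif
}

so as long as the unpin/repin stay paired around the section where the
rq lock is dropped, the stashed clock_update_flags should be restored
before the rq_clock() asserts see them. Lockdep + SCHED_DEBUG should
complain if I got that wrong.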
>
>>
>> A combination of 3 and 4 seemed to give back the other half of
>> tbench performance.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12400,6 +12427,13 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>> update_next_balance(sd, &next_balance);
>>
>> + /*
>> + * Skip <= LLC domains as they likely won't have any tasks if the
>> + * shared runq is empty.
>> + */
>> + if (sched_feat(SHARED_RUNQ) && (sd->flags & SD_SHARE_PKG_RESOURCES))
>> + continue;
>> +
>
> Agreed on this.
>
>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
>> break;
>>
>> --
>>
>> This was based on my suggestion in the parallel thread.
>>
>> P.S. This alone, without the changes in the previous hunk, showed no
>> difference in performance, with results the same as vanilla v3.
>>
>> --
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d67d86d3bfdf..f1e64412fd48 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12429,6 +12463,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>
>> raw_spin_rq_lock(this_rq);
>>
>> +out_swq:
>> if (curr_cost > this_rq->max_idle_balance_cost)
>> this_rq->max_idle_balance_cost = curr_cost;
>>
>> --
>>
>> The last part of newidle_balance() does a bunch of accounting
>> which is relevant after the above changes. Also the
>> rq_repin_lock() I had removed now happens here.
>>
>> --
>>
>> Again, most of this is lightly tested with just one workload, but I would
>> like to hear your thoughts, especially on the significance of
>> "rd->overload", "max_newidle_lb_cost", and "update_next_balance()".
>> However, I'm afraid these may be the bits that led to the drop in
>> utilization you mentioned in the first place.
>
> Exactly :-( I think your proposal to fix how we skip load balancing on
> the LLC if SHARED_RUNQ is enabled makes sense, but I'd really prefer to
> avoid adding these heuristics to avoid contention for specific
> workloads. The goal of SHARED_RUNQ is really to drive up CPU util. I
> don't think we're really doing the user many favors if we try to guess
> for them when they actually want that to happen.
>
> If there's a time where we _really_ don't want or need to do it then
> absolutely, let's skip it. But I would really like to see this go in
> without checks on max_newidle_lb_cost, etc. The whole point of
> SHARED_RUNQ is that it should be less costly than iterating over O(n)
> cores to find tasks, so we _want_ to be more aggressive in doing so.
Agreed! I'll try to get a more palatable diff by the time you are
back from vacation.
>
>> Most of the experimentation (except for rq lock contention using IBS)
>> was done by reading the newidle_balance() code.
>
> And kudos for doing so!
>
>> Finally a look at newidle_balance counts (tip vs tip + v3 + diff) for
>> 128-clients of tbench on the test machine:
>>
>>
>> < ---------------------------------------- Category: newidle (SMT) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 921871, 0 (diff: -100.00%)
>> --
>> < ---------------------------------------- Category: newidle (MC) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 472412, 0 (diff: -100.00%)
>> --
>> < ---------------------------------------- Category: newidle (DIE) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 114, 279 (diff: +144.74%)
>> --
>> < ---------------------------------------- Category: newidle (NUMA) ---------------------------------------- >
>> load_balance cnt on cpu newly idle : 9, 9 (diff: +00.00%)
>> --
>>
>> Let me know if you have any queries. I'll go back and try to bisect the
>> diff to see if only a couple of changes that I thought were important
>> are good enough to yield back the lost performance. I'll do wider
>> testing after hearing your thoughts.
>
> Hopefully my feedback above should give you enough context to bisect and
> find the changes that we really think are most helpful? To reiterate: I
> definitely think your change to avoid iterating over the LLC sd is
> correct and makes sense. Others possibly do as well such as checking
> rd->overload (though not 100% sure), but others such as the
> max_newidle_lb_cost checks I would strongly prefer to avoid.
>
> Prateek -- thank you for doing all of this work, it's very much
> appreciated.
And thank you for patiently going through it all and clarifying all the
bits I have been missing :)
>
> As I mentioned on the other email, I'll be on vacation for about a week
> and a half starting tomorrow afternoon, so I may be slow to respond in
> that time.
Enjoy your vacation! I'll keep the thread updated for you to read when
you get back :)
>
> Thanks,
> David
--
Thanks and Regards,
Prateek