Message-ID: <a3710be8-c896-f997-c856-38966459edc2@amd.com>
Date:   Thu, 31 Aug 2023 09:17:55 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     David Vernet <void@...ifault.com>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        mingo@...hat.com, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, tj@...nel.org,
        roman.gushchin@...ux.dev, gautham.shenoy@....com,
        aaron.lu@...el.com, wuyun.abel@...edance.com, kernel-team@...a.com
Subject: Re: [PATCH v3 6/7] sched: Implement shared runqueue in CFS

Hello David,

On 8/31/2023 7:04 AM, David Vernet wrote:
> On Wed, Aug 30, 2023 at 12:16:17PM +0530, K Prateek Nayak wrote:
>> Hello David,
> 
> Hello Prateek,
> 
>>
>> On 8/10/2023 3:42 AM, David Vernet wrote:
>>> [..snip..]
>>> +	if (p && is_cpu_allowed(p, cpu_of(rq)))
>>> +		list_del_init(&p->shared_runq_node);
>>
>> I wonder if we should remove the task from the list even when
>> "is_cpu_allowed()" returns false.
>>
>> Consider the following scenario: a task that never sleeps is pinned
>> to a single CPU. Since it is now at the head of the list and cannot
>> be moved, we leave it there; but since the task never sleeps, it will
>> stay there, preventing the queue from doing its job.
> 
> Hmm, sorry, I may not be understanding your suggestion. If a task was
> pinned to a single CPU, it would be dequeued from the shared_runq before
> being pinned (see __set_cpus_allowed_ptr_locked()), and then would not
> be added back to the shard in shared_runq_enqueue_task() because of
> p->nr_cpus_allowed == 1. The task would also be dequeued from the shard
> before it started running (unless I'm misunderstanding what you mean by
> "a task that does not sleep"). Please let me know if I'm missing
> something.

Ah! My bad. Completely missed that. Thank you for clarifying.
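
For my own reference, this is roughly how I now read the path you
pointed at (only a sketch of my understanding, not the actual v3 hunk):

	static int __set_cpus_allowed_ptr_locked(struct task_struct *p, ...)
	{
		...
		/* Drop the task from its shard before the new mask applies */
		shared_runq_dequeue_task(p);

		__do_set_cpus_allowed(p, ctx);
		...
	}

Together with the "p->nr_cpus_allowed == 1" bail in
shared_runq_enqueue_task(), a freshly pinned task can never linger at
the head of the list.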

> 
>> Further implication ...  
>>
>>> +	else
>>> +		p = NULL;
>>> +	raw_spin_unlock(&shared_runq->lock);
>>> +
>>> +	return p;
>>> +}
>>> +
>>> +static void shared_runq_push_task(struct rq *rq, struct task_struct *p)
>>> +{
>>> +	struct shared_runq *shared_runq;
>>> +
>>> +	shared_runq = rq_shared_runq(rq);
>>> +	raw_spin_lock(&shared_runq->lock);
>>> +	list_add_tail(&p->shared_runq_node, &shared_runq->list);
>>> +	raw_spin_unlock(&shared_runq->lock);
>>> +}
>>>  
>>>  static void shared_runq_enqueue_task(struct rq *rq, struct task_struct *p)
>>> -{}
>>> +{
>>> +	/*
>>> +	 * Only enqueue the task in the shared runqueue if:
>>> +	 *
>>> +	 * - SHARED_RUNQ is enabled
>>> +	 * - The task isn't pinned to a specific CPU
>>> +	 */
>>> +	if (p->nr_cpus_allowed == 1)
>>> +		return;
>>> +
>>> +	shared_runq_push_task(rq, p);
>>> +}
>>>  
>>>  static int shared_runq_pick_next_task(struct rq *rq, struct rq_flags *rf)
>>>  {
>>> -	return 0;
>>> +	struct task_struct *p = NULL;
>>> +	struct rq *src_rq;
>>> +	struct rq_flags src_rf;
>>> +	int ret = -1;
>>> +
>>> +	p = shared_runq_pop_task(rq);
>>> +	if (!p)
>>> +		return 0;
>>
>> ...
>>
>> Since we return 0 here in such a scenario, we'll take the old
>> newidle_balance() path but ...
>>
>>> +
>>> +	rq_unpin_lock(rq, rf);
>>> +	raw_spin_rq_unlock(rq);
>>> +
>>> +	src_rq = task_rq_lock(p, &src_rf);
>>> +
>>> +	if (task_on_rq_queued(p) && !task_on_cpu(src_rq, p)) {
>>> +		update_rq_clock(src_rq);
>>> +		src_rq = move_queued_task(src_rq, &src_rf, p, cpu_of(rq));
>>> +		ret = 1;
>>> +	}
>>> +
>>> +	if (src_rq != rq) {
>>> +		task_rq_unlock(src_rq, p, &src_rf);
>>> +		raw_spin_rq_lock(rq);
>>> +	} else {
>>> +		rq_unpin_lock(rq, &src_rf);
>>> +		raw_spin_unlock_irqrestore(&p->pi_lock, src_rf.flags);
>>> +	}
>>> +	rq_repin_lock(rq, rf);
>>> +
>>> +	return ret;
>>>  }
>>>  
>>>  static void shared_runq_dequeue_task(struct task_struct *p)
>>> -{}
>>> +{
>>> +	struct shared_runq *shared_runq;
>>> +
>>> +	if (!list_empty(&p->shared_runq_node)) {
>>> +		shared_runq = rq_shared_runq(task_rq(p));
>>> +		raw_spin_lock(&shared_runq->lock);
>>> +		/*
>>> +		 * Need to double-check for the list being empty to avoid
>>> +		 * racing with the list being drained on the domain recreation
>>> +		 * or SHARED_RUNQ feature enable / disable path.
>>> +		 */
>>> +		if (likely(!list_empty(&p->shared_runq_node)))
>>> +			list_del_init(&p->shared_runq_node);
>>> +		raw_spin_unlock(&shared_runq->lock);
>>> +	}
>>> +}
>>>  
>>>  /*
>>>   * For asym packing, by default the lower numbered CPU has higher priority.
>>> @@ -12093,6 +12308,16 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>  	rcu_read_lock();
>>>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>>  
>>> +	/*
>>> +	 * Skip <= LLC domains as they likely won't have any tasks if the
>>> +	 * shared runq is empty.
>>> +	 */
>>
>> ... now we skip all the way past the MC domain, overlooking any
>> imbalance that might still exist within the SMT and MC groups,
>> since the shared runq is not exactly empty.
>>
>> Let me know if I've got something wrong!
> 
> Yep, I mentioned this to Gautham as well in [0].
> 
> [0]: https://lore.kernel.org/all/20230818050355.GA5718@maniforge/
> 
> I agree that I think we should remove this heuristic from v4. Either
> that, or add logic to iterate over the shared_runq until a viable task
> is found.
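
If you decide to go with the second option, something like the below
is what I would imagine (an untested sketch on top of the v3 helpers;
the walk simply skips over tasks that cannot run here instead of
bailing out at the head):

	static struct task_struct *shared_runq_pop_task(struct rq *rq)
	{
		struct task_struct *p, *ret = NULL;
		struct shared_runq *shared_runq = rq_shared_runq(rq);

		raw_spin_lock(&shared_runq->lock);
		list_for_each_entry(p, &shared_runq->list, shared_runq_node) {
			/* Skip tasks this CPU is not allowed to run */
			if (!is_cpu_allowed(p, cpu_of(rq)))
				continue;

			list_del_init(&p->shared_runq_node);
			ret = p;
			break;
		}
		raw_spin_unlock(&shared_runq->lock);

		return ret;
	}

The flip side is that a run of unrunnable tasks at the head turns the
pop into an O(n) walk under the shard lock, so dropping the heuristic
may well be the simpler route.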
> 
>>
>>> +	if (sched_feat(SHARED_RUNQ)) {
>>> +		sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>> +		if (likely(sd))
>>> +			sd = sd->parent;
>>> +	}
>>
>> Speaking of skipping ahead of the MC domain, I don't think this
>> actually works, since the domain traversal uses the "for_each_domain"
>> macro, which is defined as:
> 
> *blinks*
> 
> Uhhh, yeah, wow. Good catch!
> 
>> #define for_each_domain(cpu, __sd) \
>> 	for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \
>> 			__sd; __sd = __sd->parent)
>>
>> The traversal starts from rq->sd, overwriting the value you
>> initialized here. This is why the "load_balance count on cpu newly
>> idle" in Gautham's first report
>> (https://lore.kernel.org/lkml/ZN3dW5Gvcb0LFWjs@BLR-5CG11610CF.amd.com/)
>> is non-zero.
>>
>> One way to do this would be as follows:
>>
>> static int newidle_balance() {
>>
>> 	...
>> 	for_each_domain(this_cpu, sd) {
>>
>> 		...
>> 		/* Skip balancing up to and including the LLC domain */
>> 		if (sched_feat(SHARED_RUNQ) &&
>> 		    (sd->flags & SD_SHARE_PKG_RESOURCES))
>> 			continue;
>>
>> 		...
>> 	}
>> 	...
>> }
> 
> Yep, I think this makes sense to do.
> 
>> With this I see the newidle balance counts for the SMT and MC
>> domains drop to zero:
> 
> And indeed, I think this was the intention. Thanks again for catching
> this. I'm excited to try this out when running benchmarks for v4.
> 
>> < ----------------------------------------  Category:  newidle (SMT)  ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
>> --
>> < ----------------------------------------  Category:  newidle (MC)   ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :          0    $      0.000 $    [    0.00000 ]
>> --
>> < ----------------------------------------  Category:  newidle (DIE)  ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :       2170    $      9.319 $    [   17.42832 ]
>> --
>> < ----------------------------------------  Category:  newidle (NUMA) ---------------------------------------- >
>> load_balance cnt on cpu newly idle                         :         30    $    674.067 $    [    0.24094 ]
>> --
>>
>> Let me know if I'm missing something here :)
> 
> No, I think you're correct, we should be doing this. Assuming we want to
> keep this heuristic, I think the block above is also correct so that we
> properly account sd->max_newidle_lb_cost and rq->next_balance. Does that
> make sense to you too?

Yup, that makes sense!
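
Just so we are on the same page, this is roughly how I picture it
fitting together (a sketch only, going by the current newidle_balance()
layout): the pre-loop block keeps sd pointed at the LLC's parent so the
early avg_idle bailout and the rq->next_balance update see the right
domain, while the continue inside the loop sits after
update_next_balance() so the skipped domains still contribute to
next_balance:

	for_each_domain(this_cpu, sd) {
		...
		update_next_balance(sd, &next_balance);

		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;

		/* Balancing up to and including the LLC is SHARED_RUNQ's job */
		if (sched_feat(SHARED_RUNQ) &&
		    (sd->flags & SD_SHARE_PKG_RESOURCES))
			continue;

		if (sd->flags & SD_BALANCE_NEWIDLE) {
			...
		}
		...
	}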

> 
>>
>> Note: The lb counts for DIE and NUMA are down since I'm currently
>> experimenting with the implementation. I'll update the thread with
>> any new findings.
> 
> Ack, thank you for doing that.
> 
> Just FYI, I'll be on vacation for about 1.5 weeks starting tomorrow
> afternoon. If I'm slow to respond, that's why.

Enjoy your vacation :)

I'll keep experimenting meanwhile and update the report in the
parallel thread.

> 
> Thanks,
> David

--
Thanks and Regards,
Prateek
