linux-kernel - Re: [PATCH 11/15] sched,fair: flatten hierarchical runqueues

Message-ID: <967114b2-15a7-b445-3133-074732b20e34@arm.com>
Date:   Fri, 23 Aug 2019 20:14:41 +0200
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Rik van Riel <riel@...riel.com>, linux-kernel@...r.kernel.org
Cc:     kernel-team@...com, pjt@...gle.com, peterz@...radead.org,
        mingo@...hat.com, morten.rasmussen@....com, tglx@...utronix.de,
        mgorman@...hsingularity.net, vincent.guittot@...aro.org
Subject: Re: [PATCH 11/15] sched,fair: flatten hierarchical runqueues

On 22/08/2019 04:17, Rik van Riel wrote:
> Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.
> 
> Iteration of the sched_entity hierarchy is rate limited to once per jiffy
> per sched_entity, which is a smaller change than it seems, because load
> average adjustments were already rate limited to once per jiffy before this
> patch series.
> 
> This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
> from throttled cgroups onto their cgroup runqueues, and slowly (using the
> GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
> gets unthrottled, to prevent thundering herd issues.
> 
> Signed-off-by: Rik van Riel <riel@...riel.com>
> 
> Header from folded patch 'fix-attach-detach_enticy_cfs_rq.patch~':
> 
> Subject: sched,fair: fix attach/detach_entity_cfs_rq
> 
> While attach_entity_cfs_rq and detach_entity_cfs_rq should iterate over
> the hierarchy, they do not need to do that twice.
> 
> Passing flags into propagate_entity_cfs_rq allows us to reuse that same
> loop from other functions.
> 
> Signed-off-by: Rik van Riel <riel@...riel.com>
> 
> 
> Header from folded patch 'enqueue-order.patch':
> 
> Subject: sched,fair: better ordering at enqueue_task_fair time
> 
> In order to get useful numbers for the task's hierarchical weight,
> task priority, etc., things need to be done in a certain order at task
> enqueue time.
> 
> Specifically:
> 1) static load/weight to "local" cfs_rq
> 2) propagate load/weight up the tree
> 3) add runnable load avg to root cfs_rq
> 
> The reason is that each step depends on the things done by the
> step beforehand, and we can end up with nonsense numbers if we
> do not do things right.
> 
> Also, make sure that we walk all the way up the hierarchy at
> enqueue_task_fair time in order to get the benefit from the ramp-up
> logic in update_cfs_group.

[...]

>  /*
> @@ -6953,7 +6849,6 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
>  	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
>  		return;
>  
> -	find_matching_se(&se, &pse);
>  	update_curr(cfs_rq_of(se));
>  	BUG_ON(!pse);
>  	if (wakeup_preempt_entity(se, pse) == 1) {
> @@ -6994,100 +6889,18 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  	struct task_struct *p;
>  	int new_tasks;
>  
> +	put_prev_task(rq, prev);
>  again:
>  	if (!cfs_rq->nr_running)
>  		goto idle;
>  
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -	if (prev->sched_class != &fair_sched_class)
> -		goto simple;
> -
> -	/*
> -	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
> -	 * likely that a next task is from the same cgroup as the current.
> -	 *
> -	 * Therefore attempt to avoid putting and setting the entire cgroup
> -	 * hierarchy, only change the part that actually changes.
> -	 */
> -
> -	do {
> -		struct sched_entity *curr = cfs_rq->curr;
> -
> -		/*
> -		 * Since we got here without doing put_prev_entity() we also
> -		 * have to consider cfs_rq->curr. If it is still a runnable
> -		 * entity, update_curr() will update its vruntime, otherwise
> -		 * forget we've ever seen it.
> -		 */
> -		if (curr) {
> -			if (curr->on_rq)
> -				update_curr(cfs_rq);
> -			else
> -				curr = NULL;
> -
> -			/*
> -			 * This call to check_cfs_rq_runtime() will do the
> -			 * throttle and dequeue its entity in the parent(s).
> -			 * Therefore the nr_running test will indeed
> -			 * be correct.
> -			 */
> -			if (unlikely(check_cfs_rq_runtime(cfs_rq))) {
> -				cfs_rq = &rq->cfs;
> -
> -				if (!cfs_rq->nr_running)
> -					goto idle;
> -
> -				goto simple;
> -			}
> -		}
> -
> -		se = pick_next_entity(cfs_rq, curr);
> -		cfs_rq = group_cfs_rq(se);
> -	} while (cfs_rq);
> -
> -	p = task_of(se);
> -
> -	/*
> -	 * Since we haven't yet done put_prev_entity and if the selected task
> -	 * is a different task than we started out with, try and touch the
> -	 * least amount of cfs_rqs.
> -	 */
> -	if (prev != p) {
> -		struct sched_entity *pse = &prev->se;
> -
> -		while (!(cfs_rq = is_same_group(se, pse))) {
> -			int se_depth = se->depth;
> -			int pse_depth = pse->depth;
> -
> -			if (se_depth <= pse_depth) {
> -				put_prev_entity(cfs_rq_of(pse), pse);
> -				pse = parent_entity(pse);
> -			}
> -			if (se_depth >= pse_depth) {
> -				set_next_entity(cfs_rq_of(se), se);
> -				se = parent_entity(se);
> -			}

With the se->depth related code gone from pick_next_task_fair() here, and
the call to find_matching_se() removed from check_preempt_wakeup(), it
looks like you could remove se->depth entirely.
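
For reference, here is a minimal user-space sketch of the kind of
depth-based walk that find_matching_se() performs, to illustrate what
se->depth is used for. The toy_se struct, its group field and the toy_
names are simplified assumptions made up for this illustration; they are
not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct sched_entity. */
struct toy_se {
	struct toy_se *parent;	/* NULL for a top-level entity */
	int depth;		/* 0 at top level, parent->depth + 1 below */
	int group;		/* id of the runqueue this entity is queued on */
};

/*
 * Walk both entities up the hierarchy until they are queued on the same
 * runqueue. The depth field decides which side has to be advanced first.
 * Assumption: all top-level entities share one root runqueue (same group
 * id), so the walk always terminates before running off the top.
 */
static void toy_find_matching_se(struct toy_se **se, struct toy_se **pse)
{
	/* Bring the deeper entity up to the depth of the shallower one. */
	while ((*se)->depth > (*pse)->depth)
		*se = (*se)->parent;
	while ((*pse)->depth > (*se)->depth)
		*pse = (*pse)->parent;

	/* Same depth now: step up in lockstep until the runqueues match. */
	while ((*se)->group != (*pse)->group) {
		assert((*se)->parent && (*pse)->parent);
		*se = (*se)->parent;
		*pse = (*pse)->parent;
	}
}
```

If every task sits directly on the per-CPU rq.cfs runqueue, as this patch
intends, a walk like this should have nothing left to do, which is why the
depth field looks redundant.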

[...]