linux-kernel - Re: [RFD] sched/deadline: Support single CPU affinity

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161110122100.GA30974@sisyphus.home.austad.us>
Date:   Thu, 10 Nov 2016 13:21:00 +0100
From:   Henrik Austad <henrik@...tad.us>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Juri Lelli <juri.lelli@...il.com>,
        Luca Abeni <luca.abeni@...tn.it>,
        Steven Rostedt <rostedt@...dmis.org>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Tommaso Cucinotta <tommaso.cucinotta@...up.it>,
        Daniel Bistrot de Oliveira <danielbristot@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFD] sched/deadline: Support single CPU affinity

On Thu, Nov 10, 2016 at 09:08:07AM +0100, Peter Zijlstra wrote:
> 
> 
> Add support for single CPU affinity to SCHED_DEADLINE; the supposed reason for
> wanting single CPU affinity is better QoS than provided by G-EDF.
> 
> Therefore the aim is to provide harder guarantees, similar to UP, for single
> CPU affine tasks. This then leads to a mixed criticality scheduling
> requirement for the CPU scheduler. G-EDF like for the non-affine (global)
> tasks and UP like for the single CPU tasks.
> 
> 
> 
> ADMISSION CONTROL
> 
> Do simple UP admission control on the CPU local tasks, and subtract the
> admitted bandwidth from the global total when doing global admission control.
> 
>   single cpu:	U[n] := \Sum tl_u,n <= 1
>   global:	\Sum tg_u <= N - \Sum U[n]
> 
> 
> 
> MIXED CRITICALITY SCHEDULING
> 
> Since we want to provide better guarantees for single CPU affine tasks than
> the G-EDF scheduler provides for the single CPU tasks, we need to somehow
> alter the scheduling algorithm.
> 
> The trivial layered EDF/G-EDF approach is obviously flawed in that it will
> result in many unnecessary deadline misses. The trivial example is having a
> single CPU task with a deadline after a runnable global task. By always
> running single CPU tasks over global tasks we can make the global task miss
> its deadline even though we could easily have ran both within the alloted
> time.
> 
> Therefore we must use a more complicated scheme. By adding a second measure
> present in the sporadic task model to the scheduling function we can try and
> distinguish between the constraints of handling the two cases in a single
> scheduler.
> 
> We define the time to fail as:
> 
>   ttf(t) := t_d - t_b; where
> 
> 	t_d is t's absolute deadline
> 	t_b is t's remaining budget
> 
> This is the last possible moment we must schedule this task such that it can
> complete its work and not miss its deadline.

To elaborate a bit on this (this is a modified LLF approach if my memory 
serves):

You have the dynamic time-to-failure (TtF), i.e. as the task progresses 
(scheduled to run), the relative time-to-failure will remain constant. This 
can be used to compare thasks to a running task and should minimize the 
number of calculations required.

Then you have the static Time-of-failure (ToF, which is the absoulte time 
when a task will no longer be able to meet its deadline. This is what you 
use for keeping a sorted list of tasks in the runqueue. As this is a fixed 
point in time, you do not have to dynamically update or do crazy 
calculation when inserting/removing threads from the rq.

> If we then augment the regular EDF rules by, for local tasks, considering the
> time to fail and let this measure override the regular EDF pick when the
> time to fail can be overran by the EDF pick.

Then, if you do this - do you need to constrict this to a local CPU? I 
*think* you could do this in a global scheduler if you use ToF/TtF for all 
deadline-tasks, I think you should be able to meet deadlines.

I had a rant about this way back [1,2 Sec 11.4], I need to sit down and 
re-read most of it, it has been a few too many years, but the idea was to 
minimize the number of context-switches (which LLF is prone to get a lot 
of) as well as minimize the computational overhead by avoiding 
re-calculating time-of-failre/time-to-failre a lot.

> That is, when:
> 
> 	now + left_b > min(ttf)

Why not just  use ttf/tof for all deadline-tasks? We have all the 
information available anyway and it would probably make the internal logic 
easier?

> Use augmented RB-tree to store absolute deadlines of all rq local tasks and
> keep the heap sorted on the earliest time to fail of a locally affine task.
> 
> TODO
> 
>  - finish patch, this only sketches the outlines
>  - formal analysis of the proposed scheduling function; this is only a hunch.

I think you are on the right track, but then again, you agree with some of 
the stuff I messed around with a while a go, so no wonder I think you're 
right :)

1) https://lkml.org/lkml/2009/7/10/380
2) https://brage.bibsys.no/xmlui/bitstream/handle/11250/259744/347756_FULLTEXT01.pdf

-Henrik

> ---
>  include/linux/sched.h   |   1 +
>  kernel/sched/core.c     |  75 ++++++++++++++++++-------
>  kernel/sched/deadline.c | 142 ++++++++++++++++++++++++++++++++++++++++++++----
>  kernel/sched/sched.h    |  12 ++--
>  4 files changed, 191 insertions(+), 39 deletions(-)
>
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3762fe4e3a80..32f948615d4c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,7 @@ struct sched_rt_entity {
>  
>  struct sched_dl_entity {
>  	struct rb_node	rb_node;
> +	u64 __subtree_ttf;

Didn't you say augmented rb-tree?

>  	/*
>  	 * Original scheduling parameters. Copied here from sched_attr
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bee18baa603a..46995c060a89 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2469,6 +2469,8 @@ unsigned long to_ratio(u64 period, u64 runtime)
>  }
>  
>  #ifdef CONFIG_SMP
> +DEFINE_PER_CPU(u64, dl_cpu_bw);
> +
>  inline struct dl_bw *dl_bw_of(int i)
>  {
>  	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> @@ -2476,17 +2478,39 @@ inline struct dl_bw *dl_bw_of(int i)
>  	return &cpu_rq(i)->rd->dl_bw;
>  }
>  
> -static inline int dl_bw_cpus(int i)
> +static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64 old_bw,
> +					      int new_cpu, u64 new_bw)
>  {
> -	struct root_domain *rd = cpu_rq(i)->rd;
> -	int cpus = 0;
> +	struct root_domain *rd = container_of(dl_b, struct root_domain, dl_bw);
> +	u64 total, avail, used;
> +	int i;
>  
> -	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
> -			 "sched RCU must be held");
> -	for_each_cpu_and(i, rd->span, cpu_active_mask)
> -		cpus++;
> +	if (dl_b->bw == -1) /* admission control disabled */
> +		return false;
> +
> +	avail = 0;
> +	for_each_cpu_and(i, rd->span, cpu_active_mask) {
> +		total = dl_b->bw;
> +		used = per_cpu(dl_cpu_bw, i);
> +
> +		if (old_cpu == i)
> +			used -= old_bw;
> +		if (new_cpu == i)
> +			used += new_bw;
> +
> +		if (used > total)
> +			return true;

What happens in this case? Wouldn't it make sense to test to see if the 
new_cpu has the capacity to swap tasks _before_ updating the value?

> +		avail += total - used;
> +	}

So some scenarios:
1) Global task to affiniy of a single core:
The total bandwidth is ok, so we only have to check if the local CPU can 
accomodate the task. Then we must decrement the global dl_bw pool.

2) Local task to global: we only need to check if the global pool has room 
for the task, then we update it and decrease the sued bw for the local CPU

3) Local to local (move cpu). We need only check used for the new CPU, and 
if that is ok, the move is straightforward.

I am probably missing out on a whole lot of details here, but I cannot help 
but feel that this is a bit too complex for the task at hand.

> +	total = dl_b->total_bw;
> +	if (old_cpu == -1)
> +		total -= old_bw;
> +	if (new_cpu == -1)
> +		total += new_bw;

Why not keep the globally used and available bw in the root-domain? that 
would avoid iterating over all the active CPUs every time you do a 
bw-change.

CPU hotplug would probably be interseting though..

> -	return cpus;
> +	return avail < total;
>  }
>  #else
>  inline struct dl_bw *dl_bw_of(int i)
> @@ -2494,12 +2518,21 @@ inline struct dl_bw *dl_bw_of(int i)
>  	return &cpu_rq(i)->dl.dl_bw;
>  }
>  
> -static inline int dl_bw_cpus(int i)
> +static bool __dl_overflow(struct dl_bw *dl_b, int old_cpu, u64 old_bw,
> +					      int new_cpu, u64 new_bw)
>  {
> -	return 1;
> +	u64 avail;
> +
> +	if (dl_b->bw == -1) /* admission control disabled */
> +		return false;
> +
> +	avail = dl_b->bw;
> +
> +	return avail < dl_b->total_bw - old_bw + new_bw;
>  }
>  #endif
>  
> +
>  /*
>   * We must be sure that accepting a new task (or allowing changing the
>   * parameters of an existing one) is consistent with the bandwidth
> @@ -2519,7 +2552,7 @@ static int dl_overflow(struct task_struct *p, int policy,
>  	u64 period = attr->sched_period ?: attr->sched_deadline;
>  	u64 runtime = attr->sched_runtime;
>  	u64 new_bw = dl_policy(policy) ? to_ratio(period, runtime) : 0;
> -	int cpus, err = -1;
> +	int err = -1;
>  
>  	/* !deadline task may carry old deadline bandwidth */
>  	if (new_bw == p->dl.dl_bw && task_has_dl_policy(p))
> @@ -2531,18 +2564,22 @@ static int dl_overflow(struct task_struct *p, int policy,
>  	 * allocated bandwidth of the container.
>  	 */
>  	raw_spin_lock(&dl_b->lock);
> -	cpus = dl_bw_cpus(task_cpu(p));
> +	/* new DL task */
>  	if (dl_policy(policy) && !task_has_dl_policy(p) &&
> -	    !__dl_overflow(dl_b, cpus, 0, new_bw)) {
> +	    !__dl_overflow(dl_b, -1, 0, -1, new_bw)) {
>  		__dl_add(dl_b, new_bw);
>  		err = 0;
> +
> +	/* changed DL task */
>  	} else if (dl_policy(policy) && task_has_dl_policy(p) &&
> -		   !__dl_overflow(dl_b, cpus, p->dl.dl_bw, new_bw)) {
> -		__dl_clear(dl_b, p->dl.dl_bw);
> +		   !__dl_overflow(dl_b, -1, p->dl.dl_bw, -1, new_bw)) {
> +		__dl_sub(dl_b, p->dl.dl_bw);
>  		__dl_add(dl_b, new_bw);
>  		err = 0;
> +
> +	/* leaves DL */
>  	} else if (!dl_policy(policy) && task_has_dl_policy(p)) {
> -		__dl_clear(dl_b, p->dl.dl_bw);
> +		__dl_sub(dl_b, p->dl.dl_bw);
>  		err = 0;
>  	}
>  	raw_spin_unlock(&dl_b->lock);
> @@ -5390,8 +5427,7 @@ int task_can_attach(struct task_struct *p,
>  		rcu_read_lock_sched();
>  		dl_b = dl_bw_of(dest_cpu);
>  		raw_spin_lock_irqsave(&dl_b->lock, flags);
> -		cpus = dl_bw_cpus(dest_cpu);
> -		overflow = __dl_overflow(dl_b, cpus, 0, p->dl.dl_bw);
> +		overflow = __dl_overflow(dl_b, -1, 0, -1, p->dl.dl_bw);
>  		if (overflow)
>  			ret = -EBUSY;
>  		else {
> @@ -7309,8 +7345,7 @@ static int cpuset_cpu_inactive(unsigned int cpu)
>  		dl_b = dl_bw_of(cpu);
>  
>  		raw_spin_lock_irqsave(&dl_b->lock, flags);
> -		cpus = dl_bw_cpus(cpu);
> -		overflow = __dl_overflow(dl_b, cpus, 0, 0);
> +		overflow = __dl_overflow(dl_b, -1, 0, -1, 0);
>  		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
>  
>  		rcu_read_unlock_sched();
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 37e2449186c4..4eb65ea9c100 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -778,6 +778,11 @@ static void update_curr_dl(struct rq *rq)
>  	}
>  }
>  
> +static inline struct sched_dl_entity *dl_of(struct rb_node *node)
> +{
> +	return rb_entry(node, struct sched_dl_entity, rb_node);
> +}
> +
>  #ifdef CONFIG_SMP
>  
>  static void inc_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
> @@ -805,9 +810,8 @@ static void dec_dl_deadline(struct dl_rq *dl_rq, u64 deadline)
>  		cpudl_clear(&rq->rd->cpudl, rq->cpu);
>  	} else {
>  		struct rb_node *leftmost = dl_rq->rb_leftmost;
> -		struct sched_dl_entity *entry;
> +		struct sched_dl_entity *entry = dl_of(leftmost);
>  
> -		entry = rb_entry(leftmost, struct sched_dl_entity, rb_node);

split this out perhaps as a general cleanup?

>  		dl_rq->earliest_dl.curr = entry->deadline;
>  		cpudl_set(&rq->rd->cpudl, rq->cpu, entry->deadline);
>  	}
> @@ -848,6 +852,69 @@ void dec_dl_tasks(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>  	dec_dl_migration(dl_se, dl_rq);
>  }
>  
> +static inline u64 entity_ttf(struct sched_dl_entity *dl_se)
> +{
> +	u64 ttf = U64_MAX;
> +
> +	if (dl_se->flags & DL_AFFINE)
> +		ttf = dl_se->deadline - dl_se->runtime;
> +
> +	return ttf;
> +}
> +
> +static inline u64
> +compute_subtree_ttf(struct sched_dl_entry *dl_se)
> +{
> +	u64 subtree_ttd, ttf = entity_ttf(dl_se);
> +
> +	if (dl_se->rb_node.rb_left) {
> +		subtree_ttf = dl_of(dl_se->rb_node.rb_left)->__subtree_ttf;
> +		ttf = min(ttf, subtree_ttf);
> +	}
> +	if (dl_se->rb_node.rb_right) {
> +		subtree_ttf = dl_of(dl_se->rb_node.rb_right)->__subtree_ttf;
> +		ttf = min(ttf, subtree_ttf);
> +	}
> +
> +	return ttf;
> +}
> +
> +static void ttf_propagate(struct rb_node *rb, struct rb_node *stop)
> +{
> +	while (rb != stop) {
> +		struct sched_dl_entity *dl_se = dl_of(rb);
> +		u64 subtree_ttf = compute_subtree_ttf(dl_se);
> +
> +		if (dl_se->__subtree_ttf == subtree_ttf)
> +			break;
> +
> +		rb = rb_parent(rb);
> +	}
> +}
> +
> +static void ttf_copy(struct rb_node *rb_old, struct rb_node *rb_new)
> +{
> +	struct sched_dl_entity *dl_old = dl_of(rb_old);
> +	struct sched_dl_entity *dl_new = dl_of(rb_new);
> +
> +	new->__subtree_ttf = old->__subtree_ttf;
> +}
> +
> +static void ttf_rotate(struct rb_node *rb_old, struct rb_node *rb_new)
> +{
> +	struct sched_dl_entity *dl_old = dl_of(rb_old);
> +	struct sched_dl_entity *dl_new = dl_of(rb_new);
> +
> +	new->__subtree_ttf = old->__subtree_ttf;
> +	old->__subtree_ttf = compute_subtree_ttf(old);
> +}
> +
> +static const struct rb_augment_callbacks ttf_callbacks = {
> +	.propagate	= ttf_propagate,
> +	.copy		= ttf_copy,
> +	.rotate		= ttf_rotate,
> +};
> +
>  static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  {
>  	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> @@ -855,15 +922,16 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  	struct rb_node *parent = NULL;
>  	struct sched_dl_entity *entry;
>  	int leftmost = 1;
> +	u64 ttf;
>  
>  	BUG_ON(!RB_EMPTY_NODE(&dl_se->rb_node));
>  
>  	while (*link) {
>  		parent = *link;
> -		entry = rb_entry(parent, struct sched_dl_entity, rb_node);
> -		if (dl_time_before(dl_se->deadline, entry->deadline))
> +		entry = dl_of(parent);
> +		if (dl_time_before(dl_se->deadline, entry->deadline)) {
>  			link = &parent->rb_left;
> -		else {
> +		} else {
>  			link = &parent->rb_right;
>  			leftmost = 0;
>  		}
> @@ -872,8 +940,12 @@ static void __enqueue_dl_entity(struct sched_dl_entity *dl_se)
>  	if (leftmost)
>  		dl_rq->rb_leftmost = &dl_se->rb_node;
>  
> +	dl_se->__subtree_ttf = ttf = entity_ttf(dl_se);
>  	rb_link_node(&dl_se->rb_node, parent, link);
> -	rb_insert_color(&dl_se->rb_node, &dl_rq->rb_root);
> +	rb_insert_augmented(&dl_se->rb_node, &dl_rq->rb_root, &ttf_callbacks);
> +
> +	if (dl_of(dl_rq->rb_root.rb_node)->__subtree_ttf == ttf)
> +		dl_rb->rb_least_ttf = dl_rq->rb_root.rb_node;
>  
>  	inc_dl_tasks(dl_se, dl_rq);
>  }
> @@ -892,9 +964,41 @@ static void __dequeue_dl_entity(struct sched_dl_entity *dl_se)
>  		dl_rq->rb_leftmost = next_node;
>  	}
>  
> -	rb_erase(&dl_se->rb_node, &dl_rq->rb_root);
> +	rb_erase_augmented(&dl_se->rb_node, &dl_rq->rb_root, &ttf_callbacks);
>  	RB_CLEAR_NODE(&dl_se->rb_node);
>  
> +	if (dl_rq->rb_least_ttf == &dl_se->rb_node) {
> +		struct rb_node **link = &dl_rq->rb_root.rb_node;
> +
> +		while (*link) {
> +			struct rb_node *node = *link;
> +			struct sched_dl_entity *entry = dl_of(node);
> +			u64 subtree_ttf = entry->__subtree_ttf;
> +
> +			if (subtree_ttf = U64_MAX) {
> +				dl_rq->rb_least_ttf = NULL;
> +				goto no_least_ttf;
> +			}
> +
> +			if (node->rb_left &&
> +			    dl_of(node->rb_left)->__subtree_ttf == subtree_ttf) {
> +				link = &node->rb_left;
> +				continue;
> +			} 
> +
> +			if (node->rb_right &&
> +			    dl_of(node->rb_right)->__subtree_ttf == subtree_ttf) {
> +				link = &node->rb_right;
> +				continue;
> +			}
> +
> +			break;
> +		}
> +
> +		dl_rq->rb_least_ttf = *link;
> +no_least_ttf:
> +	}
> +
>  	dec_dl_tasks(dl_se, dl_rq);
>  }
>  
> @@ -1109,12 +1213,28 @@ static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
>  static struct sched_dl_entity *pick_next_dl_entity(struct rq *rq,
>  						   struct dl_rq *dl_rq)
>  {
> -	struct rb_node *left = dl_rq->rb_leftmost;
> +	struct rb_node *rb_left = dl_rq->rb_leftmost;
> +	struct sched_dl_entity *left, *least;
> +	u64 now;
>  
> -	if (!left)
> +	if (!rb_left)
>  		return NULL;
>  
> -	return rb_entry(left, struct sched_dl_entity, rb_node);
> +	left = dl_of(rb_left);
> +
> +	/*
> +	 * Only local tasks have TTF set, if one is present and it would miss
> +	 * its deadline because of the (G)EDF task selection, override that
> +	 * and select the local task.
> +	 */
> +	if (dl_rq->rb_least_ttf) {
> +		least = dl_of(dl_rq->rb_least_ttf);
> +		now = rq_clock_task(rq);
> +		if (now + left->runtime > entity_ttf(least))
> +			return least;
> +	}

What happens if TtF for the currently enqueued task is 0 or very small?

(this is the problem with unmodified LLF, as you risk getting a storm of 
sched_switches when 2 or a few threads have very small relative 
time-to-failure, and the reason why the approach I mentioned up at the 
beginning makes this a bit easier to reason about).

You really, really want to minimize the number of context-switches

In general, once a task's relative time to failure is smaller than the 
smallest TtF of all the running tasks, you should switch it in with the 
task with the largest TtF. Since you are doing a partitioned approach, this 
global approach won't work so well, so to avoid the storm, you need to set 
a threshold and once you've swapped, you should let that task run until 
completion and let the other task fail before swapping.

> +	return left;
>  }
>  
>  struct task_struct *
> @@ -1645,7 +1765,7 @@ static void set_cpus_allowed_dl(struct task_struct *p,
>  		 * until we complete the update.
>  		 */
>  		raw_spin_lock(&src_dl_b->lock);
> -		__dl_clear(src_dl_b, p->dl.dl_bw);
> +		__dl_sub(src_dl_b, p->dl.dl_bw);
>  		raw_spin_unlock(&src_dl_b->lock);
>  	}
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 055f935d4421..55059d2568bc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -198,13 +198,15 @@ static inline int dl_bandwidth_enabled(void)
>  
>  extern struct dl_bw *dl_bw_of(int i);
>  
> +extern DECLARE_PER_CPU(u64, dl_cpu_bw);
> +
>  struct dl_bw {
>  	raw_spinlock_t lock;
>  	u64 bw, total_bw;
>  };
>  
>  static inline
> -void __dl_clear(struct dl_bw *dl_b, u64 tsk_bw)
> +void __dl_sub(struct dl_bw *dl_b, u64 tsk_bw)
>  {
>  	dl_b->total_bw -= tsk_bw;
>  }
> @@ -215,13 +217,6 @@ void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
>  	dl_b->total_bw += tsk_bw;
>  }
>  
> -static inline
> -bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw)
> -{
> -	return dl_b->bw != -1 &&
> -	       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
> -}
> -
>  extern struct mutex sched_domains_mutex;
>  
>  #ifdef CONFIG_CGROUP_SCHED
> @@ -507,6 +502,7 @@ struct dl_rq {
>  	/* runqueue is an rbtree, ordered by deadline */
>  	struct rb_root rb_root;
>  	struct rb_node *rb_leftmost;
> +	struct rb_node *rb_least_ttf;
>  
>  	unsigned long dl_nr_running;

Some nitpick, and probably a lot of misunderstandings from my side, but all 
in all, I'm enthusiastic to the approach of using time spent on the CPU, 
remaining time and laxity to enhance the deadline-scheduler.

-Henrik

Download attachment "signature.asc" of type "application/pgp-signature" (182 bytes)