linux-kernel - Re: [RFC] sched/deadline: Prevent rt

Message-Id: <20140225151515.617714e2f2cd6c558531ba61@gmail.com>
Date:	Tue, 25 Feb 2014 15:15:15 +0100
From:	Juri Lelli <juri.lelli@...il.com>
To:	tkhai@...dex.ru
Cc:	Peter Zijlstra <peterz@...radead.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: [RFC] sched/deadline: Prevent rt_time growth to infinity

On Sat, 22 Feb 2014 04:56:59 +0400
Kirill Tkhai <tkhai@...dex.ru> wrote:

> On 21.02.2014 20:36, Juri Lelli wrote:
> > On Fri, 21 Feb 2014 11:37:15 +0100
> > Peter Zijlstra <peterz@...radead.org> wrote:
> > 
> >> On Thu, Feb 20, 2014 at 02:16:00AM +0400, Kirill Tkhai wrote:
> >>> Since deadline tasks share rt bandwidth, we must care about
> >>> bandwidth timer set. Otherwise rt_time may grow up to infinity
> >>> in update_curr_dl(), if there are no other available RT tasks
> >>> on top level bandwidth.
> >>>
> >>> I'm going to solve the problem the way below. Almost untested
> >>> because I skipped almost all of the recent patches which have to
> >>> be applied from lkml.
> >>>
> >>> Please say, if I skipped anything in idea. Maybe better put
> >>> start_top_rt_bandwidth() into set_curr_task_dl()?
> >>
> >> How about we only increment rt_time when there's an RT bandwidth timer
> >> active?
> >>
> >>
> >> ---
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -568,6 +568,12 @@ static inline struct rt_bandwidth *sched
> >>  
> >>  #endif /* CONFIG_RT_GROUP_SCHED */
> >>  
> >> +bool sched_rt_bandwidth_active(struct rt_rq *rt_rq)
> >> +{
> >> +	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
> >> +	return hrtimer_active(&rt_b->rt_period_timer);
> >> +}
> >> +
> >>  #ifdef CONFIG_SMP
> >>  /*
> >>   * We ran out of runtime, see if we can borrow some from our neighbours.
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -587,6 +587,8 @@ int dl_runtime_exceeded(struct rq *rq, s
> >>  	return 1;
> >>  }
> >>  
> >> +extern bool sched_rt_bandwidth_active(struct rt_rq *rt_rq);
> >> +
> >>  /*
> >>   * Update the current task's runtime statistics (provided it is still
> >>   * a -deadline task and has not been removed from the dl_rq).
> >> @@ -650,11 +652,13 @@ static void update_curr_dl(struct rq *rq
> >>  		struct rt_rq *rt_rq = &rq->rt;
> >>  
> >>  		raw_spin_lock(&rt_rq->rt_runtime_lock);
> >> -		rt_rq->rt_time += delta_exec;
> >>  		/*
> >>  		 * We'll let actual RT tasks worry about the overflow here, we
> >> -		 * have our own CBS to keep us inline -- see above.
> >> +		 * have our own CBS to keep us inline; only account when RT
> >> +		 * bandwidth is relevant.
> >>  		 */
> >> +		if (sched_rt_bandwidth_active(rt_rq))
> >> +			rt_rq->rt_time += delta_exec;
> >>  		raw_spin_unlock(&rt_rq->rt_runtime_lock);
> >>  	}
> >>  }
> > 
> > So, I ran some tests with the above and I'd like to share with you what
> > I've found. You can find here a trace-cmd trace that should be fed
> > to kernelshark to be able to understand what follows (or feel free to
> > reproduce same scenario :)):
> > http://retis.sssup.it/~jlelli/traces/trace_rt_time.dat
> > 
> > Here you have a DL task (4/10) and a while(1) RT task, both running
> > inside a rt_bw of 0.5. The RT task is activated 500ms after DL. As I
> > filtered in sched_rt_period_timer(), you can search for time instants
> > when the rt_bw is replenished. It is evident that the first time after
> > rt timer is activated back (search for start_bandwidth_timer), we can
> > eat some bw from FAIR tasks (if any). This is due to the fact that we
> > reset rt_bw budget at this time, start decrementing rt_time for both DL
> > and RT tasks, throttle RT tasks when rt_time > runtime, but, since DL
> > tasks actually execute inside their own server, they don't care about
> > rt_bw. Good news is that steady state is ok: keeping track of overruns
> > we are able to stop eating bw from other guys.
> > 
> > My thoughts:
> > 
> >  - Peter's patch is an easy fix to Kirill's problem (RT tasks were
> >    throttled too early);
> >  - something to add to this solution could be to pre-calculate bw of
> >    ready DL tasks and subtract it from rt_bw at replenishment time, but
> >    it sounds quite awkward, pessimistic, and I'm not sure it is gonna
> >    work;
> >  - we are stealing bw from best-effort tasks, and just at the beginning
> >    of the transition, is it really a problem?
> >  - I mean, if you want guarantees make your tasks DL! :);
> >  - in the long run we are gonna have RT tasks scheduled inside CBS
> >    servers, and all this will be properly fixed up.
> > 
> > Comments?
> > 
> > BTW, rt timer activation/deactivation should probably be fixed for
> > !RT_GROUP_SCHED with something like this:
> > 
> > ---
> >  kernel/sched/rt.c |   10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 6161de8..274f992 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -86,12 +86,12 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
> >  	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
> >  }
> >  
> > -#ifdef CONFIG_RT_GROUP_SCHED
> >  static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
> >  {
> >  	hrtimer_cancel(&rt_b->rt_period_timer);
> >  }
> >  
> > +#ifdef CONFIG_RT_GROUP_SCHED
> >  #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
> >  
> >  static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
> > @@ -1017,8 +1017,12 @@ inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
> >  	start_rt_bandwidth(&def_rt_bandwidth);
> >  }
> >  
> > -static inline
> > -void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
> > +static void
> > +dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
> > +{
> > +	if (!rt_rq->rt_nr_running)
> > +		destroy_rt_bandwidth(&def_rt_bandwidth);
> > +}
> >  
> >  #endif /* CONFIG_RT_GROUP_SCHED */
> >  
> 
> It looks like, with both patches applied, we may get into a situation,
> when all CPU time is shared between RT and DL tasks:
> 
> rt_runtime = n
> rt_period  = 2n
> 
> | RT working, DL sleeping  | DL working, RT sleeping      |
> -----------------------------------------------------------
> | (1)     duration = n     | (2)     duration = n         | (repeat)
> |--------------------------|------------------------------|
> | (rt_bw timer is running) | (rt_bw timer is not running) |
> 
> No time for fair tasks at all.

Ok, this situation is pathological. DL bandwidth is guaranteed at
admission control, while RT isn't. In this case RT tasks are doomed by
construction. Still you'd like to let FAIR tasks execute :).
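
To make the admission-control point concrete, here is a minimal
single-CPU sketch of that kind of check. It is not the kernel's code:
bw_ratio(), dl_admit() and struct dl_bw_model are hypothetical names,
and the 20-bit fixed-point scale is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20	/* assumed fixed-point scale: 1.0 == 1 << 20 */

/* Hypothetical bookkeeping: allowed and currently reserved bandwidth. */
struct dl_bw_model {
	uint64_t cap;	/* maximum total utilization, fixed point */
	uint64_t total;	/* utilization already handed out, fixed point */
};

/* Utilization runtime/period as a fixed-point fraction (rounds down). */
static uint64_t bw_ratio(uint64_t runtime, uint64_t period)
{
	assert(period > 0);
	return (runtime << BW_SHIFT) / period;
}

/*
 * Admit a DL task only if its utilization still fits under the cap.
 * On success the bandwidth is reserved; on failure nothing changes.
 */
static bool dl_admit(struct dl_bw_model *b, uint64_t runtime, uint64_t period)
{
	uint64_t bw = bw_ratio(runtime, period);

	if (b->total + bw > b->cap)
		return false;

	b->total += bw;
	return true;
}
```

With a cap of 0.5 (bw_ratio(1, 2)), a 4/10 task is admitted and a
further 2/10 task is refused. No comparable test stops a while(1) RT
task from being enqueued, which is why RT is the class that loses here.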

I argue for a slightly different solution in what follows; what do you
think?

Thanks,

- Juri

From e44fe2eef34433a7799cfc153f467f7c62813596 Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@...il.com>
Date: Fri, 21 Feb 2014 11:37:15 +0100
Subject: [PATCH] sched/deadline: Prevent rt_time growth to infinity

Kirill Tkhai noted:
Since deadline tasks share rt bandwidth, we must care about
bandwidth timer set. Otherwise rt_time may grow up to infinity
in update_curr_dl(), if there are no other available RT tasks
on top level bandwidth.

RT tasks were in fact throttled right after they got enqueued,
and never executed again (rt_time never again went below rt_runtime).

Peter then proposed to accrue DL execution on rt_time only when
rt timer is active, and proposed a patch (this patch is a slight
modification of that) to implement that behavior. While this
solves Kirill's problem, it has a drawback.

Indeed, Kirill noted again:
It looks like we may get into a situation, when all CPU time is shared
between RT and DL tasks:

rt_runtime = n
rt_period  = 2n

| RT working, DL sleeping  | DL working, RT sleeping      |
-----------------------------------------------------------
| (1)     duration = n     | (2)     duration = n         | (repeat)
|--------------------------|------------------------------|
| (rt_bw timer is running) | (rt_bw timer is not running) |

No time for fair tasks at all.

While this can happen during the first period, if the rq is always
backlogged, RT tasks won't have the opportunity to execute anymore: rt_time
reached rt_runtime during (1); suppose that after (2) RT is enqueued back:
it gets throttled since the rt timer didn't fire, and replenishment is from
now on eaten up by DL tasks that accrue their execution on rt_time (while
the rt timer is active - we have an RT task waiting for replenishment).
FAIR tasks are not touched after this first period. Ok, this is not ideal,
and the situation is even worse!

The above (the nice case) practically never happens in reality, where
your rt timer is not aligned to task periods, tasks are in general not
periodic, etc. Long story short, you always risk overloading your system.

This patch is based on Peter's idea, but exploits an additional fact:
if you don't have RT tasks enqueued, it makes little sense to continue
incrementing rt_time once you reached the upper limit (DL tasks have their
own mechanism for throttling).

This cures both problems:

 - no matter how many DL instances ran in the past, you'll have an rt_time
   slightly above rt_runtime when an RT task is enqueued, and from that
   point on (after the first replenishment), the task will normally execute;

 - you can still eat up all bandwidth during the first period, but not
   after that; remember that DL execution will increment rt_time only
   until the upper limit is reached.

The situation is still not perfect! But, we have a simple solution for now,
that limits how much you can jeopardize your system, as we keep working
towards the right answer: RT groups scheduled using deadline servers.

Signed-off-by: Juri Lelli <juri.lelli@...il.com>
---
 kernel/sched/deadline.c |    8 ++++++--
 kernel/sched/rt.c       |    8 ++++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 15cbc17..f59d774 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -564,6 +564,8 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	return 1;
 }
 
+extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
@@ -627,11 +629,13 @@ static void update_curr_dl(struct rq *rq)
 		struct rt_rq *rt_rq = &rq->rt;
 
 		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_time += delta_exec;
 		/*
 		 * We'll let actual RT tasks worry about the overflow here, we
-		 * have our own CBS to keep us inline -- see above.
+		 * have our own CBS to keep us inline; only account when RT
+		 * bandwidth is relevant.
 		 */
+		if (sched_rt_bandwidth_account(rt_rq))
+			rt_rq->rt_time += delta_exec;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7dba25a..7f372e1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -538,6 +538,14 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
+{
+	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
+
+	return (hrtimer_active(&rt_b->rt_period_timer) ||
+		rt_rq->rt_time < rt_b->rt_runtime);
+}
+
 #ifdef CONFIG_SMP
 /*
  * We ran out of runtime, see if we can borrow some from our neighbours.
-- 
1.7.9.5
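
The accounting rule in sched_rt_bandwidth_account() can be tried out in
userspace. What follows is only a rough model under stated assumptions:
struct rt_model and the model_*() helpers are hypothetical, timer_active
stands in for hrtimer_active(&rt_b->rt_period_timer), and locking,
replenishment and throttling are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the rt_rq/rt_bandwidth fields the patch reads. */
struct rt_model {
	uint64_t rt_time;	/* execution accrued so far */
	uint64_t rt_runtime;	/* budget per period */
	bool timer_active;	/* models hrtimer_active() on the period timer */
};

/* Same condition as the predicate added by the patch. */
static bool model_bandwidth_account(const struct rt_model *m)
{
	return m->timer_active || m->rt_time < m->rt_runtime;
}

/* Same shape as the guarded increment in update_curr_dl(). */
static void model_update_curr_dl(struct rt_model *m, uint64_t delta_exec)
{
	if (model_bandwidth_account(m))
		m->rt_time += delta_exec;
}

/* Apply `ticks` DL accounting updates of `delta_exec` each; return rt_time. */
static uint64_t model_run_dl(struct rt_model *m, unsigned int ticks,
			     uint64_t delta_exec)
{
	assert(m != NULL);

	for (unsigned int i = 0; i < ticks; i++)
		model_update_curr_dl(m, delta_exec);

	return m->rt_time;
}
```

Starting from rt_time = 0 with rt_runtime = 950 and the timer inactive,
1000 updates of 100 each leave rt_time at 1000, that is less than one
delta_exec above rt_runtime. With the timer active the same run reaches
100000, since every update is accounted.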