linux-kernel - Re: [RFC v4 0/6] CPU reclaiming for SCHED

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170104131755.573651ca@sweethome>
Date:   Wed, 4 Jan 2017 13:17:55 +0100
From:   luca abeni <luca.abeni@...tn.it>
To:     Daniel Bristot de Oliveira <bristot@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@....com>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Tommaso Cucinotta <tommaso.cucinotta@...up.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

Hi Daniel,

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@...hat.com> wrote:

[...]
> In a four core box, if I dispatch 11 tasks [1] with setup:
> 
>   period = 30 ms
>   runtime = 10 ms
>   flags = 0 (GRUB disabled)
> 
> I see this:
> ------------------------------- HTOP
> ------------------------------------ 1
> [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running 2
> [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81 3
> [|||||||||||||||||||||92.5%]   Uptime: 05:12:43 4
> [|||||||||||||||||||||92.5%] Mem[|||||||||||||||1.13G/3.78G]
>   Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
> 16249 root	-101   0  4204   624   556 R 32.4  0.0  2:09.80 d
> 16250 root	-101   0  4204   728   660 R 32.4  0.0  2:09.58 d
> 16252 root	-101   0  4204   676   608 R 32.4  0.0  2:09.08 d
> 16253 root	-101   0  4204   636   568 R 32.4  0.0  2:08.85 d
> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
> 16255 root	-101   0  4204   620   556 R 32.4  0.0  2:08.40 d
> 16257 root	-101   0  4204   708   640 R 32.4  0.0  2:07.98 d
> 16256 root	-101   0  4204   624   560 R 32.4  0.0  2:08.18 d
> 16248 root	-101   0  4204   680   612 R 33.0  0.0  2:10.15 d
> 16251 root	-101   0  4204   676   608 R 33.0  0.0  2:09.34 d
> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77
> gnome-ter ------------------------------- HTOP
> ------------------------------------
> 
> All tasks are using +- the same amount of CPU time, a little bit more
> than 30%, as expected.

Notice that, if I understand well, each task should receive 33.33% (1/3)
of CPU time. Anyway, I think this is ok...

> However, if I enable GRUB in the same task set I get this:
> 
> ------------------------------- HTOP
> ------------------------------------ 1
> [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running 2
> [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98 3
> [|||||||||||||||||||||93.3%]   Uptime: 05:01:02 4
> [|||||||||||||||||||||96.4%] Mem[|||||||||||||||1.13G/3.78G]
>   Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
> 14962 root	-101   0  4204   728   660 R 45.8  0.0  1h05:06 g
> 14959 root	-101   0  4204   680   612 R 45.2  0.0  1h07:29 g
> 14927 root	-101   0  4204   624   556 R 44.6  0.0  1h04:30 g
> 14928 root	-101   0  4204   656   588 R 31.1  0.0 47:37.21 g
> 14961 root	-101   0  4204   684   616 R 31.1  0.0 47:19.75 g
> 14968 root	-101   0  4204   636   568 R 31.1  0.0 46:27.36 g
> 14960 root	-101   0  4204   684   616 R 23.8  0.0 37:31.06 g
> 14969 root	-101   0  4204   684   616 R 23.8  0.0 38:11.50 g
> 14925 root	-101   0  4204   636   568 R 23.8  0.0 37:34.88 g
> 14926 root	-101   0  4204   684   616 R 23.8  0.0 38:27.37 g
> 16182 root	 20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30
> iio-sensor 2191 bristot    20   0  649M 41312 32048 S  0.0  1.0
> 0:27.62 gnome-term 588 root       20   0  257M  121M  120M S  0.0
> 3.1  0:13.53 systemd-jo ------------------------------- HTOP
> ------------------------------------
> 
> Some tasks start to use more CPU time, while others seems to use less
> CPU than it was reserved for them. See the task 14926, it is using
> only 23.8 % of the CPU, which is less than its 10/30 reservation.

What happened here is that some runqueues have an active utilisation
larger than 0.95. So, GRUB is decreasing the amount of time received by
the tasks on those runqueues to consume less than 95%... This is the
reason for the effect you noticed below:


> After some debugging, it seems that in this case GRUB is also
> _reducing_ the runtime of the task by making the notion of consumed
> runtime be greater than the actual consumed runtime.
[...]

Now, this is "kind of expected", because you have 11 tasks each one
having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
3 tasks on it, resulting in an utilisation = 1 > 0.95. But this should
not result in what you have seen in htop...
The real issue seems to be that at some point some runqueues have an
active utilisation = 1.33 (4 dl tasks in the runqueue), with other
runqueues only having 2 tasks... And this results in the huge imbalance
in utilisations you noticed. I am trying to understand why this
happens... It seems to me that a "pull_dl_task()" might end up pulling
more than 1 task... Is this possible?


			Luca

> 
> You can see this with this code snip:
> 
> ------------------- %<-------------------
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 93ff400..1abb594 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
>  
>  	sched_rt_avg_update(rq, delta_exec);
>  
> -	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> -		delta_exec = grub_reclaim(delta_exec, rq);
> -	dl_se->runtime -= delta_exec;
> +	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> +		u64 new_delta_exec;
> +		new_delta_exec = grub_reclaim(delta_exec, rq);
> +		if (new_delta_exec > delta_exec)
> +			trace_printk("new delta exec (%llu) is
> greater than delta exec (%llu) by %llu\n",
> +					new_delta_exec,
> +					delta_exec,
> +					(new_delta_exec -
> delta_exec));
> +		dl_se->runtime -= new_delta_exec;
> +	}
> +	else {
> +		dl_se->runtime -= delta_exec;
> +	}
> +
> +
>  
>  throttle:
>  	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> --------------------------- >% -------------  
> 
> It seems to be related to the "sched/deadline: do not reclaim the
> whole CPU bandwidth", because the trace_printk message I put starts
> to appear when we start to touch this limit, and the (new_delta_exec
> - delta_exec) seems to be somehow limited to the non_deadline_bw.
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=950000
>                g-1984  [001] d.h1  1108.783349: update_curr_dl: new
> delta exec (1050043) is greater than delta exec (1000042) by 50001
> g-1983  [002] d.h1  1108.783349: update_curr_dl: new delta exec
> (1049974) is greater than delta exec (999976) by 49998 g-1981  [003]
> d.h1  1108.783350: update_curr_dl: new delta exec (1050054) is
> greater than delta exec (1000053) by 50001
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=900000
>                g-1748  [001] d.h1   418.879815: update_curr_dl: new
> delta exec (1099995) is greater than delta exec (999996) by 99999
> g-1749  [002] d.h1   418.880815: update_curr_dl: new delta exec
> (1099986) is greater than delta exec (999988) by 99998 g-1748  [001]
> d.h1   418.880815: update_curr_dl: new delta exec (1099962) is
> greater than delta exec (999966) by 99996
> 
> In the case of fewer tasks, this error appears just in the
> dispatch of a new task, stabilizing after some ms. But it
> does not stabilize when we are closer to the limit of the rt
> runtime.
> 
> That is all I could find today. Am I missing something?
> 
> [1] http://bristot.me/lkml/d.c
> 
> -- Daniel