Message-ID: <20170103223301.0c6af9d2@sweethome>
Date:   Tue, 3 Jan 2017 22:33:01 +0100
From:   luca abeni <luca.abeni@...tn.it>
To:     Daniel Bristot de Oliveira <bristot@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@....com>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Tommaso Cucinotta <tommaso.cucinotta@...up.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

Hi Daniel,
(sorry for the previous HTML email; I replied from my phone and did
not realise how the email client was configured)

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@...hat.com> wrote:

[...]
> > The implemented CPU reclaiming algorithm is based on tracking the
> > utilization U_act of active tasks (first 2 patches), and on
> > modifying the runtime accounting rule (see patch 0004). The original
> > GRUB algorithm is modified as described in [2] to support multiple
> > CPUs (the original algorithm considered a single CPU; this one
> > tracks U_act per runqueue) and to leave an "unreclaimable" fraction
> > of CPU time to non-SCHED_DEADLINE tasks (see patch 0005: the
> > original algorithm can consume 100% of the CPU time, starving all
> > the other tasks). Patch 0003 uses the "inactive timer" introduced in
> > patch 0002 to fix dl_overflow() and __setparam_dl(). Patch 0006
> > allows enabling CPU reclaiming only on selected tasks.
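> >
> > To illustrate the idea, the modified accounting rule amounts to
> > something like the sketch below (illustrative only, not the exact
> > code of patches 0004/0005; reclaim_sketch(), u_inact and u_extra
> > are made-up names, and utilizations are fixed point with BW_SHIFT
> > fractional bits as in kernel/sched/deadline.c):
> >
> > #define BW_SHIFT	20
> > #define BW_UNIT		(1 << BW_SHIFT)
> >
> > /*
> >  * Charge the running task delta scaled by the active utilization
> >  * U_act of its runqueue instead of the full delta, so bandwidth
> >  * left unused by inactive tasks is reclaimed.  u_inact is the
> >  * bandwidth of inactive tasks; u_extra is the unreclaimable
> >  * fraction left to non-SCHED_DEADLINE tasks.
> >  */
> > static u64 reclaim_sketch(u64 delta, u64 u_inact, u64 u_extra)
> > {
> > 	/* Admission control keeps u_inact + u_extra below BW_UNIT. */
> > 	u64 u_act = BW_UNIT - u_inact - u_extra;
> >
> > 	return (delta * u_act) >> BW_SHIFT;	/* charge <= delta */
> > }
> >
> > When all admitted tasks are active (u_inact = 0), the charge is
> > delta * (BW_UNIT - u_extra), so the u_extra share is never
> > reclaimed.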
> 
> Hi,
> 
> Today I ran some tests on this patch set. Unfortunately, it seems
> that there is a problem :-(.
[...]
I reproduced this issue; thanks for the report. It seems to happen
because there are more reclaiming tasks than CPU cores and the load is
very high (close to the utilisation limit).
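
(With 11 tasks of utilisation 10/30 on 4 CPUs, the total is 11/3 of a
CPU, i.e. about 3.67/4 = 92% of the box, which is indeed close to the
default 95% rt-runtime limit.)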

I am investigating it, and will hopefully post an update in the next
few days.



			Thanks,
				Luca


> 
> On a four-core box, if I dispatch 11 tasks [1] with this setup:
> 
>   period = 30 ms
>   runtime = 10 ms
>   flags = 0 (GRUB disabled)
> 
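> 
> For reference, each task is set up roughly like this (an illustrative
> sketch, not the actual d.c from [1]; the manual struct sched_attr and
> the syscall wrapper follow the usual pattern, since glibc has no
> sched_setattr() wrapper):
> 
> #define _GNU_SOURCE
> #include <stdint.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> 
> #ifndef SCHED_DEADLINE
> #define SCHED_DEADLINE	6
> #endif
> 
> struct sched_attr {
> 	uint32_t size;
> 	uint32_t sched_policy;
> 	uint64_t sched_flags;
> 	int32_t  sched_nice;
> 	uint32_t sched_priority;
> 	uint64_t sched_runtime;
> 	uint64_t sched_deadline;
> 	uint64_t sched_period;
> };
> 
> static int sched_setattr(pid_t pid, const struct sched_attr *attr,
> 			 unsigned int flags)
> {
> 	return syscall(SYS_sched_setattr, pid, attr, flags);
> }
> 
> int main(void)
> {
> 	struct sched_attr attr = {
> 		.size		= sizeof(attr),
> 		.sched_policy	= SCHED_DEADLINE,
> 		.sched_flags	= 0,	/* 0: GRUB reclaiming disabled */
> 		.sched_runtime	= 10 * 1000 * 1000,	/* 10 ms */
> 		.sched_deadline	= 30 * 1000 * 1000,	/* 30 ms */
> 		.sched_period	= 30 * 1000 * 1000,	/* 30 ms */
> 	};
> 
> 	if (sched_setattr(0, &attr, 0))
> 		return 1;	/* needs root and admission headroom */
> 
> 	for (;;)
> 		;		/* burn CPU; the scheduler throttles us */
> }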
> I see this:
> ------------------------------- HTOP ------------------------------------
> 1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
> 2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
> 3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
> 4 [|||||||||||||||||||||92.5%]   Mem[|||||||||||||||1.13G/3.78G]
>                                  Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
> 16249 root      -101   0  4204   624   556 R 32.4  0.0  2:09.80 d
> 16250 root      -101   0  4204   728   660 R 32.4  0.0  2:09.58 d
> 16252 root      -101   0  4204   676   608 R 32.4  0.0  2:09.08 d
> 16253 root      -101   0  4204   636   568 R 32.4  0.0  2:08.85 d
> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
> 16255 root      -101   0  4204   620   556 R 32.4  0.0  2:08.40 d
> 16257 root      -101   0  4204   708   640 R 32.4  0.0  2:07.98 d
> 16256 root      -101   0  4204   624   560 R 32.4  0.0  2:08.18 d
> 16248 root      -101   0  4204   680   612 R 33.0  0.0  2:10.15 d
> 16251 root      -101   0  4204   676   608 R 33.0  0.0  2:09.34 d
> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
> ------------------------------- HTOP ------------------------------------
> 
> All tasks are using roughly the same amount of CPU time, a little
> more than 30%, as expected. However, if I enable GRUB on the same
> task set I get this:
> 
> ------------------------------- HTOP ------------------------------------
> 1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
> 2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
> 3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
> 4 [|||||||||||||||||||||96.4%]   Mem[|||||||||||||||1.13G/3.78G]
>                                  Swp[                  0K/3.90G]
> 
>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
> 14962 root      -101   0  4204   728   660 R 45.8  0.0  1h05:06 g
> 14959 root      -101   0  4204   680   612 R 45.2  0.0  1h07:29 g
> 14927 root      -101   0  4204   624   556 R 44.6  0.0  1h04:30 g
> 14928 root      -101   0  4204   656   588 R 31.1  0.0 47:37.21 g
> 14961 root      -101   0  4204   684   616 R 31.1  0.0 47:19.75 g
> 14968 root      -101   0  4204   636   568 R 31.1  0.0 46:27.36 g
> 14960 root      -101   0  4204   684   616 R 23.8  0.0 37:31.06 g
> 14969 root      -101   0  4204   684   616 R 23.8  0.0 38:11.50 g
> 14925 root      -101   0  4204   636   568 R 23.8  0.0 37:34.88 g
> 14926 root      -101   0  4204   684   616 R 23.8  0.0 38:27.37 g
> 16182 root       20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
> ------------------------------- HTOP ------------------------------------
> 
> Some tasks start to use more CPU time, while others seem to use less
> CPU than was reserved for them. See task 14926: it is using only
> 23.8% of the CPU, which is less than its 10/30 reservation.
> 
> I traced this task's activations and noticed this:
> 
>          swapper     0 [003] 14968.332244: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.339294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7050 us (14968.339294 - 14968.332244)
> 
> period:  29997 us (14968.362241 - 14968.332244)
>          swapper     0 [003] 14968.362241: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.369294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7053 us (14968.369294 - 14968.362241)
> 
> period:  29994 us (14968.392235 - 14968.362241)
>          swapper     0 [003] 14968.392235: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.399301: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7066 us (14968.399301 - 14968.392235)
> 
> period:  30008 us (14968.422243 - 14968.392235)
>          swapper     0 [003] 14968.422243: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.429294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7051 us (14968.429294 - 14968.422243)
> 
> period:  29995 us (14968.452238 - 14968.422243)
>          swapper     0 [003] 14968.452238: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
>                g 14926 [003] 14968.459293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 7055 us (14968.459293 - 14968.452238)
> 
> period:  30055 us (14968.482293 - 14968.452238)
>                g 14925 [003] 14968.482293: sched:sched_switch: g:14925 [-1] R ==> g:14926 [-1]
>                g 14926 [003] 14968.490293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
> runtime: 8000 us (14968.490293 - 14968.482293)
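> 
> (Spelling out the arithmetic: ~7 ms of runtime per ~30 ms period is
> about 23% of a CPU, which matches the 23.8% htop reports and is well
> below the reserved 10/30 = 33%.)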
> 
> The task is using less CPU than was reserved/guaranteed for it.
> 
> After some debugging, it seems that in this case GRUB is also
> _reducing_ the runtime of the task, by making the accounted runtime
> greater than the actually consumed runtime.
> 
> You can see this with this code snippet:
> 
> ------------------- %<-------------------
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 93ff400..1abb594 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
>  
>  	sched_rt_avg_update(rq, delta_exec);
>  
> -	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> -		delta_exec = grub_reclaim(delta_exec, rq);
> -	dl_se->runtime -= delta_exec;
> +	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> +		u64 new_delta_exec;
> +		new_delta_exec = grub_reclaim(delta_exec, rq);
> +		if (new_delta_exec > delta_exec)
> +			trace_printk("new delta exec (%llu) is greater than delta exec (%llu) by %llu\n",
> +					new_delta_exec,
> +					delta_exec,
> +					(new_delta_exec - delta_exec));
> +		dl_se->runtime -= new_delta_exec;
> +	}
> +	else {
> +		dl_se->runtime -= delta_exec;
> +	}
> +
> +
>  
> throttle:
>  	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> --------------------------- >% -------------  
> 
> It seems to be related to "sched/deadline: do not reclaim the whole
> CPU bandwidth" (patch 0005), because the trace_printk message I added
> starts to appear when we start to touch this limit, and
> (new_delta_exec - delta_exec) seems to be somehow limited to the
> non_deadline_bw.
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=950000
> g-1984  [001] d.h1  1108.783349: update_curr_dl: new delta exec (1050043) is greater than delta exec (1000042) by 50001
> g-1983  [002] d.h1  1108.783349: update_curr_dl: new delta exec (1049974) is greater than delta exec (999976) by 49998
> g-1981  [003] d.h1  1108.783350: update_curr_dl: new delta exec (1050054) is greater than delta exec (1000053) by 50001
> 
> Output with sysctl -w kernel.sched_rt_runtime_us=900000
> g-1748  [001] d.h1   418.879815: update_curr_dl: new delta exec (1099995) is greater than delta exec (999996) by 99999
> g-1749  [002] d.h1   418.880815: update_curr_dl: new delta exec (1099986) is greater than delta exec (999988) by 99998
> g-1748  [001] d.h1   418.880815: update_curr_dl: new delta exec (1099962) is greater than delta exec (999966) by 99996
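> 
> (The excess matches delta_exec times the unreclaimable fraction
> almost exactly: with sched_rt_runtime_us=950000 that fraction is 5%,
> and 1000042 * 0.05 = 50002, close to the reported 50001; with 900000
> it is 10%, and 999996 * 0.10 = 100000, close to the reported 99999.)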
> 
> With fewer tasks, this error appears only at the dispatch of a new
> task, stabilizing after a few ms. But it does not stabilize when we
> are close to the limit of the RT runtime.
> 
> That is all I could find today. Am I missing something?
> 
> [1] http://bristot.me/lkml/d.c
> 
> -- Daniel
