linux-kernel - Re: [RFC v4 0/6] CPU reclaiming for SCHED

Message-ID: <ede4cd40-6be1-946d-bfec-b97e097b075d@redhat.com>
Date:   Wed, 4 Jan 2017 19:00:11 +0100
From:   Daniel Bristot de Oliveira <bristot@...hat.com>
To:     Luca Abeni <luca.abeni@...tn.it>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@....com>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Tommaso Cucinotta <tommaso.cucinotta@...up.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

On 01/04/2017 05:42 PM, Luca Abeni wrote:
> Hi Daniel,
> 
> 2017-01-04 16:14 GMT+01:00, Daniel Bristot de Oliveira <bristot@...hat.com>:
>> On 01/04/2017 01:17 PM, luca abeni wrote:
>>> Hi Daniel,
>>>
>>> On Tue, 3 Jan 2017 19:58:38 +0100
>>> Daniel Bristot de Oliveira <bristot@...hat.com> wrote:
>>>
>>> [...]
>>>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>>>
>>>>   period = 30 ms
>>>>   runtime = 10 ms
>>>>   flags = 0 (GRUB disabled)
>>>>
>>>> I see this:
>>>> ------------------------------- HTOP ------------------------------------
>>>>   1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>>>>   2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>>>>   3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>>>>   4 [|||||||||||||||||||||92.5%]
>>>>   Mem[|||||||||||||||1.13G/3.78G]
>>>>   Swp[                  0K/3.90G]
>>>>
>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
>>>> 16249 root	-101   0  4204   624   556 R 32.4  0.0  2:09.80 d
>>>> 16250 root	-101   0  4204   728   660 R 32.4  0.0  2:09.58 d
>>>> 16252 root	-101   0  4204   676   608 R 32.4  0.0  2:09.08 d
>>>> 16253 root	-101   0  4204   636   568 R 32.4  0.0  2:08.85 d
>>>> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
>>>> 16255 root	-101   0  4204   620   556 R 32.4  0.0  2:08.40 d
>>>> 16257 root	-101   0  4204   708   640 R 32.4  0.0  2:07.98 d
>>>> 16256 root	-101   0  4204   624   560 R 32.4  0.0  2:08.18 d
>>>> 16248 root	-101   0  4204   680   612 R 33.0  0.0  2:10.15 d
>>>> 16251 root	-101   0  4204   676   608 R 33.0  0.0  2:09.34 d
>>>> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
>>>> ------------------------------- HTOP ------------------------------------
>>>>
>>>> All tasks are using +- the same amount of CPU time, a little bit more
>>>> than 30%, as expected.
>>>
>>> Notice that, if I understand well, each task should receive 33.33% (1/3)
>>> of CPU time. Anyway, I think this is ok...
>>
>> If we think on a partitioned system, yes for the CPUs in which 3 'd'
>> tasks are able to run. But as sched deadline is global by definition,
>> the load is:
>>
>> SUM(U_i)  / M processors.
>>
>> 1/3 * 11  / 4            = 0.916666667
>>
>> So 10/30 (1/3) of this workload is:
>> 91.6 / 3 = 30.533333333
>>
>> Well, the rest is probably overheads, like scheduling, migration...
> 
> I do not think this math is correct... Yes, the total utilization of
> the taskset is 0.91 (or 3.66, depending on how you define the
> utilization...), but I still think that the percentage of CPU time
> shown by "top" or "htop" should be 33.33 (or 8.33, depending on how
> the tool computes it).
> runtime=10 and period=30 means "schedule the task for 10ms every
> 30ms", so the task will consume 33% of the CPU time of a single core.
> In other words, 10/30 is a fraction of the CPU time, not a fraction of
> the time consumed by SCHED_DEADLINE tasks.

Ack! You are correct, I was so focused on global utilization that I ended
up missing this point. For top/htop it should be 33.3%.
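
To keep the two quantities apart, here is a minimal sketch of the
arithmetic, using the parameters of my test (the variable and function
names are only illustrative, they do not come from the patches):

```python
from fractions import Fraction

# Parameters of the test described earlier in the thread.
RUNTIME_MS = 10
PERIOD_MS = 30
N_TASKS = 11
N_CPUS = 4


def task_utilization(runtime_ms, period_ms):
    """Fraction of a single CPU reserved for one task (runtime / period)."""
    return Fraction(runtime_ms, period_ms)


def total_utilization(n_tasks, runtime_ms, period_ms):
    """Sum of the per-task utilizations, expressed in units of CPUs."""
    return n_tasks * task_utilization(runtime_ms, period_ms)


per_task = task_utilization(RUNTIME_MS, PERIOD_MS)
total = total_utilization(N_TASKS, RUNTIME_MS, PERIOD_MS)
per_cpu_avg = total / N_CPUS

# What top/htop should show for each task (share of one CPU).
print(f"per-task share of one CPU: {float(per_task):.4f}")
# The taskset as a whole, either in CPUs or averaged over the CPUs.
print(f"total utilization (CPUs):  {float(total):.4f}")
print(f"average load per CPU:      {float(per_cpu_avg):.4f}")
```

The per-task value (1/3) is the one that top/htop reports; the 0.91
figure is the average over the four CPUs and says nothing about a
single task.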

> 
>>>> However, if I enable GRUB in the same task set I get this:
>>>>
>>>> ------------------------------- HTOP ------------------------------------
>>>>   1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>>>>   2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>>>>   3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>>>>   4 [|||||||||||||||||||||96.4%]
>>>>   Mem[|||||||||||||||1.13G/3.78G]
>>>>   Swp[                  0K/3.90G]
>>>>
>>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>>> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
>>>> 14962 root	-101   0  4204   728   660 R 45.8  0.0  1h05:06 g
>>>> 14959 root	-101   0  4204   680   612 R 45.2  0.0  1h07:29 g
>>>> 14927 root	-101   0  4204   624   556 R 44.6  0.0  1h04:30 g
>>>> 14928 root	-101   0  4204   656   588 R 31.1  0.0 47:37.21 g
>>>> 14961 root	-101   0  4204   684   616 R 31.1  0.0 47:19.75 g
>>>> 14968 root	-101   0  4204   636   568 R 31.1  0.0 46:27.36 g
>>>> 14960 root	-101   0  4204   684   616 R 23.8  0.0 37:31.06 g
>>>> 14969 root	-101   0  4204   684   616 R 23.8  0.0 38:11.50 g
>>>> 14925 root	-101   0  4204   636   568 R 23.8  0.0 37:34.88 g
>>>> 14926 root	-101   0  4204   684   616 R 23.8  0.0 38:27.37 g
>>>> 16182 root	 20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>>>>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>>>>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
>>>> ------------------------------- HTOP ------------------------------------
>>>>
>>>> Some tasks start to use more CPU time, while others seem to use less
>>>> CPU than was reserved for them. See task 14926: it is using only
>>>> 23.8% of the CPU, which is less than its 10/30 reservation.
>>>
>>> What happened here is that some runqueues have an active utilisation
>>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>>> the tasks on those runqueues to consume less than 95%... This is the
>>> reason for the effect you noticed below:
>>
>> I see. But, AFAIK, Linux's sched deadline measures the load
>> globally, not locally. So, it is not a problem to have a load higher
>> than 95% in the local queue if the global load is below 95%.
>>
>> Am I missing something?
> 
> The version of GRUB reclaiming implemented in my patches tracks a
> per-runqueue "active utilization", and uses it for reclaiming.

I _think_ that this might be (one of) the source(s) of the problem...

Just exercising...

For example, with my task set and a hypothetical perfect balance across
the runqueues, one possible scenario is:

    CPU   0   1   2   3
# TASKS   3   3   3   2

In this case, CPUs 0, 1 and 2 have 100% local utilization. Thus, the
current tasks on these CPUs will have their runtime decreased by GRUB.
Meanwhile, the lucky tasks on CPU 3 would use additional time that they
"globally" do not have, because the system, globally, has a load higher
than the 66.6...% of the local runqueue. Actually, part of the time
taken from the tasks on CPUs 0-2 is being used by the tasks on CPU 3,
until the next migration of any task, which will change which tasks are
the lucky ones... but without any guarantee that every task will be a
lucky one on every activation, causing the problem.
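
A rough sketch of the numbers behind this scenario, assuming the 3/3/3/2
placement above and the 95% limit mentioned earlier in the thread (the
names are mine, only for illustration, not taken from the patches):

```python
from fractions import Fraction

TASK_U = Fraction(10, 30)       # runtime / period of each task
TASKS_PER_CPU = [3, 3, 3, 2]    # hypothetical balanced placement
U_MAX = Fraction(95, 100)       # 95% limit mentioned in the thread


def local_utilizations(tasks_per_cpu, task_u):
    """Utilization seen by each runqueue when looking only at its own tasks."""
    return [n * task_u for n in tasks_per_cpu]


def global_utilization(tasks_per_cpu, task_u):
    """Total utilization averaged over all the CPUs."""
    return sum(tasks_per_cpu) * task_u / len(tasks_per_cpu)


local = local_utilizations(TASKS_PER_CPU, TASK_U)
over_limit = [cpu for cpu, u in enumerate(local) if u > U_MAX]
glob = global_utilization(TASKS_PER_CPU, TASK_U)

for cpu, u in enumerate(local):
    print(f"CPU {cpu}: local utilization {float(u):.3f}")
print(f"runqueues above the limit: {over_limit}")
print(f"global utilization {float(glob):.3f}, above the limit: {glob > U_MAX}")
```

So three of the four runqueues are locally above the limit, while the
system as a whole is still below it.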

Does it make sense?

If it does, this leads me to think that only with global tracking of
utilization will we achieve the correct result... but I may be missing
something... :-).
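
To illustrate the difference, here is a toy model, which is only an
assumption of mine and NOT how the patches compute the runtime: tasks
that are always runnable split the 95% equally, either per runqueue or
over the whole system.

```python
from fractions import Fraction

U_MAX = Fraction(95, 100)      # 95% limit mentioned in the thread
RESERVED = Fraction(10, 30)    # reservation of each task (runtime / period)


def share_local_cap(n_tasks_on_cpu, u_max=U_MAX):
    """Toy model (assumption): the tasks on one CPU split u_max equally."""
    return u_max / n_tasks_on_cpu


def share_global_cap(n_tasks, n_cpus, u_max=U_MAX):
    """Toy model (assumption): all tasks split u_max of every CPU equally."""
    return u_max * n_cpus / n_tasks


local = {n: share_local_cap(n) for n in (2, 3, 4)}
glob = share_global_cap(11, 4)

for n, share in local.items():
    verdict = "below" if share < RESERVED else "at or above"
    print(f"{n} tasks on a CPU: {float(share):.4f} each ({verdict} reservation)")
print(f"global accounting:  {float(glob):.4f} each")
```

In this model a task sharing a CPU with two or three others falls below
its 1/3 reservation, while with global accounting every task stays above
it. The htop figures for the 'g' tasks (about 23.8%, 31.1% and 45%) are
in the same ballpark as the 4-, 3- and 2-task values here, but that is
only a numerical observation.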

-- Daniel