linux-kernel - Re: [RFC v4 0/6] CPU reclaiming for SCHED

Message-ID: <CAKknFTBb0P4=-=MBrhYgFQvhqZUmAF70Uq987EKD7ui4WSuP9Q@mail.gmail.com>
Date:   Wed, 4 Jan 2017 17:42:14 +0100
From:   Luca Abeni <luca.abeni@...tn.it>
To:     Daniel Bristot de Oliveira <bristot@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@....com>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Tommaso Cucinotta <tommaso.cucinotta@...up.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

Hi Daniel,

2017-01-04 16:14 GMT+01:00, Daniel Bristot de Oliveira <bristot@...hat.com>:
> On 01/04/2017 01:17 PM, luca abeni wrote:
>> Hi Daniel,
>>
>> On Tue, 3 Jan 2017 19:58:38 +0100
>> Daniel Bristot de Oliveira <bristot@...hat.com> wrote:
>>
>> [...]
>>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>>
>>>   period = 30 ms
>>>   runtime = 10 ms
>>>   flags = 0 (GRUB disabled)
>>>
>>> I see this:
>>> ------------------------------- HTOP ------------------------------------
>>>   1 [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
>>>   2 [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81
>>>   3 [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
>>>   4 [|||||||||||||||||||||92.5%]
>>>   Mem[|||||||||||||||1.13G/3.78G]
>>>   Swp[                  0K/3.90G]
>>>
>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>> 16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
>>> 16249 root      -101   0  4204   624   556 R 32.4  0.0  2:09.80 d
>>> 16250 root      -101   0  4204   728   660 R 32.4  0.0  2:09.58 d
>>> 16252 root      -101   0  4204   676   608 R 32.4  0.0  2:09.08 d
>>> 16253 root      -101   0  4204   636   568 R 32.4  0.0  2:08.85 d
>>> 16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
>>> 16255 root      -101   0  4204   620   556 R 32.4  0.0  2:08.40 d
>>> 16257 root      -101   0  4204   708   640 R 32.4  0.0  2:07.98 d
>>> 16256 root      -101   0  4204   624   560 R 32.4  0.0  2:08.18 d
>>> 16248 root      -101   0  4204   680   612 R 33.0  0.0  2:10.15 d
>>> 16251 root      -101   0  4204   676   608 R 33.0  0.0  2:09.34 d
>>> 16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
>>> ------------------------------- HTOP ------------------------------------
>>>
>>> All tasks are using +- the same amount of CPU time, a little bit more
>>> than 30%, as expected.
>>
>> Notice that, if I understand well, each task should receive 33.33% (1/3)
>> of CPU time. Anyway, I think this is ok...
>
> If we think on a partitioned system, yes for the CPUs in which 3 'd'
> tasks are able to run. But as sched deadline is global by definition,
> the load is:
>
> SUM(U_i)  / M processors.
>
> 1/3 * 11  / 4            = 0.916666667
>
> So 10/30 (1/3) of this workload is:
> 91.6 / 3 = 30.533333333
>
> Well, the rest is probably overheads, like scheduling, migration...

I do not think this math is correct. Yes, the total utilization of
the taskset is about 0.92 (or 3.67, depending on how you define
utilization), but I still think that the percentage of CPU time shown
by "top" or "htop" for each task should be 33.33 (or 8.33, depending
on how the tool computes it).

runtime=10 and period=30 means "schedule the task for 10ms every
30ms", so each task will consume 33% of the CPU time of a single core.
In other words, 10/30 is a fraction of the CPU time, not a fraction of
the time consumed by SCHED_DEADLINE tasks.
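
To make the arithmetic explicit, here is a small sketch in plain Python
(nothing here comes from the kernel, and all names are mine) of the
numbers for 11 tasks with runtime=10ms and period=30ms on 4 CPUs:

```python
from fractions import Fraction

RUNTIME_MS = 10
PERIOD_MS = 30
NR_TASKS = 11
NR_CPUS = 4

# Utilization of one task: the fraction of a single core it is entitled to.
task_util = Fraction(RUNTIME_MS, PERIOD_MS)

# Total utilization of the taskset, in "cores", and normalized by CPU count.
total_util = NR_TASKS * task_util
normalized_util = total_util / NR_CPUS

# What a top-like tool would be expected to show per task, depending on
# whether it normalizes by one core or by the whole machine.
percent_of_one_core = float(task_util * 100)
percent_of_machine = float(task_util * 100 / NR_CPUS)

print(f"per-task utilization:        {float(task_util):.4f}")
print(f"total utilization (cores):   {float(total_util):.4f}")
print(f"total utilization (per CPU): {float(normalized_util):.4f}")
print(f"per-task CPU%, one core:     {percent_of_one_core:.2f}")
print(f"per-task CPU%, whole box:    {percent_of_machine:.2f}")
```

The per-task value depends only on runtime/period, not on how many
other SCHED_DEADLINE tasks are running.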


>>> However, if I enable GRUB in the same task set I get this:
>>>
>>> ------------------------------- HTOP ------------------------------------
>>>   1 [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
>>>   2 [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98
>>>   3 [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
>>>   4 [|||||||||||||||||||||96.4%]
>>>   Mem[|||||||||||||||1.13G/3.78G]
>>>   Swp[                  0K/3.90G]
>>>
>>>   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>> 14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
>>> 14962 root      -101   0  4204   728   660 R 45.8  0.0  1h05:06 g
>>> 14959 root      -101   0  4204   680   612 R 45.2  0.0  1h07:29 g
>>> 14927 root      -101   0  4204   624   556 R 44.6  0.0  1h04:30 g
>>> 14928 root      -101   0  4204   656   588 R 31.1  0.0 47:37.21 g
>>> 14961 root      -101   0  4204   684   616 R 31.1  0.0 47:19.75 g
>>> 14968 root      -101   0  4204   636   568 R 31.1  0.0 46:27.36 g
>>> 14960 root      -101   0  4204   684   616 R 23.8  0.0 37:31.06 g
>>> 14969 root      -101   0  4204   684   616 R 23.8  0.0 38:11.50 g
>>> 14925 root      -101   0  4204   636   568 R 23.8  0.0 37:34.88 g
>>> 14926 root      -101   0  4204   684   616 R 23.8  0.0 38:27.37 g
>>> 16182 root       20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
>>>   862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
>>>  2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
>>>   588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
>>> ------------------------------- HTOP ------------------------------------
>>>
>>> Some tasks start to use more CPU time, while others seem to use less
>>> CPU than was reserved for them. See task 14926: it is using only
>>> 23.8% of the CPU, which is less than its 10/30 reservation.
>>
>> What happened here is that some runqueues have an active utilisation
>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>> the tasks on those runqueues to consume less than 95%... This is the
>> reason for the effect you noticed below:
>
> I see. But, AFAIK, the Linux's sched deadline measures the load
> globally, not locally. So, it is not a problem having a load > than 95%
> in the local queue if the global queue is < 95%.
>
> Am I missing something?

The version of GRUB reclaiming implemented in my patches tracks a
per-runqueue "active utilization", and uses it for reclaiming.
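
As a rough illustration of how a per-runqueue active utilization could
lead to the htop figures quoted above, here is a sketch in plain Python.
It assumes a simplified accounting rule in which the runtime of the
running task is depleted at rate Uact / Umax, with Umax = 0.95 (this is
a simplification for the sake of the example, not code from the
patches), and that every task on the runqueue is always runnable. The
function name is made up.

```python
U_MAX = 0.95  # assumed cap on the bandwidth that can be reclaimed


def cpu_share(nr_tasks, runtime_ms=10.0, period_ms=30.0, u_max=U_MAX):
    """Percent of one core received by each of nr_tasks identical,
    always-runnable tasks sharing a single runqueue."""
    # Active utilization of this runqueue: sum of the task utilizations.
    u_act = nr_tasks * runtime_ms / period_ms
    # The budget is depleted u_act / u_max times faster than real time, so
    # runtime_ms of budget lasts runtime_ms * u_max / u_act of execution.
    executed_ms = runtime_ms * u_max / u_act
    return 100.0 * executed_ms / period_ms


for n in (2, 3, 4):
    print(f"{n} tasks on the runqueue: {cpu_share(n):.2f}% each")
```

Under these assumptions the sketch gives 47.5%, about 31.7% and 23.75%
for runqueues holding 2, 3 and 4 tasks, which is close to the
45.8 / 31.1 / 23.8 groups in the htop output.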

>>> After some debugging, it seems that in this case GRUB is also
>>> _reducing_ the runtime of the task by making the notion of consumed
>>> runtime be greater than the actual consumed runtime.
>> [...]
>>
>> Now, this is "kind of expected", because you have 11 tasks each one
>> having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
>> 3 tasks on it, resulting in an utilisation = 1 > 0.95. But this should
>> not result in what you have seen in htop...
>
> Well, the sched deadline aims to schedule the M highest priority tasks,
> and migrates tasks to achieve this goal. However, I am not sure if
> having the whole runqueue balance is a goal/restriction/feature of the
> deadline scheduler.
>
> Maybe this is the difference between the GRUB and sched deadline
> assumptions that is causing the problem. Just thinking aloud.

I think I found some strange behaviour in the push/pull mechanisms (at
least it seems strange to me):
- a "pull" operation might end up pulling multiple tasks. I see this
  can simplify the implementation, but I think pulling multiple tasks
  is useless and might introduce some overhead, even independently of
  my patches;
- I suspect (but I still need to verify this) that a "push" operation
  can push a task to a "wrong" destination runqueue, that is, a
  runqueue where it is not the earliest deadline task.

Without reclaiming, this just results in useless migrations (unless I
misunderstood something), but with my reclaiming patches this is
probably the source of the strange effect you saw. I am still
investigating this, so I am not too sure...
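
To clarify what I mean by a "wrong" destination: I would expect a push
to select a runqueue where the pushed task becomes the earliest
deadline task. A toy sketch of that check follows (plain Python,
hypothetical names and data layout, not the kernel's push code):

```python
def pick_push_target(task_deadline, rq_earliest_deadline):
    """Return the CPU to push to, or None if no suitable CPU exists.

    rq_earliest_deadline maps a CPU id to the earliest absolute deadline
    queued on that CPU, or to None if the CPU has no deadline task.
    """
    best_cpu = None
    best_deadline = None
    for cpu in sorted(rq_earliest_deadline):
        earliest = rq_earliest_deadline[cpu]
        if earliest is None:
            # No deadline task here: the pushed task is trivially the earliest.
            return cpu
        if earliest <= task_deadline:
            # The pushed task would not be the earliest deadline task here.
            continue
        # Among the valid CPUs, prefer the one with the latest deadline.
        if best_deadline is None or earliest > best_deadline:
            best_cpu, best_deadline = cpu, earliest
    return best_cpu


print(pick_push_target(100, {0: 50, 1: 150, 2: 120}))
```

With this rule a task is never moved to a runqueue where it would just
wait behind an earlier deadline.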

>> The real issue seems to be that at some point some runqueues have an
>> active utilisation = 1.33 (4 dl tasks in the runqueue), with other
>> runqueues only having 2 tasks... And this results in the huge imbalance
>> in utilisations you noticed. I am trying to understand why this
>> happens... It seems to me that a "pull_dl_task()" might end up pulling
>> more than 1 task... Is this possible?
>
> Yeah, this explains the numbers.
>
> Brainstorm time! (sorry if it sounds obviously unfeasible):
> Is it possible to have GRUB track the global utilization?

Yes, and I even had a version of my patches using a "per root domain"
global active utilization. If needed, I can update my patchset to
implement the global active utilization again.

I switched to per-runqueue active utilization because:
- it can be used for controlling CPU frequency scaling, and I've been
  told that frequency scaling is generally per-core / per-CPU (but I
  need to verify this);
- the patches based on global active utilization needed to access it
  in mutual exclusion, so I used a spinlock to protect it, and I am
  not sure about scalability issues;
- I suspect there were issues when the root domain / exclusive cpuset
  is modified.
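
The difference between the two designs can be sketched as follows. This
is only a userspace toy in Python with made-up class names, where a
threading.Lock stands in for the spinlock; it is not how the patches are
written.

```python
import threading


class GlobalActiveUtil:
    """One counter shared by all CPUs: every update takes the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0.0

    def add(self, util):
        with self._lock:
            self.value += util

    def sub(self, util):
        with self._lock:
            self.value -= util


class PerRunqueueActiveUtil:
    """One counter per CPU: an update touches only that CPU's entry
    (assumed to be protected by the runqueue's own lock, not modelled)."""

    def __init__(self, nr_cpus):
        self.value = [0.0] * nr_cpus

    def add(self, cpu, util):
        self.value[cpu] += util

    def sub(self, cpu, util):
        self.value[cpu] -= util


g = GlobalActiveUtil()
p = PerRunqueueActiveUtil(nr_cpus=4)
for cpu in (0, 0, 1):
    g.add(1 / 3)       # all CPUs contend on one lock
    p.add(cpu, 1 / 3)  # each CPU updates its own counter
print(g.value, p.value)
```

In the global variant all CPUs serialize on a single lock, which is
where my scalability concern comes from.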


Thanks,
Luca