Date:   Wed, 11 Jan 2017 12:19:51 +0000
From:   Juri Lelli <juri.lelli@....com>
To:     Luca Abeni <luca.abeni@...tn.it>
Cc:     Daniel Bristot de Oliveira <bristot@...hat.com>,
        linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Claudio Scordino <claudio@...dence.eu.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Tommaso Cucinotta <tommaso.cucinotta@...up.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

Hi,

On 04/01/17 19:30, Luca Abeni wrote:
> 2017-01-04 19:00 GMT+01:00, Daniel Bristot de Oliveira <bristot@...hat.com>:
> [...]
> >>>>> Some tasks start to use more CPU time, while others seem to use
> >>>>> less CPU than was reserved for them. See task 14926: it is using
> >>>>> only 23.8% of the CPU, which is less than its 10/30 reservation.
> >>>>
> >>>> What happened here is that some runqueues have an active utilisation
> >>>> larger than 0.95. So, GRUB is decreasing the amount of time received by
> >>>> the tasks on those runqueues to consume less than 95%... This is the
> >>>> reason for the effect you noticed below:
> >>>
> >>> I see. But, AFAIK, Linux's SCHED_DEADLINE measures the load
> >>> globally, not locally. So, it is not a problem to have a load > 95%
> >>> in the local queue if the global load is < 95%.
> >>>
> >>> Am I missing something?
> >>
> >> The version of GRUB reclaiming implemented in my patches tracks a
> >> per-runqueue "active utilization", and uses it for reclaiming.
> >
> > I _think_ that this might be (one of) the source(s) of the problem...
> I agree that this can cause some problems, but I am not sure it
> justifies the huge difference in utilisations you observed.
> 
> > Just exercising...
> >
> > For example, with my taskset and a hypothetically perfect balance
> > across the runqueues, one possible scenario is:
> >
> >    CPU    0    1     2     3
> > # TASKS   3    3     3     2
> >
> > In this case, CPUs 0, 1 and 2 are at 100% local utilization. Thus,
> > the current tasks on these CPUs will have their runtime decreased by
> > GRUB. Meanwhile, the lucky tasks on CPU 3 would use additional time
> > that they "globally" do not have - because the system, globally, has
> > a load higher than the 66.6...% of that local runqueue. Actually,
> > part of the time taken away from the tasks on [0-2] is being used by
> > the tasks on 3, until the next migration of any task, which changes
> > which tasks are the lucky ones... but without any guarantee that
> > every task will be a lucky one on every activation, causing the
> > problem.
> >
> > Does it make sense?
> 
> Yes; but my impression is that gEDF will migrate tasks so that the
> distribution of the reclaimed CPU bandwidth is almost uniform...
> Instead, you saw huge differences in the utilisations (and I do not
> think that "compressing" the utilisations from 100% to 95% can
> decrease the utilisation of a task from 33% to 25% / 26%... :)
>

I tried to replicate Daniel's experiment, but I don't see such a skewed
allocation. The tasks get a reasonably uniform bandwidth and the trace
looks fairly good as well (all processes get to run on the different
processors at some point).

> I suspect there is something more going on here (might be some bug in
> one of my patches). I am trying to better understand what happened.
> 

However, playing with this a bit further, I found out one thing that
looks counter-intuitive (at least to me :).

Simplifying Daniel's example, let's say that we have one 10/30 task
running on a CPU with a 500/1000 global limit. Applying the
grub_reclaim() formula we have:

 delta_exec = delta * (0.5 + 0.333) = delta * 0.833

In practice this means that 1ms of real delta (at 1000HZ) corresponds
to 0.833ms of virtual delta. Considering this, a 10ms (over 30ms)
reservation gets "extended" to ~12ms (over 30ms), that is to say the
task consumes 0.4 of the CPU's bandwidth. top seems to back what I'm
saying, but am I still talking nonsense? :)
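
To make the numbers concrete, here is a minimal user-space sketch of
the arithmetic above (assuming the Q20 fixed-point convention of
to_ratio(), i.e. a bandwidth of 1.0 is 1ULL << 20; the variable names
are illustrative only, not the actual rq->dl fields, and the in-kernel
grub_reclaim() may compute this differently):

 /* Illustrative user-space sketch, not kernel code. */
 #include <stdio.h>
 #include <stdint.h>

 #define BW_SHIFT 20
 #define BW_UNIT  (1ULL << BW_SHIFT)

 static uint64_t to_ratio(uint64_t period, uint64_t runtime)
 {
         return (runtime << BW_SHIFT) / period; /* runtime/period in Q20 */
 }

 int main(void)
 {
         /* 500/1000 global limit: 0.5 of the CPU is not for -deadline */
         uint64_t non_deadline_bw = BW_UNIT - to_ratio(1000, 500);
         /* one 10/30 task: running_bw ~= 0.333 */
         uint64_t running_bw = to_ratio(30, 10);
         uint64_t delta = 1000000; /* 1ms of real time, in ns */

         /* the rule described above: delta * (0.5 + 0.333) = delta * 0.833 */
         uint64_t delta_exec = (delta * (non_deadline_bw + running_bw))
                               >> BW_SHIFT;

         /* prints ~833333 ns: 10ms of runtime spans 10/0.833 ~= 12ms,
          * i.e. 12/30 = 0.4 of the CPU */
         printf("virtual delta = %llu ns\n",
                (unsigned long long)delta_exec);
         return 0;
 }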

I was expecting that the task could consume 0.5 worth of bandwidth with
the given global limit. Is the current behaviour intended?

If we want to change this behaviour, maybe something like the following
would work?

 delta_exec = (delta * to_ratio((1ULL << 20) - rq->dl.non_deadline_bw,
                                rq->dl.running_bw)) >> 20

The idea would be to normalize running_bw over the available dl_bw.
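
Continuing the sketch above (same illustrative names, appended to its
main()), the normalization would give for the same 10/30 task and
500/1000 limit:

         /* proposed rule: scale by running_bw / (1 - non_deadline_bw) */
         uint64_t avail_bw = BW_UNIT - non_deadline_bw;      /* 0.5 */
         uint64_t ratio = to_ratio(avail_bw, running_bw);    /* ~0.666 */
         uint64_t delta_norm = (delta * ratio) >> BW_SHIFT;  /* ~666666 ns */

         /* 10ms of runtime now spans 10/0.666 ~= 15ms, i.e. 15/30 = 0.5
          * of the CPU, which is the share one would expect under the
          * 500/1000 limit */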

Thoughts?

Best,

- Juri
