linux-kernel - Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

Message-ID: <CAKfTPtAmsX_iN8r6sT4z2Bgq8J8TqNu6BYvYUy3qf6QCf6cDyg@mail.gmail.com>
Date:   Tue, 18 Oct 2016 10:43:24 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Joseph Salisbury <joseph.salisbury@...onical.com>,
        Ingo Molnar <mingo@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Mike Galbraith <efault@....de>, omer.akram@...onical.com
Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

On 18 October 2016 at 00:52, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> On 10/17/2016 02:54 PM, Vincent Guittot wrote:
>> On 17 October 2016 at 15:19, Peter Zijlstra <peterz@...radead.org> wrote:
>>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
> [...]
>
>>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>>>> initially because systemd will create all ~100 services (and therefore the
>>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>>>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>>>
>>> Cute... on current kernels that translates to simply removing the call
>>> to update_tg_load_avg(), lets see if we can figure out what goes
>>> sideways first though, because it _should_ decay back out. And if that
>>
>> yes, Reaching ~300000-400000 is not an issue in itself, the problem is
>> that load_avg has decayed but it has not been reflected in
>> tg->load_avg in the buggy case
>>
>>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>>
>>> I'll see if I can reproduce this with a script creating heaps of cgroups
>>> in a hurry, I have a total lack of system-disease on all my machines.
>

Hi Dietmar,

>
> Something looks weird related to the use of for_each_possible_cpu(i) in
> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>
> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
> I get:
>
> [ 5.462368]  cpu=0  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=1  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370]  cpu=2  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462371]  cpu=3  cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>

Thanks to your description above, I have been able to reproduce the
issue on my ARM platform.
The key point is that cpu_possible_mask must differ from
cpu_present_mask in order to reproduce the problem. When
cpu_present_mask equals cpu_possible_mask, I can't reproduce the
problem.
I create a 1st level task group, tg-l1. Then, each time I create a
new task group in tg-l1, tg-l1.tg_load_avg increases by 1024 * the
number of CPUs that are possible but not present, as you described
below.
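
To make the arithmetic concrete, here is a rough userspace sketch of
the per-group increase. It is not kernel code: the toy_* names and the
plain 32-bit bitmap standing in for a cpumask are assumptions made for
illustration only, and 1024 is simply the per-CPU figure quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: bit n set means CPU n is in the mask. */
typedef uint32_t toy_cpumask_t;

/* Per-CPU load each new group entity starts with, per the figure above. */
#define TOY_GROUP_SE_LOAD 1024UL

/* Count the CPUs in a mask by clearing the lowest set bit until none remain. */
static unsigned int toy_cpumask_weight(toy_cpumask_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Load that one newly created child group leaves stuck in the parent's
 * tg_load_avg: 1024 for every CPU that is possible but not present.
 */
static unsigned long toy_stale_load_per_group(toy_cpumask_t possible,
					      toy_cpumask_t present)
{
	/* Present CPUs are expected to be a subset of the possible ones. */
	assert((present & ~possible) == 0);

	return TOY_GROUP_SE_LOAD * toy_cpumask_weight(possible & ~present);
}
```

With the masks from your log (possible 0-7 is 0xff, present 0-3 is
0x0f), toy_stale_load_per_group(0xff, 0x0f) gives 4 * 1024 = 4096 per
new group, and 0 when both masks are equal.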

Thanks
Vincent

> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
> 80
>
> /proc/sched_debug:
>
> cfs_rq[0]:/system.slice
>   ...
>   .tg_load_avg                   : 323584
>   ...
>
> 80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of decay,
> this could be this extra load on the system.slice tg)
>
> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
> of system.slice tg is 0 after startup.
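
The difference between the two iterators can be sketched with a toy
model. It rests on loud assumptions that are not taken from the kernel
source: every visited CPU adds 1024 to the parent's tg_load_avg when a
group is created, CPUs that are online later decay that contribution
fully to 0, and CPUs that are not online never update it. All toy_*
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS	8	/* upper bound for this toy model only */
#define TOY_INIT_LOAD	1024UL	/* assumed initial per-CPU contribution */

/*
 * Residual tg_load_avg of the parent group after nr_groups child groups
 * have been created and every online CPU has decayed its share to 0.
 * CPUs are numbered so that 0..nr_online-1 are online and the rest of
 * 0..nr_possible-1 are possible but never run.
 */
static unsigned long toy_residual_tg_load(unsigned int nr_groups,
					  unsigned int nr_possible,
					  unsigned int nr_online,
					  bool iterate_possible)
{
	unsigned long tg_load_avg = 0;
	unsigned int last, g, cpu;

	assert(nr_online <= nr_possible && nr_possible <= TOY_NR_CPUS);

	/* Models for_each_possible_cpu() versus for_each_online_cpu(). */
	last = iterate_possible ? nr_possible : nr_online;

	for (g = 0; g < nr_groups; g++) {
		for (cpu = 0; cpu < last; cpu++) {
			unsigned long contrib = TOY_INIT_LOAD;

			/* Online CPUs get updated, so their share decays away. */
			if (cpu < nr_online)
				contrib = 0;

			tg_load_avg += contrib;
		}
	}
	return tg_load_avg;
}
```

For 80 groups with 8 possible and 4 online CPUs, iterating over the
possible CPUs leaves 80 * 4 * 1024 = 327680 behind, close to the 323584
reported above, while iterating over the online CPUs leaves 0.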