linux-kernel - Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtBbQ7HXetgSfEyFYyCNypYcRuSd_=H7xVs5P6LcTthGfQ@mail.gmail.com>
Date:   Tue, 30 Jan 2018 09:00:34 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Morten Rasmussen <morten.rasmussen@...s.arm.com>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Brendan Jackman <brendan.jackman@....com>,
        Morten Rasmussen <morten.rasmussen@....com>
Subject: Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

On 29 January 2018 at 19:43, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
> On 01/24/2018 09:25 AM, Vincent Guittot wrote:
>>
>> Hi,
>>
>> Le Thursday 18 Jan 2018 à 10:38:07 (+0000), Morten Rasmussen a écrit :
>>>
>>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>>>
>>>> Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
>
>
> [...]
>
>
>>>>
>>>> Hi Peter,
>>>>
>>>> With the patch below on top of your branch, the blocked loads are
>>>> updated and
>>>> decayed regularly. The main differences are:
>>>> - It doesn't use a timer to trig ilb but the tick and when a cpu becomes
>>>> idle.
>>>>    The main drawback of this solution is that the load is blocked when
>>>> the
>>>>    system is fully idle with the advantage of not waking up a fully idle
>>>>    system. We have to wait for the next tick or newly idle event for
>>>> updating
>>>>    blocked load when the system leaves idle stat which can be up to a
>>>> tick long.
>>>>    If this is too long, we can check for kicking ilb when task wakes up
>>>> so the
>>>>    blocked load will be updated as soon as the system leaves idle state.
>>>>    The main advantage is that we don't wake up a fully idle system every
>>>> 32ms to
>>>>    update blocked load that will be not used.
>>>> - I'm working on one more improvement to use nohz_idle_balance in the
>>>> newly
>>>>    idle case when the system is not overloaded and
>>>>    (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we
>>>> can try to
>>>>    use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it
>>>> exceed
>>>>    this_rq->avg_idle. This will remove some calls to kick_ilb and some
>>>> wake up
>>>>    of an idle cpus.
>>>
>>>
>>> This sound like what I meant in my other reply :-)
>>>
>>> It seems pointless to have a timer to update PELT if the system is
>>> completely idle, and when it isn't we can piggy back other events to
>>> make the updates happen.
>>
>>
>> The patch below implements what has been described above. It calls part of
>> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too
>> much
>> time. This removes part of ilb that are kicked on an idle cpu for updating
>> the blocked load but the ratio really depends on when the tick happens
>> compared
>> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch
>> that
>> enables to update the blocked loads when a cpu becomes idle 1 period
>> before
>> kicking an ilb and there is far less ilb because we give more chance to
>> the
>> newly idle case (time_after is replaced by time_after_eq in
>> idle_balance()).
>>
>> The patch also uses a function cfs_rq_has_blocked, which only checks the
>> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too.
>> This
>> reduce significantly the number of update of blocked load. the *_avg will
>> be
>> fully decayed in around 300~400ms but it's far longer for the *_sum which
>> have
>> a higher resolution and we can easily reach almost seconds. But only the
>> *_avg
>> are used to make decision so keeping some blocked *_sum is acceptable.
>>
>> ---
>>   kernel/sched/fair.c | 121
>> +++++++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 92 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq
>> *cfs_rq)
>>         return true;
>>   }
>>   +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> +       if (cfs_rq->avg.load_avg)
>> +               return true;
>> +
>> +       if (cfs_rq->avg.util_avg)
>> +               return true;
>> +
>> +       return false;
>> +}
>> +
>
>
> Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of
> avg.foo_sum ?

I don't think so because the *_sum are used to keep coherency bewteen
cfs_rq and cgroups when task migrates and are enqueued/dequeued so wwe
can't remove it until the *_sum are null otherwise the all cfs_rq and
group will be out of sync

>
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>>     static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>>                  */
>>                 if (cfs_rq_is_decayed(cfs_rq))
>>                         list_del_leaf_cfs_rq(cfs_rq);
>> -               else
>> +
>> +               /* Don't need periodic decay once load/util_avg are null
>> */
>> +               if (cfs_rq_has_blocked(cfs_rq))
>>                         done = false;
>>         }
>>   @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int
>> cpu)
>>         update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>>   #ifdef CONFIG_NO_HZ_COMMON
>>         rq->last_blocked_load_update_tick = jiffies;
>> -       if (cfs_rq_is_decayed(cfs_rq))
>> +       if (cfs_rq_has_blocked(cfs_rq))
>
>
> Schouldn't this be !cfs_rq_has_blocked(cfs_rq) ?

yes. I copy/pasted too quickly from sched_group_fair to not sched_group_fair

>
>>                 rq->has_blocked_load = 0;
>>   #endif
>>         rq_unlock_irqrestore(rq, &rf);
>
>
> [...]
>
>
>> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>>                  */
>>                 if (need_resched()) {
>>                         has_blocked_load = true;
>> -                       break;
>> +                       goto abort;
>> +               }
>> +
>> +               /*
>> +                * If the update is done while CPU becomes idle, we abort
>> +                * the update when its cost is higher than the average
>> idle
>> +                * time in orde to not delay a possible wake up.
>> +                */
>> +               if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle <
>> curr_cost) {
>> +                       has_blocked_load = true;
>> +                       goto abort;
>>                 }
>>                 rq = cpu_rq(balance_cpu);
>> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>>                 if (time_after_eq(jiffies, rq->next_balance)) {
>>                         struct rq_flags rf;
>>   -                     rq_lock_irq(rq, &rf);
>> +                       rq_lock_irqsave(rq, &rf);
>>                         update_rq_clock(rq);
>>                         cpu_load_update_idle(rq);
>> -                       rq_unlock_irq(rq, &rf);
>> +                       rq_unlock_irqrestore(rq, &rf);
>>                         if (flags & NOHZ_BALANCE_KICK)
>>                                 rebalance_domains(rq, CPU_IDLE);
>> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq,
>> enum cpu_idle_type idle)
>>                         next_balance = rq->next_balance;
>>                         update_next_balance = 1;
>>                 }
>
>
> Why do you do this cpu_load_update_idle(rq) even this was called with
> CPU_NEWLY_IDLE? Wouldn't it be sufficient to jump to the curr_cost
> calculation in this case?

just to keep thing similar to what happen with kick_ilb and that's an
occasion to also update the cpu_load

>
>> +
>> +               domain_cost = sched_clock_cpu(this_cpu) - t0;
>> +               curr_cost += domain_cost;
>> +
>>         }
>>   -     update_blocked_averages(this_cpu);
>> -       has_blocked_load |= this_rq->has_blocked_load;
>> +       /* Newly idle CPU doesn't need an update */
>> +       if (idle != CPU_NEWLY_IDLE) {
>> +               update_blocked_averages(this_cpu);
>> +               has_blocked_load |= this_rq->has_blocked_load;
>> +       }
>>         if (flags & NOHZ_BALANCE_KICK)
>>                 rebalance_domains(this_rq, CPU_IDLE);
>
>
> [...]