Date:   Tue, 30 Jan 2018 11:41:34 +0000
From:   Valentin Schneider <valentin.schneider@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Morten Rasmussen <morten.rasmussen@...s.arm.com>,
        Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Brendan Jackman <brendan.jackman@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>
Subject: Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK

(Resending because I snuck in some HTML... Apologies)

On 01/30/2018 08:32 AM, Vincent Guittot wrote:
> On 29 January 2018 at 20:31, Valentin Schneider
> <valentin.schneider@....com> wrote:
>> Hi Vincent, Peter,
>>
>> I've been running some tests on your patches (Peter's base + the 2 from
>> Vincent). The results themselves are hosted at [1].
>> The base of those tests is the same: a task ("accumulator") is run for 5
>> seconds (arbitrary value) to accumulate some load, then goes to sleep for .5
>> seconds.
>>
>> I've set up 3 test scenarios:
>>
>> Update by nohz_balancer_kick()
>> ------------------------------
>> Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
>> utilization) is spawned on another CPU. It won't go idle so the only way to
>> update the blocked load generated by "accumulator" is to kick an ILB
>> (NOHZ_STATS_KICK).
>>
>> The test shows that this is behaving nicely - we keep kicking an ILB every
>> ~36ms (see next test for comments on that) until there is no more blocked
>> load. I did however notice some interesting scenarios: after the load has
>> been fully decayed, a tiny background task can spawn and end in less than a
>> scheduling period. However, it still goes through nohz_balance_enter_idle(),
>> and thus sets nohz.stats_state, which will later cause an ILB kick.
>>
>> This makes me wonder if it's worth kicking ILBs for such tiny load values -
>> perhaps it could be worth having a margin to set rq->has_blocked_load?
> 
> So it's difficult to know what the load/utilization on the cfs_rq will
> be once the cpu wakes up. Even if the cpu is awake for only a really
> short time, that doesn't mean that the load/utilization is small: it
> can be a big task that migrated there and just happened to have a very
> short wakeup this time.
> That's why I don't make any assumption on the utilization/load value
> when a cpu goes to sleep.
> 

Right, hadn't thought about those kinds of migrations.

>>
>> Furthermore, this tiny task will cause the ILB to iterate over all of the
>> idle CPUs, although only one has stale load. For load update via NEWLY_IDLE
>> load_balance() we use:
>>
>> static bool update_nohz_stats(struct rq *rq)
>> {
>>      if (!rq->has_blocked_load)
>>              return false;
>>      [...]
>> }
>>
>> But for load update via _nohz_idle_balance(), we iterate through all of the
>> nohz CPUs and unconditionally call update_blocked_averages(). This could be
>> avoided by remembering which CPUs have stale load before going idle.
>> Initially I thought that was what nohz.stats_state was for, but it isn't.
>> With Vincent's patches it's only ever set to either 0 or 1, but we could use
>> it as a CPU mask, and use it to skip nohz CPUs that don't have stale load in
>> _nohz_idle_balance() (when NOHZ_STATS_KICK).
> 
> I have studied a way to keep track of how many cpus still have blocked
> load, to try to minimize the number of useless ilb kicks, but this adds
> more atomic operations, which can impact system throughput under heavy
> load with lots of very small wakeups. That's why I have proposed this
> simpler solution. But it's probably just a matter of where we want to
> "waste" time: either we accept to spend a bit more time checking the
> state of idle CPUs, or we accept to kick an ilb from time to time for
> no good reason.
> 

Agreed. I have the feeling that spending more time doing atomic ops 
could be worth it - I'll try to test this out and see if it's actually 
relevant.
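
For reference, the rough shape I had in mind is something like the below.
It's a completely untested sketch just to make the idea concrete:
"stale_blocked_cpus" is a made-up field (allocation not shown), and the
ordering/atomicity questions you mention are ignored here - it only shows
where the extra atomics would land:

static struct {
	/* ... existing nohz fields ... */
	cpumask_var_t stale_blocked_cpus; /* made up: CPUs with blocked load left to decay */
} nohz;

void nohz_balance_enter_idle(int cpu)
{
	/* mark this CPU as having blocked load that still needs decaying */
	cpumask_set_cpu(cpu, nohz.stale_blocked_cpus);
	/* ... */
}

static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
			       enum cpu_idle_type idle)
{
	int balance_cpu;

	/* only walk the idle CPUs that actually have stale blocked load */
	for_each_cpu_and(balance_cpu, nohz.idle_cpus_mask,
			 nohz.stale_blocked_cpus) {
		update_blocked_averages(balance_cpu);

		/* fully decayed: no need to visit this CPU again */
		if (!cpu_rq(balance_cpu)->has_blocked_load)
			cpumask_clear_cpu(balance_cpu, nohz.stale_blocked_cpus);
	}
	/* ... */
	return true;
}

The added cost is the cpumask_set_cpu() on every idle entry plus the
cpumask_clear_cpu() once a CPU is fully decayed, which is exactly the
"where do we waste time" trade-off you describe.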

>>
>> Update by idle_balance()
>> ------------------------
>> Right before the "accumulator" task goes to sleep, a tiny periodic
>> (period=32ms) task is spawned on another CPU. It's expected that it will
>> update the blocked load in idle_balance(), either by running
>> _nohz_idle_balance() locally or kicking an ILB (The overload flag shouldn't
>> be set in this test case, so we shouldn't go through the NEWLY_IDLE
>> load_balance()).
>>
>> This also seems to be working fine, but I'm noticing a delay between load
>> updates that is closer to 64ms than 32ms. After digging into it I found out
>> that the time checks done in idle_balance() and nohz_balancer_kick() are
>> time_after(jiffies, next_stats), but IMHO they should be
>> time_after_eq(jiffies, next_stats) to have 32ms-based updates. This also
>> explains the 36ms periodicity of the updates in the test above.
> 
> I used 32ms as a minimum value between updates. We must use time_after()
> if we want to have at least 32ms between each update. We will have a
> 36ms period if the previous update was triggered by the tick (just after
> it, in fact), but there will be only 32ms if the last update was done
> during an idle_balance() that happened just before the tick. With
> time_after_eq(), the update period will be between 28 and 32ms.
> 
> Then, I mentioned a possible optimization: using time_after_eq() in
> idle_balance() so a newly idle cpu will have more of a chance (between
> 0 and 4ms for HZ=250) to do the update before an ilb is kicked.
> 

IIUC with time_after() the update period should be within ]32, 36] ms, 
but it looks like I'm always on that upper bound in my tests.

When evaluating whether we need to kick_ilb() for load updates, we'll 
always be right after the tick (excluding the case in idle_balance), 
which explains why we wait for an extra tick in the "update by 
nohz_balancer_kick()" test case.

The tricky part is that, as you say, the update by idle_balance() can 
happen anywhere between [0-4[ ms after a tick (or before, depending on 
how you see it), so using time_after_eq could make the update period < 
32ms - and this also impacts a load update by nohz_balancer_kick() if the 
previous update was done by idle_balance()... This is what causes the 
update period to be closer to 64ms in my test case, but it's somewhat 
artificial because I only have a 32ms-periodic task running - if there 
was any other task running the period could remain in that ]32, 36] ms 
interval.
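
To put rough numbers on that reading (assuming HZ=250, i.e. 4ms jiffies,
and a 32ms = 8 jiffies stats period): if an update at jiffies=1000 arms
next_stats=1008, then with time_after() a check done on the tick only
fires at jiffies=1009, i.e. 36ms later, whereas time_after_eq() would
already fire at 1008 (32ms). An idle_balance() running between the 1008
and 1009 ticks still reads jiffies=1008, so with time_after() it won't do
the update either - hence the 0-4ms window you get by giving
idle_balance() the time_after_eq() check.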

Did I get that right?

> Thanks,
> Vincent
> 
>>
>>
>> No update (idle system)
>> -----------------------
>> Nothing special here, just making sure nothing happens when the system is
>> fully idle. On a side note, that's relatively hard to achieve - I had to
>> switch over to Juno because my HiKey960 gets interrupts every 16ms. The Juno
>> still gets woken up every now and then but it's a bit quieter.
>>
>>
>> [1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0
>>
>>
>>
>> On 01/24/2018 08:25 AM, Vincent Guittot wrote:
>>>
>>> Hi,
>>>
>>> On Thursday 18 Jan 2018 at 10:38:07 (+0000), Morten Rasmussen wrote:
>>>>
>>>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>>>>
>>>>> On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 22 December 2017 at 21:42, Peter Zijlstra <peterz@...radead.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>>>>>>>>
>>>>>>>> Right; but I figured we'd try and do it 'right' and see how horrible
>>>>>>>> it is before we try and do funny things.
>>>>>>>
>>>>>>> So now it should have a 32ms tick for up to .5s when the system goes
>>>>>>> completely idle.
>>>>>>>
>>>>>>> No idea how bad that is..
>>>>>>
>>>>>> I have tested your branch, but the timer doesn't seem to fire correctly
>>>>>> because I can still see blocked load in the use case I have run.
>>>>>> I haven't found the reason yet.
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> With the patch below on top of your branch, the blocked loads are
>>>>> updated and decayed regularly. The main differences are:
>>>>> - It doesn't use a timer to trigger the ilb, but the tick and a cpu
>>>>>   becoming idle. The main drawback of this solution is that the load
>>>>>   stays blocked when the system is fully idle, with the advantage of
>>>>>   not waking up a fully idle system. We have to wait for the next tick
>>>>>   or a newly idle event to update the blocked load when the system
>>>>>   leaves idle state, which can be up to a tick long. If this is too
>>>>>   long, we can check for kicking an ilb when a task wakes up, so the
>>>>>   blocked load will be updated as soon as the system leaves idle state.
>>>>>   The main advantage is that we don't wake up a fully idle system
>>>>>   every 32ms to update blocked load that will not be used.
>>>>> - I'm working on one more improvement: using nohz_idle_balance in the
>>>>>   newly idle case when the system is not overloaded and
>>>>>   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we
>>>>>   can try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as
>>>>>   soon as it exceeds this_rq->avg_idle. This will remove some calls to
>>>>>   kick_ilb and some wake ups of idle cpus.
>>>>
>>>> This sounds like what I meant in my other reply :-)
>>>>
>>>> It seems pointless to have a timer to update PELT if the system is
>>>> completely idle, and when it isn't we can piggy back other events to
>>>> make the updates happen.
>>>
>>> The patch below implements what has been described above. It calls part
>>> of nohz_idle_balance when a cpu becomes idle and kicks an ilb if it
>>> takes too much time. This removes part of the ilbs that are kicked on an
>>> idle cpu just to update the blocked load, but the ratio really depends
>>> on when the tick happens compared to a cpu becoming idle and the 32ms
>>> boundary. I have an additional patch that updates the blocked loads when
>>> a cpu becomes idle 1 period before kicking an ilb, and there are far
>>> fewer ilbs because we give more chance to the newly idle case
>>> (time_after is replaced by time_after_eq in idle_balance()).
>>>
>>> The patch also uses a function cfs_rq_has_blocked, which only checks
>>> util/load_avg, instead of cfs_rq_is_decayed, which checks the *_sum too.
>>> This significantly reduces the number of updates of blocked load. The
>>> *_avg will be fully decayed in around 300~400ms, but it takes far longer
>>> for the *_sum, which have a higher resolution, so we can easily reach
>>> almost seconds. But only the *_avg are used to make decisions, so
>>> keeping some blocked *_sum is acceptable.
>>>
>>> ---
>>>  kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
>>>  1 file changed, 92 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 898785d..ed90303 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>>>  	return true;
>>>  }
>>> 
>>> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>>> +{
>>> +	if (cfs_rq->avg.load_avg)
>>> +		return true;
>>> +
>>> +	if (cfs_rq->avg.util_avg)
>>> +		return true;
>>> +
>>> +	return false;
>>> +}
>>> +
>>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>> 
>>>  static void update_blocked_averages(int cpu)
>>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>>>  		 */
>>>  		if (cfs_rq_is_decayed(cfs_rq))
>>>  			list_del_leaf_cfs_rq(cfs_rq);
>>> -		else
>>> +
>>> +		/* Don't need periodic decay once load/util_avg are null */
>>> +		if (cfs_rq_has_blocked(cfs_rq))
>>>  			done = false;
>>>  	}
>>> 
>>> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>>>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>>  	rq->last_blocked_load_update_tick = jiffies;
>>> -	if (cfs_rq_is_decayed(cfs_rq))
>>> +	if (cfs_rq_has_blocked(cfs_rq))
>>>  		rq->has_blocked_load = 0;
>>>  #endif
>>>  	rq_unlock_irqrestore(rq, &rf);
>>> @@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>>>  		*next_balance = next;
>>>  }
>>> 
>>> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle);
>>>  static void kick_ilb(unsigned int flags);
>>> 
>>>  /*
>>> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>  			update_next_balance(sd, &next_balance);
>>>  		rcu_read_unlock();
>>> 
>>> -		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
>>> +		/*
>>> +		 * Update blocked idle load if it has not been done for a
>>> +		 * while. Try to do it locally before entering idle but kick
>>> +		 * an ilb if it takes too much time and might delay the next
>>> +		 * local wake up
>>> +		 */
>>> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
>>> +				!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>>>  			kick_ilb(NOHZ_STATS_KICK);
>>> 
>>>  		goto out;
>>> @@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
>>>  	if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>>>  		return;
>>> 
>>> +	rq->has_blocked_load = 1;
>>>  	if (rq->nohz_tick_stopped)
>>>  		return;
>>> @@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
>>>  		return;
>>> 
>>>  	rq->nohz_tick_stopped = 1;
>>> -	rq->has_blocked_load = 1;
>>>  	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>>>  	atomic_inc(&nohz.nr_cpus);
>>> @@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
>>>  	 * enable the periodic update of the load of idle cpus
>>>  	 */
>>>  	atomic_set(&nohz.stats_state, 1);
>>> -
>>>  }
>>>  #else
>>>  static inline void nohz_balancer_kick(struct rq *rq) { }
>>> @@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>>> 
>>>  #ifdef CONFIG_NO_HZ_COMMON
>>>  /*
>>> - * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>>> - * rebalancing for all the cpus for whom scheduler ticks are stopped.
>>> + * Internal function that runs load balance for all idle cpus. The load balance
>>> + * can be a simple update of blocked load or a complete load balance with
>>> + * task movement depending on the flags.
>>> + * For newly idle mode, we abort the loop if it takes too much time and return
>>> + * false to notify that the loop has not been completed and an ilb should be kicked.
>>>   */
>>> -static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle)
>>>  {
>>>  	/* Earliest time when we have to do rebalance again */
>>>  	unsigned long now = jiffies;
>>> @@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  	bool has_blocked_load = false;
>>>  	int update_next_balance = 0;
>>>  	int this_cpu = this_rq->cpu;
>>> -	unsigned int flags;
>>>  	int balance_cpu;
>>> +	int ret = false;
>>>  	struct rq *rq;
>>> -
>>> -	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
>>> -		return false;
>>> -
>>> -	if (idle != CPU_IDLE) {
>>> -		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>>> -		return false;
>>> -	}
>>> -
>>> -	/*
>>> -	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>>> -	 */
>>> -	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>>> -	if (!(flags & NOHZ_KICK_MASK))
>>> -		return false;
>>> +	u64 curr_cost = 0;
>>> 
>>>  	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>>> @@ -9428,6 +9437,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  	atomic_set(&nohz.stats_state, 0);
>>> 
>>>  	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
>>> +		u64 t0, domain_cost;
>>> +
>>> +		t0 = sched_clock_cpu(this_cpu);
>>> +
>>>  		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>>>  			continue;
>>> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  		 */
>>>  		if (need_resched()) {
>>>  			has_blocked_load = true;
>>> -			break;
>>> +			goto abort;
>>> +		}
>>> +
>>> +		/*
>>> +		 * If the update is done while the CPU becomes idle, we abort
>>> +		 * the update when its cost is higher than the average idle
>>> +		 * time in order to not delay a possible wake up.
>>> +		 */
>>> +		if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
>>> +			has_blocked_load = true;
>>> +			goto abort;
>>>  		}
>>> 
>>>  		rq = cpu_rq(balance_cpu);
>>> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  		if (time_after_eq(jiffies, rq->next_balance)) {
>>>  			struct rq_flags rf;
>>> 
>>> -			rq_lock_irq(rq, &rf);
>>> +			rq_lock_irqsave(rq, &rf);
>>>  			update_rq_clock(rq);
>>>  			cpu_load_update_idle(rq);
>>> -			rq_unlock_irq(rq, &rf);
>>> +			rq_unlock_irqrestore(rq, &rf);
>>> 
>>>  			if (flags & NOHZ_BALANCE_KICK)
>>>  				rebalance_domains(rq, CPU_IDLE);
>>> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  			next_balance = rq->next_balance;
>>>  			update_next_balance = 1;
>>>  		}
>>> +
>>> +		domain_cost = sched_clock_cpu(this_cpu) - t0;
>>> +		curr_cost += domain_cost;
>>> +
>>>  	}
>>> 
>>> -	update_blocked_averages(this_cpu);
>>> -	has_blocked_load |= this_rq->has_blocked_load;
>>> +	/* Newly idle CPU doesn't need an update */
>>> +	if (idle != CPU_NEWLY_IDLE) {
>>> +		update_blocked_averages(this_cpu);
>>> +		has_blocked_load |= this_rq->has_blocked_load;
>>> +	}
>>> 
>>>  	if (flags & NOHZ_BALANCE_KICK)
>>>  		rebalance_domains(this_rq, CPU_IDLE);
>>> @@ -9477,6 +9507,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  	WRITE_ONCE(nohz.next_stats,
>>>  		now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>>> 
>>> +	/* The full idle balance loop has been done */
>>> +	ret = true;
>>> +
>>> +abort:
>>>  	/* There is still blocked load, enable periodic update */
>>>  	if (has_blocked_load)
>>>  		atomic_set(&nohz.stats_state, 1);
>>> @@ -9489,6 +9523,35 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>  	if (likely(update_next_balance))
>>>  		nohz.next_balance = next_balance;
>>> 
>>> +	return ret;
>>> +}
>>> +
>>> +/*
>>> + * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
>>> + * rebalancing for all the cpus for whom scheduler ticks are stopped.
>>> + */
>>> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>> +{
>>> +	int this_cpu = this_rq->cpu;
>>> +	unsigned int flags;
>>> +
>>> +	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
>>> +		return false;
>>> +
>>> +	if (idle != CPU_IDLE) {
>>> +		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>>> +		return false;
>>> +	}
>>> +
>>> +	/*
>>> +	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
>>> +	 */
>>> +	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
>>> +	if (!(flags & NOHZ_KICK_MASK))
>>> +		return false;
>>> +
>>> +	_nohz_idle_balance(this_rq, flags, idle);
>>> +
>>>  	return true;
>>>  }
>>>  #else
>>
>>
