linux-kernel - Re: [PATCH v3 3/3] sched: update blocked load when newly idle

Date:   Mon, 12 Feb 2018 17:06:32 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Valentin Schneider <valentin.schneider@....com>,
        Morten Rasmussen <morten.rasmussen@...s.arm.com>,
        Brendan Jackman <brendan.jackman@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Frederic Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH v3 3/3] sched: update blocked load when newly idle

On 12 February 2018 at 16:38, Peter Zijlstra <peterz@...radead.org> wrote:
> On Mon, Feb 12, 2018 at 03:34:44PM +0100, Vincent Guittot wrote:
>> On Monday 12 Feb 2018 at 13:04:11 (+0100), Peter Zijlstra wrote:
>> > On Mon, Feb 12, 2018 at 09:07:54AM +0100, Vincent Guittot wrote:
>
>> > So I really hate this one, also I suspect it's broken, because we do this
>> > check before dropping rq->lock and _nohz_idle_balance() will take
>> > rq->lock.
>>
>> Yes, it will take both the newly idle rq lock and the idle rq lock.
>
> Right, can't do that, there's ordering rules for multiple RQ locks etc..
>
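To make the ordering constraint concrete, here is a minimal userspace
sketch, not kernel code. It assumes the usual convention of taking two
locks in a fixed global order (lowest address first). The names
fake_rq_lock, fake_lock, fake_unlock and fake_double_lock are
hypothetical, and the "lock" only records the acquisition sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a runqueue lock; seq records acquisition order. */
struct fake_rq_lock {
	bool held;
	int seq;
};

static int lock_seq;

static void fake_lock(struct fake_rq_lock *l)
{
	assert(!l->held);	/* a real spinlock would deadlock here */
	l->held = true;
	l->seq = ++lock_seq;
}

static void fake_unlock(struct fake_rq_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Take both locks in a fixed global order: lowest address first. */
static void fake_double_lock(struct fake_rq_lock *a, struct fake_rq_lock *b)
{
	if (a == b) {
		fake_lock(a);
	} else if ((uintptr_t)a < (uintptr_t)b) {
		fake_lock(a);
		fake_lock(b);
	} else {
		fake_lock(b);
		fake_lock(a);
	}
}
```

Whatever order the caller passes the two locks in, they end up being
taken in the same order. Holding this_rq->lock while going after an
arbitrary other rq lock gives no such guarantee.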
>>
>> >
>> >
>> > Aside from the above being an unreadable mess, I dislike that it breaks
>> > the various isolation crud, we should not touch CPUs outside of our
>> > domain.
>> >
>> >
>> > Maybe something like the below? (unfinished)
>> >
>>
>> Good catch. I completely missed the isolation stuff.
>> But isn't that already the case when kicking the ilb? I mean that an idle CPU
>> touches all idle CPUs, and some can be outside its domain during ilb.
>
>> Shouldn't we test housekeeping_cpu(cpu, HK_FLAG_SCHED) instead if we want to
>> make sure that an isolated/full nohz CPU will not be used for updating blocked
>> load of CPUs outside its domain ?
>
> I _thought_ we had some 'housekeeping' crud in the ilb selection logic,
> but now I can't find it. Frederic?
>
>> Is something like the below more readable?
>>
>>               /*
>> +              * This CPU doesn't want to be disturbed by scheduler
>> +              * housekeeping
>>                */
>> +             if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>> +                     goto out;
>> +
>> +             /* Will wake up very soon. No time for doing anything else*/
>> +             if (this_rq->avg_idle < sysctl_sched_migration_cost)
>> +                     goto out;
>> +
>> +             /* Don't need to update blocked load of idle CPUs*/
>> +             if (!has_blocked || time_after_eq(jiffies, next_blocked))
>> +                     goto out;
>> +
>> +             raw_spin_unlock(&this_rq->lock);
>> +             /*
>> +              * This CPU is going to be idle and blocked load of idle CPUs
>> +              * need to be updated. Run the ilb locally as it is a good
>> +              * candidate for ilb instead of waking up another idle CPU.
>> +              * Kick a normal ilb if we failed to do the update.
>> +              */
>> +             if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>>                       kick_ilb(NOHZ_STATS_KICK);
>> +             raw_spin_lock(&this_rq->lock);
>>
>>               goto out;
>
> It is, but I think you're still doing that avg_idle thing twice now,
> right?

Yes, the goal was to avoid exceeding the idle time, but I wonder if it is
really needed, because the need_resched() check in the
"for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {" loop will abort the
loop if something is scheduled on this_cpu, just like for a normal ilb.
So I think that we can remove this avg_idle test.
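
To illustrate the abort behaviour, here is a minimal userspace model of
that loop, not the kernel code. The struct ilb_model, the bitmask walk
and the resched_after trigger are all hypothetical stand-ins for
nohz.idle_cpus_mask and need_resched().

```c
#include <assert.h>
#include <stdbool.h>

#define NR_FAKE_CPUS 8

/* Hypothetical model: bit n of idle_mask set means CPU n is idle. */
struct ilb_model {
	unsigned int idle_mask;
	int resched_after;	/* need_resched() turns true after this many updates */
	int updated;		/* number of CPUs whose blocked load was aged */
};

static bool model_need_resched(const struct ilb_model *m)
{
	return m->updated >= m->resched_after;
}

/* Returns true if every idle CPU was visited, false if the walk was aborted. */
static bool model_idle_walk(struct ilb_model *m)
{
	assert(m);
	for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
		if (!(m->idle_mask & (1u << cpu)))
			continue;
		/* Work showed up for this CPU: stop and let the caller kick an ilb. */
		if (model_need_resched(m))
			return false;
		m->updated++;
	}
	return true;
}
```

A false return is the case where the caller would fall back to
kick_ilb(NOHZ_STATS_KICK), so the remaining idle CPUs still get their
blocked load updated by another CPU.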

>
>> > @@ -7850,7 +7850,7 @@ static bool update_nohz_stats(struct rq
>> >     if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
>> >             return false;
>> >
>> > -   if (!time_after(jiffies, rq->last_blocked_load_update_tick))
>> > +   if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
>>
>> This fixes the concern raised on the other thread, doesn't it?
>
> Yes.
>
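For reference, a small userspace sketch of what the force flag changes.
tick_after() mirrors the wraparound-safe idea behind the kernel's
time_after(). struct fake_rq and maybe_update() are hypothetical, and
unlike update_nohz_stats() the return value here only says whether the
update ran.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is after b", same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical stand-in for the per-rq state used by the rate limit. */
struct fake_rq {
	unsigned long last_update_tick;
	int updates;
};

/* Returns true when the update was performed. */
static bool maybe_update(struct fake_rq *rq, unsigned long now, bool force)
{
	assert(rq);
	if (!force && !tick_after(now, rq->last_update_tick))
		return false;	/* already updated during this tick */
	rq->last_update_tick = now;
	rq->updates++;
	return true;
}
```

Without force, a second call within the same tick is skipped. With
force, the update always runs, which is what the newly idle path wants.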
>> > +static void nohz_age(struct sched_domain *sd)
>> > +{
>> > +   struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> > +   bool has_blocked_load;
>> > +   int cpu;
>> > +
>> > +   WRITE_ONCE(nohz.has_blocked, 0);
>> > +
>> > +   smp_mb();
>> > +
>> > +   cpumask_and(cpus, sched_domain_span(sd), nohz.idle_cpus_mask);
>> > +
>> > +   has_blocked_load = cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(sd));
>> > +
>> > +   for_each_cpu(cpu, cpus) {
>> > +           struct rq *rq = cpu_rq(cpu);
>> > +
>> > +           has_blocked_load |= update_nohz_stats(rq, true);
>> > +   }
>> > +
>> > +   if (has_blocked_load)
>> > +           WRITE_ONCE(nohz.has_blocked, 1);
>> > +}
>> > +
>>
>> We duplicate what is done in nohz_idle_balance.
>
> In parts yes.. I was too lazy to combine :-)
>
>> > @@ -8919,9 +8955,13 @@ static int idle_balance(struct rq *this_
>> >             if (sd->flags & SD_BALANCE_NEWIDLE) {
>> >                     t0 = sched_clock_cpu(this_cpu);
>> >
>> > -                   pulled_task = load_balance(this_cpu, this_rq,
>> > -                                              sd, CPU_NEWLY_IDLE,
>> > -                                              &continue_balancing);
>> > +                   if (nohz_blocked) {
>> > +                           nohz_age(sd);
>>
>> Do we really need to loop over all sched_domain levels of the newly idle
>> CPU and call nohz_age for each level?
>> Can't we call nohz_age only with the widest/last sched_domain level?
>
> Yeah, dunno. I went back and forth on that a bit. The largest is
> rq->rd->span. The reason I settled on this variant in the end is that it
> keeps locality. When short idle, it will only scan nearby CPUs instead
> of reaching half-way across the machine.
>
>> Furthermore, we use sd->max_newidle_lb_cost to decide to abort the loop.
>> But this is updated with full load balancing which is longer than just
>> updating blocked load.
>> This will increase the chance to abort before reaching the last level.
>
> Yes.. I figured we'd take that hit :/
>
>> > +                   } else {
>> > +                           pulled_task = load_balance(this_cpu, this_rq,
>> > +                                           sd, CPU_NEWLY_IDLE,
>> > +                                           &continue_balancing);
>> > +                   }
>> >
>> >                     domain_cost = sched_clock_cpu(this_cpu) - t0;
>> >                     if (domain_cost > sd->max_newidle_lb_cost)
>>
>> We have to kick an ilb if we must abort before looping over all levels and
>> all idle CPUs; otherwise we can end up in a situation where the load of
>> some idle CPUs stays blocked.
>
> Yes, like said, was unfinished, I gave up before I got to that.
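
To summarize the fallback I have in mind, here is a rough userspace
sketch, not a patch. The names fake_domain, walk_result and
fake_newidle_walk are hypothetical. scan_cost stands in for
sd->max_newidle_lb_cost and idle_budget for this_rq->avg_idle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-level data: the cost of scanning that domain level. */
struct fake_domain {
	unsigned long scan_cost;
};

struct walk_result {
	size_t levels_done;
	bool need_kick;	/* true when the walk stopped early */
};

/*
 * Walk domain levels from narrowest to widest while the idle budget
 * covers the next level; request a kick if any level was skipped.
 */
static struct walk_result fake_newidle_walk(const struct fake_domain *sd,
					    size_t nr_levels,
					    unsigned long idle_budget)
{
	struct walk_result res = { 0, false };
	unsigned long spent = 0;

	assert(sd || nr_levels == 0);
	for (size_t i = 0; i < nr_levels; i++) {
		if (spent + sd[i].scan_cost > idle_budget) {
			res.need_kick = true;
			break;
		}
		spent += sd[i].scan_cost;
		res.levels_done++;
	}
	return res;
}
```

With a short idle period only the nearby levels are scanned, which
keeps the locality, and need_kick marks the case where an ilb still has
to be kicked for the CPUs that were not reached.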