Date:   Fri, 24 Jun 2022 11:30:32 +0800
From:   Chen Yu <yu.c.chen@...el.com>
To:     Abel Wu <wuyun.abel@...edance.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Josh Don <joshdon@...gle.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [External] Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search

On Wed, Jun 22, 2022 at 11:52:19AM +0800, Abel Wu wrote:
> Hi Chen, thanks for your comments!
> 
> On 6/22/22 2:14 AM, Chen Yu wrote:
> > > ...
> > Reusing the data from load balance to select an unoccupied candidate
> > seems applicable IMO, and is also aligned with the SIS_UTIL path. I have
> > a question regarding the update frequency. In the v3 patch, the update is
> > based on the periodic tick, which fires at most every 1ms (CONFIG_HZ_1000).
> > The periodic SMT load balance is then launched every smt_weight ms, usually 2ms.
> > I'd expect that 2ms to be on the same scale as the 1ms tick, and since
> > tick-based update is acceptable, would excluding the CPU_NEWLY_IDLE balance
> > from this patch reduce the overhead while not introducing too much inaccuracy?
> 
> I doubt the periodic balancing entry alone is enough. The ms scale could
> still be too coarse for the wakeup path. Say one cpu becomes newly idle
> just after an update; it then stays invisible to the filter until the
> next update, which is nearly 2ms away (even worse in the SMT4/8 case),
> wasting time slices doing nothing. So the newly-idle update is important
> to keep the filter fresh. And the false-positive correction is there to
> avoid excessive updates due to newly-idle balancing, by allowing the
> false-positive cpus to stay in the filter a little longer.
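
Just to check my understanding of the false-positive handling, is the shape
roughly like the sketch below? (The mask and helper names here are made up
for illustration, not taken from the patch.)

/*
 * Hypothetical sketch: a cpu observed busy while still present in
 * the filter is only marked as a false positive; it is evicted at
 * the next update instead of immediately, which bounds the update
 * rate from the newly-idle path.
 */
static void sis_note_busy(struct cpumask *icpus, struct cpumask *fp, int cpu)
{
	if (cpumask_test_cpu(cpu, icpus))
		__cpumask_set_cpu(cpu, fp);	/* tolerate for now */
}

static void sis_update_filter(struct cpumask *icpus, struct cpumask *fp)
{
	/* Evict cpus that stayed busy since the last check. */
	cpumask_andnot(icpus, icpus, fp);
	cpumask_clear(fp);
}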

> 
> > > @@ -8757,7 +8794,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > >   		 * No need to call idle_cpu() if nr_running is not 0
> > >   		 */
> > >   		if (!nr_running && idle_cpu(i)) {
> > > +			/*
> > > +			 * Prefer the last idle cpu by overwriting
> > > +			 * the previous one. The first idle cpu in
> > > +			 * this domain (if any) can trigger balancing
> > > +			 * and be fed with tasks, so we'd better
> > > +			 * choose a candidate in the opposite way.
> > > +			 */
> > > +			sds->idle_cpu = i;
> > Does it mean that only one idle CPU in the smt domain would be set in
> > the idle cpu mask at a time? For SMT4/8 we might lose track of the
> > idle siblings.
> 
> Yes. The intention of the one-at-a-time propagation is to
> 1) help spread load out to different cores, and
> 2) reduce some update overhead.
> This way, if the filter contains several cpus of a core, we can
> ideally be sure that at least one of them is actually unoccupied.
> For SMT4/8 we still have the newly-idle balance to make things right.
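
If so, the propagation would look roughly like this? The name
sd_update_icpus() is from the patch, but the body below is only my guess
(idle_cpus_of_llc() is a made-up helper):

/*
 * Hypothetical sketch of one-candidate-per-core propagation: drop
 * all siblings of this core from the filter, then add back only the
 * last observed idle cpu, so a filter hit implies the core is
 * likely not fully occupied.
 */
static void sd_update_icpus(int core, int icpu)
{
	struct cpumask *icpus = idle_cpus_of_llc(core);	/* made-up helper */

	cpumask_andnot(icpus, icpus, cpu_smt_mask(core));
	if (icpu >= 0)
		__cpumask_set_cpu(icpu, icpus);
}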
> 
> > >   			sgs->idle_cpus++;
> > > +
> > >   			/* Idle cpu can't have misfit task */
> > >   			continue;
> > >   		}
> > > @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > >   static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
> > >   {
> > > -	if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
> > > -		set_idle_cpus(env->dst_cpu, true);
> > > +	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
> > > +	enum sd_state new = sds->sd_state;
> > > +	int this = env->dst_cpu;
> > > +
> > > +	/*
> > > +	 * Parallel updating can hardly improve the accuracy of
> > > +	 * the filter, and it can add to the burden of cache
> > > +	 * traffic.
> > > +	 */
> > > +	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
> > > +		return;
> > > +
> > > +	/*
> > > +	 * There is at least one unoccupied cpu available, so
> > > +	 * propagate it to the filter to avoid the false-negative
> > > +	 * issue, which could result in losing track of some
> > > +	 * idle cpus and thus degraded throughput.
> > > +	 */
> > > +	if (new != sd_is_busy) {
> > > +		if (!test_idle_cpus(this))
> > > +			set_idle_cpus(this, true);
> > > +	} else {
> > > +		/*
> > > +		 * Nothing changes so nothing to update or
> > > +		 * propagate.
> > > +		 */
> > > +		if (sd_smt_shared->state == sd_is_busy)
> > > +			goto out;
> > > +	}
> > > +
> > > +	sd_update_icpus(this, sds->idle_cpu);
> > I wonder if we could further enhance it to facilitate the idle CPU scan.
> > For example, can we propagate the idle CPUs in the smt domain to its parent
> > domain in a hierarchical sequence, and finally to the LLC domain? If there is
> 
> In fact, my first attempt was to cache the unoccupied cpus in the SMT
> shared domain, but the overhead of the cpumask ops seems like a major
> stumbling block.
> 
> > a cluster domain between the SMT and LLC domains, the cluster-domain idle
> > CPU filter could benefit from this mechanism.
> > https://lore.kernel.org/lkml/20220609120622.47724-3-yangyicong@hisilicon.com/
> 
> Putting SIS into a hierarchical pattern is good for cache locality.
> But I don't think a multi-level filter is appropriate, since it could
> bring too much cache traffic in SIS,
Could you please elaborate a little more on the cache traffic? I thought we
wouldn't save the unoccupied cpus in the SMT shared domain, but store them in
a middle-layer shared domain, say, cluster->idle_cpus. This would reduce
cache-write contention compared to writing to llc->idle_cpus directly, because
a smaller set of CPUs shares the idle_cpus filter. Similarly, SIS can scan
cluster->idle_cpus first, without having to query llc->idle_cpus. This looks
like splitting a big lock into fine-grained smaller locks.
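
To be concrete, something like the hypothetical shape below is what I have
in mind (cluster_icpus() and llc_icpus() are made-up helpers for the
per-domain filters, not existing kernel APIs):

/*
 * Hypothetical two-level scan: consult the small cluster-scope
 * filter first, and fall back to the LLC-scope filter only on a
 * miss. Writers in a cluster then mostly touch their own cluster
 * mask instead of the LLC-wide one.
 */
static int sis_scan_two_level(struct task_struct *p, int target)
{
	int cpu;

	cpu = cpumask_any_and(cluster_icpus(target), p->cpus_ptr);
	if (cpu < nr_cpu_ids)
		return cpu;

	return cpumask_any_and(llc_icpus(target), p->cpus_ptr);
}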
> and it could be expected to be
> a disaster for netperf/tbench or the workloads suffering frequent
> context switches.
>
So does this overhead come mainly from the NEWLY_IDLE case?
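
If so, would something as simple as the sketch below avoid most of it?
(env->idle and CPU_NEWLY_IDLE are existing kernel symbols; the placement of
the check is only a guess on my side.)

static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
{
	/*
	 * Hypothetical: skip filter maintenance for newly-idle
	 * balancing and rely on the periodic/idle entries only.
	 */
	if (env->idle == CPU_NEWLY_IDLE)
		return;

	/* ... filter update as in the patch ... */
}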

thanks,
Chenyu
> > 
> > Furthermore, even if there is no cluster domain, would a 'virtual' middle
> > sched domain help reduce the contention?
> > Core0 (CPU0,CPU1), Core1 (CPU2,CPU3), Core2 (CPU4,CPU5), Core3 (CPU6,CPU7)
> > We can create cpumask1, composed of Core0 and Core1, and cpumask2, composed
> > of Core2 and Core3. SIS would first scan cpumask1 and, if no idle cpu is
> > found, scan cpumask2. This way, the CPUs in Core0 and Core1 only update
> > cpumask1, without competing with Core2 and Core3 on cpumask2.
> > 
> Yes, this is the best case, but the worst case is something that
> we probably can't afford.
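
(For reference, the update side of the split proposed above could look
roughly like this hypothetical sketch; half_llc_mask() is a made-up helper
that returns the half-LLC cpumask a given cpu belongs to:

/*
 * Hypothetical: each cpu updates only the half-LLC mask it belongs
 * to, so Core0/Core1 never contend with Core2/Core3 on the same
 * cacheline.
 */
static void sis_update_split(int cpu, bool idle)
{
	struct cpumask *half = half_llc_mask(cpu);	/* made-up helper */

	if (idle)
		__cpumask_set_cpu(cpu, half);
	else
		__cpumask_clear_cpu(cpu, half);
}
)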
> 
> Thanks & BR,
> Abel
