Message-ID: <3e4d2594-f678-b77a-4883-0b893daf19f6@bytedance.com>
Date:   Mon, 27 Jun 2022 18:13:23 +0800
From:   Abel Wu <wuyun.abel@...edance.com>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Josh Don <joshdon@...gle.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search


On 6/24/22 11:30 AM, Chen Yu wrote:
>> ...
>>>> @@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>>>>    static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
>>>>    {
>>>> -	if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
>>>> -		set_idle_cpus(env->dst_cpu, true);
>>>> +	struct sched_domain_shared *sd_smt_shared = env->sd->shared;
>>>> +	enum sd_state new = sds->sd_state;
>>>> +	int this = env->dst_cpu;
>>>> +
>>>> +	/*
>>>> +	 * Parallel updating hardly improves the accuracy of
>>>> +	 * the filter, and it only adds to the cache traffic.
>>>> +	 */
>>>> +	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * There is at least one unoccupied cpu available, so
>>>> +	 * propagate it to the filter to avoid a false-negative
>>>> +	 * issue, which could result in losing track of some
>>>> +	 * idle cpus and thus degraded throughput.
>>>> +	 */
>>>> +	if (new != sd_is_busy) {
>>>> +		if (!test_idle_cpus(this))
>>>> +			set_idle_cpus(this, true);
>>>> +	} else {
>>>> +		/*
>>>> +		 * Nothing changes so nothing to update or
>>>> +		 * propagate.
>>>> +		 */
>>>> +		if (sd_smt_shared->state == sd_is_busy)
>>>> +			goto out;
>>>> +	}
>>>> +
>>>> +	sd_update_icpus(this, sds->idle_cpu);
>>> I wonder if we could further enhance it to facilitate the idle CPU scan.
>>> For example, could we propagate the idle CPUs in the smt domain to its
>>> parent domain in a hierarchical sequence, and finally to the LLC domain?
>>> If there is
>>
>> In fact, my first attempt was to cache the unoccupied cpus in the SMT
>> shared domain, but the overhead of the cpumask ops seemed like a major
>> stumbling block.
>>
>>> a cluster domain between the SMT and LLC domains, the cluster domain's
>>> idle CPU filter could benefit from this mechanism.
>>> https://lore.kernel.org/lkml/20220609120622.47724-3-yangyicong@hisilicon.com/
>>
>> Putting SIS into a hierarchical pattern is good for cache locality.
>> But I don't think a multi-level filter is appropriate, since it could
>> bring too much cache traffic to SIS,
> Could you please elaborate a little more on the cache traffic? I thought we
> wouldn't save the unoccupied cpus in the SMT shared domain, but would instead
> store them in a middle-layer shared domain, say, cluster->idle_cpus. This
> would reduce cache write contention compared to writing to llc->idle_cpus
> directly, because a smaller set of CPUs shares each idle_cpus filter.
> Similarly, SIS could scan cluster->idle_cpus first, without having to query
> llc->idle_cpus. This looks like splitting a big lock into fine-grained small
> locks.

I'm afraid I didn't quite follow... Did you mean replacing the LLC filter
with multiple cluster filters? If so, I agree with you that the contention
would be reduced. But there are other concerns:

   a. Is it appropriate to fake an intermediate sched_domain when
      the cluster level isn't available? And how do we identify the
      proper size of the faked sched_domain?

   b. The SIS path might touch more cachelines (multiple cluster
      filters). I'm not sure how much impact that would have.

Anyway, this seems worth a try. :)
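
To make sure we are picturing the same thing, here is a rough sketch of
the per-cluster variant as I understood it (hand-wavy pseudo-kernel C;
struct cluster_shared and sis_scan_clusters are made-up names, nothing
that exists in the tree):

	/*
	 * Hypothetical per-cluster filter: each cluster keeps its own
	 * idle mask, so writers in different clusters dirty different
	 * cachelines, and SIS scans the local cluster's mask first.
	 */
	struct cluster_shared {
		struct cpumask idle_cpus;	/* cpus believed unoccupied */
	};

	static int sis_scan_clusters(struct cluster_shared *local,
				     struct cluster_shared **peers, int nr)
	{
		int cpu, i;

		/* Fast path: only the local cluster's cacheline. */
		cpu = cpumask_any(&local->idle_cpus);
		if (cpu < nr_cpu_ids)
			return cpu;

		/*
		 * Slow path: this is concern (b) above -- one extra
		 * cacheline per sibling cluster we have to peek at.
		 */
		for (i = 0; i < nr; i++) {
			cpu = cpumask_any(&peers[i]->idle_cpus);
			if (cpu < nr_cpu_ids)
				return cpu;
		}

		return -1;
	}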

>> and it could be expected to be
>> a disaster for netperf/tbench or other workloads that suffer frequent
>> context switches.
>>
> So this overhead comes from the NEWLY_IDLE case?
> 

Yes, I think it's the main cause of raising the contention to new heights.
But it's also important to keep the filter fresh.
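
That is also why the hunk above takes the cmpxchg() guard: under a
newidle storm, at most one cpu per SMT domain wins the right to write
the shared state, and every other cpu bails out after touching a single
cacheline. Schematically (the release on the out path is not visible in
the hunk, so that part is my assumption):

	/*
	 * Try-lock via cmpxchg: the first caller atomically flips
	 * updating 0 -> 1 and proceeds; concurrent callers see a
	 * non-zero old value and return immediately.
	 */
	if (cmpxchg(&sd_smt_shared->updating, 0, 1))
		return;

	/* ... refresh the filter as in the hunk above ... */
out:
	/* Presumably cleared here so the next update can get in. */
	WRITE_ONCE(sd_smt_shared->updating, 0);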

Thanks & BR,
Abel
