Message-ID: <9661a38e-3475-25c9-1c65-10b42b3888d6@oracle.com>
Date:   Thu, 27 Jun 2019 18:02:59 -0700
From:   Subhra Mazumdar <subhra.mazumdar@...cle.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     linux-kernel@...r.kernel.org, peterz@...radead.org,
        mingo@...hat.com, steven.sistare@...cle.com,
        dhaval.giani@...cle.com, daniel.lezcano@...aro.org,
        vincent.guittot@...aro.org, viresh.kumar@...aro.org,
        tim.c.chen@...ux.intel.com, mgorman@...hsingularity.net
Subject: Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT
 siblings


On 6/26/19 11:51 PM, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, subhra mazumdar wrote:
>
>> Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
>> This will be used for quick lookup in select_idle_cpu to determine the
>> limits of search.
> Why? The number of siblings is constant at least today unless you play
> silly cpu hotplug games. A bit more justification for adding yet another
> random storage would be appreciated.
Calling cpumask_weight() every time in select_idle_cpu() to compute the
number of SMT siblings can be costly, as cpumask_weight() is not O(1) for
systems with a large number of CPUs (e.g. 8 sockets, each socket having
lots of cores). Beyond 512 CPUs the bitmask spans multiple cache lines,
and touching multiple cache lines in the scheduler fast path can cost
more than we save from this optimization. Even within a single cache line
it loops over longs. We want to touch O(1) cache lines and do O(1)
operations, hence the value is pre-computed in a per-CPU variable.
>
>> This patch does it only for x86.
> # grep 'This patch' Documentation/process/submitting-patches.rst
>
> IOW, we all know already that this is a patch and from the subject prefix
> and the diffstat it's pretty obvious that this is x86 only.
>
> So instead of documenting the obvious, please add proper context to justify
> the change.
OK. The extra per-CPU optimization was done only for x86 because we cared
about it the most and wanted to make it future proof. I will add it for
the other architectures.
>   
>> +/* representing number of HT siblings of each CPU */
>> +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
>> +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
> Why does this need an export? No module has any reason to access this.
I will remove it.
>
>>   /* representing HT and core siblings of each logical CPU */
>>   DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
>>   EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>> @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
>>   
>>   	if (!has_mp) {
>>   		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
>> +		per_cpu(cpumask_weight_sibling, cpu) =
>> +		    cpumask_weight(topology_sibling_cpumask(cpu));
>>   		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
>>   		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
>>   		c->booted_cores = 1;
>> @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
>>   	for_each_cpu(i, cpu_sibling_setup_mask) {
>>   		o = &cpu_data(i);
>>   
>> -		if ((i == cpu) || (has_smt && match_smt(c, o)))
>> +		if ((i == cpu) || (has_smt && match_smt(c, o))) {
>>   			link_mask(topology_sibling_cpumask, cpu, i);
>> +			threads = cpumask_weight(topology_sibling_cpumask(cpu));
>> +			per_cpu(cpumask_weight_sibling, cpu) = threads;
>> +			per_cpu(cpumask_weight_sibling, i) = threads;
> This only works for SMT=2, but fails to update the rest for SMT=4.

I had assumed that x86 will always be SMT2; I will fix this.

Thanks,
Subhra

>
>> @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
>>   
>>   	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
>>   		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
>> +		per_cpu(cpumask_weight_sibling, sibling) =
>> +		    cpumask_weight(topology_sibling_cpumask(sibling));
> While remove does the right thing.
>
> Thanks,
>
> 	tglx
