Message-ID: <ab9866af-f88b-91d8-0e48-423dbcb00ae9@amd.com>
Date: Fri, 11 Aug 2017 11:57:06 +0700
From: Suravee Suthikulpanit <Suravee.Suthikulpanit@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, bp@...e.de
Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On 8/10/17 23:41, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
>> On an AMD Family17h-based (EPYC) system, a NUMA node can contain
>> up to 8 cores (16 threads) with the following topology.
>>
>>         ----------------------------
>>     C0  | T0 T1 |    ||    | T0 T1 |  C4
>>         --------|    ||    |--------
>>     C1  | T0 T1 | L3 || L3 | T0 T1 |  C5
>>         --------|    ||    |--------
>>     C2  | T0 T1 | #0 || #1 | T0 T1 |  C6
>>         --------|    ||    |--------
>>     C3  | T0 T1 |    ||    | T0 T1 |  C7
>>         ----------------------------
>>
>> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
>> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
>> In a full system configuration, the current scheduler creates 4 sched
>> domains:
>>
>> domain0 SMT (spans a core)
>> domain1 MC (spans a last-level-cache)
>
> Right, so traditionally we'd have the DIE level do that, but because
> x86_has_numa_in_package we don't generate that, right?
That's correct.
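(For context, and only as a from-memory sketch that may not match the exact
source at the time: the generic topology table in kernel/sched/topology.c ends
with a DIE level, while arch/x86/kernel/smpboot.c switches to a table without
DIE when x86_has_numa_in_package is set, which is why no DIE domain shows up
in the hierarchy above.)

	/* Roughly, kernel/sched/topology.c: */
	static struct sched_domain_topology_level default_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};

	/* Roughly, arch/x86/kernel/smpboot.c, used when x86_has_numa_in_package: */
	static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ NULL, },
	};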
>
>> domain2 NUMA (spans a socket: 4 nodes)
>> domain3 NUMA (spans a system: 8 nodes)
>>
>> Note that there is no domain to represent cpus spanning a NUMA node.
>> With this hierarchy of sched domains, the scheduler does not balance
>> properly in the following cases:
>>
>> Case1:
>> When running 8 tasks, a properly balanced system should
>> schedule a task per NUMA node. This is not the case for
>> the current scheduler.
>>
>> Case2:
>> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
>> a properly balanced system should schedule 8 threads on 8 cpus
>> (e.g. T0 of C0-C7). However, the current scheduler would schedule
>> some threads on the same cpu, while other cpus sit idle.
>
> Sure.. could you amend with a few actual performance numbers?
Sure.
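(Illustrative only, not part of the patch: a minimal user-space reproducer for
Case2, assuming a Linux toolchain with pthreads and sched_getcpu(). Build with
`gcc -O2 -pthread spin8.c -o spin8` and run as `taskset -c 0-7 ./spin8`; the
file name and constants are just for the example.)

	/* spin8.c - spawn 8 CPU-bound threads and report where they ran. */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <stdint.h>
	#include <stdio.h>

	#define NR_THREADS	8

	static void *spin(void *arg)
	{
		volatile uint64_t sum = 0;
		uint64_t i;

		/* Burn CPU long enough for the load balancer to act. */
		for (i = 0; i < (1ULL << 31); i++)
			sum += i;

		printf("thread %ld finished on cpu %d\n",
		       (long)(intptr_t)arg, sched_getcpu());
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NR_THREADS];
		long i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&tid[i], NULL, spin, (void *)(intptr_t)i);
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}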
>> [...]
>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>  		tl[i] = sched_domain_topology[i];
>>
>>  	/*
>> +	 * Ignore the NUMA identity level if it has the same cpumask
>> +	 * as the previous level. This is the case for:
>> +	 * - Systems where the last-level-cache (MC) sched domain spans a NUMA node.
>> +	 * - Systems where the DIE sched domain spans a NUMA node.
>> +	 *
>> +	 * Assume all NUMA nodes are identical, so only check node 0.
>> +	 */
>> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>> +		tl[i++] = (struct sched_domain_topology_level){
>> +			.mask = sd_numa_mask,
>> +			.numa_level = 0,
>> +			SD_INIT_NAME(NUMA_IDEN)
>
> Shall we make that:
>
> SD_INIT_NAME(NODE),
>
> instead?
Sounds good.
>> +		};
>
> This misses a set of '{}'. While C doesn't require it, our coding style
> warrants blocks around any multi-line statement.
>
> So what you've forgotten to mention is that for those systems where the
> LLC == NODE, this now-superfluous level gets removed by the degenerate
> code. Have you verified that it does the right thing?
Let me check on that one and get back to you.
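(For reference, a sketch of how the hunk might look with both review comments
above folded in, i.e. SD_INIT_NAME(NODE) and braces around the multi-line
statement; untested, identifiers exactly as in the posted patch.)

	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
		tl[i++] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.numa_level = 0,
			SD_INIT_NAME(NODE)
		};
	}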
Thanks,
Suravee