Message-ID: <ab9866af-f88b-91d8-0e48-423dbcb00ae9@amd.com>
Date: Fri, 11 Aug 2017 11:57:06 +0700
From: Suravee Suthikulpanit <Suravee.Suthikulpanit@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...hat.com, bp@...e.de
Subject: Re: [RFC PATCH] sched/topology: Introduce NUMA identity node sched domain

On 8/10/17 23:41, Peter Zijlstra wrote:
> On Thu, Aug 10, 2017 at 10:20:52AM -0500, Suravee Suthikulpanit wrote:
>> On an AMD Family17h-based (EPYC) system, a NUMA node can contain
>> up to 8 cores (16 threads) with the following topology.
>>
>>         ----------------------------
>>     C0  | T0 T1 |    ||    | T0 T1 |  C4
>>         --------|    ||    |--------
>>     C1  | T0 T1 | L3 || L3 | T0 T1 |  C5
>>         --------|    ||    |--------
>>     C2  | T0 T1 | #0 || #1 | T0 T1 |  C6
>>         --------|    ||    |--------
>>     C3  | T0 T1 |    ||    | T0 T1 |  C7
>>         ----------------------------
>>
>> Here, there are 2 last-level (L3) caches per NUMA node. A socket can
>> contain up to 4 NUMA nodes, and a system can support up to 2 sockets.
>> In a full system configuration, the current scheduler creates 4 sched
>> domains:
>>
>> domain0 SMT (spans a core)
>> domain1 MC (spans a last-level-cache)
>
> Right, so traditionally we'd have the DIE level do that, but because
> x86_has_numa_in_package we don't generate that, right?
That's correct.
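(For context, and only as a from-memory sketch that may not match the exact
source at the time: the generic topology table in kernel/sched/topology.c ends
with a DIE level, while arch/x86/kernel/smpboot.c switches to a table without
DIE when x86_has_numa_in_package is set, which is why no DIE domain shows up
in the hierarchy above.)

	/* Roughly, kernel/sched/topology.c: */
	static struct sched_domain_topology_level default_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};

	/* Roughly, arch/x86/kernel/smpboot.c, used when x86_has_numa_in_package: */
	static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
	#ifdef CONFIG_SCHED_SMT
		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
	#endif
	#ifdef CONFIG_SCHED_MC
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
	#endif
		{ NULL, },
	};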
>
>> domain2 NUMA (spans a socket: 4 nodes)
>> domain3 NUMA (spans a system: 8 nodes)
>>
>> Note that there is no domain to represent cpus spanning a NUMA node.
>> With this hierarchy of sched domains, the scheduler does not balance
>> properly in the following cases:
>>
>> Case1:
>> When running 8 tasks, a properly balanced system should
>> schedule a task per NUMA node. This is not the case for
>> the current scheduler.
>>
>> Case2:
>> When running 'taskset -c 0-7 <a_program_with_8_independent_threads>',
>> a properly balanced system should schedule 8 threads on 8 cpus
>> (e.g. T0 of C0-C7). However, the current scheduler would schedule
>> some threads on the same cpu, while other cpus sit idle.
>
> Sure.. could you amend with a few actual performance numbers?
Sure.
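(Illustrative only, not part of the patch: a minimal user-space reproducer for
Case2, assuming a Linux toolchain with pthreads and sched_getcpu(). Build with
`gcc -O2 -pthread spin8.c -o spin8` and run as `taskset -c 0-7 ./spin8`; the
file name and constants are just for the example.)

	/* spin8.c - spawn 8 CPU-bound threads and report where they ran. */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <stdint.h>
	#include <stdio.h>

	#define NR_THREADS	8

	static void *spin(void *arg)
	{
		volatile uint64_t sum = 0;
		uint64_t i;

		/* Burn CPU long enough for the load balancer to act. */
		for (i = 0; i < (1ULL << 31); i++)
			sum += i;

		printf("thread %ld finished on cpu %d\n",
		       (long)(intptr_t)arg, sched_getcpu());
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[NR_THREADS];
		long i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&tid[i], NULL, spin, (void *)(intptr_t)i);
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}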
>> [...]
>> @@ -1445,9 +1448,24 @@ void sched_init_numa(void)
>>  		tl[i] = sched_domain_topology[i];
>>
>>  	/*
>> +	 * Ignore the NUMA identity level if it has the same cpumask
>> +	 * as the previous level. This is the case for:
>> +	 * - Systems where the last-level-cache (MC) sched domain spans a NUMA node.
>> +	 * - Systems where the DIE sched domain spans a NUMA node.
>> +	 *
>> +	 * Assume all NUMA nodes are identical, so only check node 0.
>> +	 */
>> +	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0)))
>> +		tl[i++] = (struct sched_domain_topology_level){
>> +			.mask = sd_numa_mask,
>> +			.numa_level = 0,
>> +			SD_INIT_NAME(NUMA_IDEN)
>
> Shall we make that:
>
> SD_INIT_NAME(NODE),
>
> instead?
Sounds good.
>> +		};
>
> This misses a set of '{}'. While C doesn't require it, our coding style
> warrants blocks around any multi-line statement.
>
> So what you've forgotten to mention is that for those systems where the
> LLC == NODE, this now-superfluous level gets removed by the degenerate
> code. Have you verified that it does the right thing?
Let me check on that one and get back to you.
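(For reference, a sketch of how the hunk might look with both review comments
above folded in, i.e. SD_INIT_NAME(NODE) and braces around the multi-line
statement; untested, identifiers exactly as in the posted patch.)

	if (!cpumask_equal(sched_domains_numa_masks[0][0], tl[i-1].mask(0))) {
		tl[i++] = (struct sched_domain_topology_level){
			.mask = sd_numa_mask,
			.numa_level = 0,
			SD_INIT_NAME(NODE)
		};
	}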
Thanks,
Suravee