linux-kernel - Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket

Date:   Wed, 28 Jun 2017 03:26:10 +0700
From:   Suravee Suthikulpanit <Suravee.Suthikulpanit@....com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org, leo.duran@....com,
        yazen.ghannam@....com, Peter Zijlstra <peterz@...radead.org>,
        "Lendacky, Thomas" <Thomas.Lendacky@....com>
Subject: Re: [PATCH 1/2] x86/CPU/AMD: Present package as die instead of socket

On 6/28/17 00:44, Borislav Petkov wrote:
> So let's try to discuss this without using DIE sched-domain, CCX, etc,
> and let's start simple.
>
> So in that die graphic:
>
>               ----------------------------
>           C0  | T0 T1 |    ||    | T0 T1 | C4
>               --------|    ||    |--------
>           C1  | T0 T1 | L3 || L3 | T0 T1 | C5
>               --------|    ||    |--------
>           C2  | T0 T1 | #0 || #1 | T0 T1 | C6
>               --------|    ||    |--------
>           C3  | T0 T1 |    ||    | T0 T1 | C7
>               ----------------------------
>
> you want all those threads to belong to a single scheduling group.
> Correct?

Actually, let's be a bit more specific here, since sched-group and sched-domain
mean different things:

(From: Documentation/scheduler/sched-domains.txt)
                      ---- begin snippet ----
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.

Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.
                       ---- end snippet ----

So, from the definition above, we would like all those 16 threads to be in the 
same sched-domain, where threads from C0,1,2,3 are in the same sched-group, and 
threads in C4,5,6,7 are in another sched-group.
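
To make the two MUST rules from the snippet concrete, here is a minimal Python
sketch (not kernel code). The function name and the CPU numbering are
hypothetical: it simply assumes the 16 threads of the die are numbered 0-15,
with the threads of C0-C3 as 0-7 and the threads of C4-C7 as 8-15.

```python
def is_valid_group_partition(span, groups):
    """Check the two MUST rules quoted above for one sched-domain.

    The groups must be pairwise disjoint, and their union must equal
    the domain's span.
    """
    seen = set()
    for group in groups:
        if seen & group:      # overlaps a group we have already seen
            return False
        seen |= group
    return seen == span


# Hypothetical numbering, for illustration only.
die_span = set(range(16))          # all 16 threads of the die
group_c0_c3 = set(range(0, 8))     # threads of C0..C3
group_c4_c7 = set(range(8, 16))    # threads of C4..C7

print(is_valid_group_partition(die_span, [group_c0_c3, group_c4_c7]))
```

With this layout the domain spans the whole die and the two groups split it
cleanly, which is the arrangement described above.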

> Now that thing has a memory controller attached to it, correct?

Yes

> If so, why is this thing not a logical NUMA node, as described in
> SRAT/SLIT?

Yes, this thing is a logical NUMA node and is represented correctly in the SRAT/SLIT.

> Now, SRAT should contain the assignment which core belongs to which
> node. Why is that not sufficient?

Yes, SRAT provides the cpu-to-node mapping, which is sufficient to tell the
scheduler which cpus are within a NUMA node.

However, look at the current sched-domains below, and notice that there is no
sched-domain with 16 threads to represent a NUMA node:

cpu0
domain0 00000000,00000001,00000000,00000001 (SMT)
domain1 00000000,0000000f,00000000,0000000f (MC)
domain2 00000000,ffffffff,00000000,ffffffff (NUMA)
domain3 ffffffff,ffffffff,ffffffff,ffffffff (NUMA)

sched-domain2 (which represents a sched-domain containing all cpus within a
socket) would have 8 sched-groups (based on the cpumasks from domain1).
According to the documentation snippet above regarding balancing within a
sched-domain, the scheduler will try to do (NUMA) load balancing between 8 groups
(spanning 4 NUMA nodes). Here, IINM, it would be more beneficial if the scheduler
tried to load balance between the two groups within the same NUMA node first,
before going across NUMA nodes, in order to minimize memory latency. This would
require another sched-domain between domain1 and domain2 that represents all 16
threads within a NUMA node (i.e. a die sched-domain). It would allow the scheduler
to load balance within the NUMA node first, before going across NUMA nodes.
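
The group and node counts above can be double-checked with a small Python
sketch. It assumes the masks are printed as comma-separated 32-bit hex words
with the most significant word first; parse_cpumask is a made-up helper name,
and the 16-threads-per-node figure is taken from the discussion, not derived
from the masks.

```python
def parse_cpumask(mask):
    """Turn a comma-separated hex cpumask into a set of CPU numbers.

    Assumes 32-bit words, most significant word first.
    """
    value = 0
    for word in mask.split(","):
        value = (value << 32) | int(word, 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


# Masks copied from the cpu0 output above.
domain0 = parse_cpumask("00000000,00000001,00000000,00000001")  # SMT
domain1 = parse_cpumask("00000000,0000000f,00000000,0000000f")  # MC
domain2 = parse_cpumask("00000000,ffffffff,00000000,ffffffff")  # NUMA

THREADS_PER_NUMA_NODE = 16  # from the discussion, not from the masks

# domain2's groups are built from domain1-sized cpumasks.
groups_in_domain2 = len(domain2) // len(domain1)
nodes_in_domain2 = len(domain2) // THREADS_PER_NUMA_NODE

print(len(domain1), len(domain2), groups_in_domain2, nodes_in_domain2)
```

Under these assumptions domain1 covers 8 threads and domain2 covers 64, so
domain2 ends up with 8 groups spread over 4 NUMA nodes, and no level in between
covers exactly 16 threads.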

However, since the current code decides that x86_has_numa_in_package is true, it
omits the die sched-domain. To avoid this, we are proposing to represent
cpuinfo_x86.phys_proc_id using the NUMA node ID (i.e. die ID). This is the main
point of the patch series.
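
As a rough illustration of the effect described above, here is a toy Python
model (not the kernel code). The function name topology_levels is hypothetical,
and the level names just reuse the labels from the cpu0 output plus the DIE
level under discussion.

```python
def topology_levels(numa_in_package):
    """Toy model: which sched-domain levels get built below the NUMA levels.

    When the package is treated as containing several NUMA nodes, the
    die level is left out, as described above.
    """
    levels = ["SMT", "MC"]
    if not numa_in_package:
        levels.append("DIE")
    return levels + ["NUMA"]


print(topology_levels(numa_in_package=True))
print(topology_levels(numa_in_package=False))
```

In this model, treating the die as the package keeps a DIE level between MC and
NUMA, which is the intermediate 16-thread domain we are after.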

Thanks,
Suravee