Date:   Wed, 8 Nov 2017 16:00:38 -0800
From:   Dave Hansen <dave.hansen@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, tony.luck@...el.com,
        tim.c.chen@...ux.intel.com, hpa@...ux.intel.com, bp@...en8.de,
        rientjes@...gle.com, imammedo@...hat.com, prarit@...hat.com,
        toshi.kani@...com, brice.goglin@...il.com, mingo@...nel.org
Subject: Re: [RFC][PATCH] x86, sched: allow topologies where NUMA nodes share
 an LLC

On 11/08/2017 01:31 AM, Peter Zijlstra wrote:
> And SNC makes it even smaller; it effectively puts a cache in between
> the two on-die nodes; not entirely unlike the s390 BOOK domain. Which
> makes ignoring NUMA even more tempting.
> 
> What does this topology approach do for those workloads?

What does this L3 topology do for workloads ignoring NUMA?

Let's just assume that an app is entirely NUMA unaware and that it is
accessing a large amount of memory entirely uniformly across the entire
system.  Let's also say that we just have a 2-socket system which now
shows up as having 4 NUMA nodes (one per slice, two slices per socket,
two sockets).  Let's also just say we have 20MB of L3 per socket, so
10MB per slice.

 - 1/4 of the memory accesses will be local to the slice and will have
   access to 10MB of L3.
 - 1/4 of the memory accesses will be to the *other* slice and will have
   access to 10MB of L3 (non-conflicting with the previous 10MB).  This
   access is marginally slower than the access to the local slice.
 - 1/2 of the memory accesses will be cross-socket and will have access
   to 20MB of L3 (both of the other socket's slices' L3s).

That's all OK.  Without this halved-L3 configuration (SNC here, or the
previous Cluster-on-Die), it looked like this:

 - 1/2 of the memory accesses will be local to the socket and have
   access to 20MB of L3.
 - 1/2 of the memory accesses will be cross-socket and will have access
   to the other socket's 20MB of L3.
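The arithmetic above can be checked with a small back-of-the-envelope
model.  The 20MB/10MB figures and the access fractions are from this
thread; the script itself is just an illustrative sketch, not kernel
code:

```python
# Model of uniform memory access on a 2-socket system with 20MB of L3
# per socket.  SNC splits each socket's L3 into two non-conflicting
# 10MB slices, doubling the NUMA node count.

def access_classes(snc):
    """Return (fraction of accesses, MB of L3 that class can use)."""
    if snc:
        return [
            (0.25, 10),  # local slice
            (0.25, 10),  # sibling slice on the same socket
            (0.50, 20),  # other socket (both of its slices)
        ]
    return [
        (0.50, 20),  # local socket, one shared 20MB L3
        (0.50, 20),  # other socket
    ]

def socket_local_l3_mb(snc):
    """L3 capacity available, in aggregate, to the socket-local half
    of the traffic."""
    classes = access_classes(snc)
    if snc:
        # The two 10MB slices do not conflict with each other, so the
        # socket-local half of the accesses still has 20MB available.
        return classes[0][1] + classes[1][1]
    return classes[0][1]

# Either way, a NUMA-unaware, uniformly-accessing app sees the same
# aggregate cache capacity; only slice-local vs. slice-remote hit
# latency differs.
assert socket_local_l3_mb(snc=True) == socket_local_l3_mb(snc=False) == 20
```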

I'd argue that those two end up looking pretty much the same to an app.
The only difference is that the slice-local and slice-remote cache hits
have slightly different access latencies.  I don't think it's enough to
notice.

The place where it is not optimal is where an app does NUMA-local
accesses, then sees that it has 20MB of L3 (via CPUID) and expects to
*get* 20MB of L3.
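That mismatch is easy to state numerically.  A hypothetical sketch, in
the same spirit as the model above: the 20MB figure and the two-slice
split come from this thread, and the 16MB working set is a made-up
example of an app that sized itself from CPUID:

```python
# Illustration of the suboptimal case: an app sizes its working set
# from the CPUID-reported L3, but with SNC its NUMA-local accesses
# only get the local slice's half of that capacity.

cpuid_l3_mb = 20                             # L3 size CPUID advertises
snc_slices = 2
local_slice_mb = cpuid_l3_mb // snc_slices   # 10MB actually local

working_set_mb = 16                          # made-up "fits in L3" sizing
fits_advertised_l3 = working_set_mb <= cpuid_l3_mb     # looks fine
fits_local_slice = working_set_mb <= local_slice_mb    # does not fit
```

The app's assumption holds against the advertised 20MB but fails
against the 10MB slice its NUMA-local accesses actually get, so a
working set tuned to CPUID can thrash the local slice.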
