Message-ID: <CANpmjNMjndyBAO3HKHkC+v7zNZv1XHvH5Fjd9S5q0Jj-sEkx-w@mail.gmail.com>
Date: Thu, 12 Sep 2024 18:38:30 +0200
From: Marco Elver <elver@...gle.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org, 
	Valentin Schneider <vschneid@...hat.com>, Mel Gorman <mgorman@...e.de>, 
	Steven Rostedt <rostedt@...dmis.org>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>, 
	Yury Norov <yury.norov@...il.com>, Rasmus Villemoes <linux@...musvillemoes.dk>, 
	Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency
 IDs for intermittent workloads

On Mon, 9 Sept 2024 at 23:15, Mathieu Desnoyers
<mathieu.desnoyers@...icios.com> wrote:
>
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
>
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.
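>
> For reference, the per-mm/cpu state is roughly the following
> (simplified sketch; "recent_cid" is the field added by this patch,
> the other fields already exist in include/linux/mm_types.h):
>
>     struct mm_cid {
>             u64 time;       /* rq clock of last use, for expiry */
>             int cid;        /* currently held concurrency ID */
>             int recent_cid; /* hint: last cid used by this mm here */
>     };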

One orthogonal idea that I recall: if a thread from a different thread
group (i.e. another process) was scheduled on that CPU, the CID could
also be invalidated, because the caches have likely been polluted. A
fixed value like 100ms seems rather arbitrary: it may work for one
system but not for another.
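
A rough sketch of what I mean (hypothetical hook and helper name,
untested):

    /* Somewhere in the context-switch path: */
    if (prev->mm && next->mm && prev->mm != next->mm) {
            /*
             * Another process ran on this CPU, so its working set
             * likely evicted ours; holding on to the old cid no
             * longer buys cache locality. mm_cid_mark_stale() is a
             * made-up name for releasing the per-cpu cid.
             */
            mm_cid_mark_stale(prev->mm, cpu);
    }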

> However, intermittent workloads behaving in bursts spaced by more than
> 100ms on each CPU exhibit bad cache locality and degraded performance
> compared to purely per-cpu data indexing, because concurrency IDs are
> allocated over various CPUs and cores, therefore losing cache locality
> of the associated data.
>
> Introduce the following changes to improve per-mm-cid cache locality:
>
> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
>   track of which mm_cid value was last used, and use it as a hint to
>   attempt re-allocating the same concurrency ID the next time this
>   mm/cpu needs to allocate a concurrency ID,
>
> - Add a per-mm CPUs allowed mask, which keeps track of the union of
>   CPUs allowed for all threads belonging to this mm. This cpumask is
>   only set during the lifetime of the mm, never cleared, so it
>   represents the union of all the CPUs allowed since the beginning of
>   the mm lifetime. (note that the mm_cpumask() is really arch-specific
>   and tailored to the TLB flush needs, and is thus _not_ a viable
>   approach for this)
>
> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
>   per-mm CPUs allowed mask (for fast access),
>
> - Add a per-mm nr_cids_used to keep track of the highest concurrency
>   ID allocated for the mm. This is used for expanding the concurrency ID
>   allocation within the upper bound defined by:
>
>     min(mm->nr_cpus_allowed, mm->mm_users)
>
>   When the next unused CID value reaches this threshold, stop trying
>   to expand the cid allocation and use the first available cid value
>   instead.
>
> Spreading allocation to use all the cid values within the range
>
>   [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
>
> improves cache locality while preserving mm_cid compactness within the
> expected user limits.
>
> - In __mm_cid_try_get, only return cid values within the range
>   [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
>   prevents allocating cids above the number of allowed cpus in
>   rare scenarios where cid allocation races with a concurrent
>   remote-clear of the per-mm/cpu cid. This improvement is made
>   possible by the addition of the per-mm CPUs allowed mask.
>
> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
>   t->nr_cpus_allowed. This criterion was really meant to compare
>   the number of mm->mm_users to the number of CPUs allowed for the
>   entire mm. Therefore, the prior comparison worked fine when all
>   threads shared the same CPUs allowed mask, but not so much in
>   scenarios where those threads have different masks (e.g. each
>   thread pinned to a single CPU). This improvement is made
>   possible by the addition of the per-mm CPUs allowed mask.
>
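> In pseudocode, the resulting allocation policy is roughly the
> following (a simplified paraphrase, not the exact kernel code):
>
>     max_cids = min(mm->nr_cpus_allowed, mm->mm_users);
>     if (recent_cid is set and that cid is still unused)
>             return recent_cid;              /* cache-local fast path */
>     if (first unused cid < max_cids)
>             return first unused cid;        /* expand within the bound */
>     return first available cid;             /* compact fallback */
>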
> * Benchmarks
>
> Each thread increments 16kB worth of 8-bit integers in bursts, with
> a configurable delay between each thread's execution. The threads run
> one after the other (no two threads run concurrently). The order of
> thread execution in the sequence is random. The thread execution
> sequence begins again after all threads have executed. The 16kB areas
> are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
> (not cache-local), or cache-local mm_cid. Each thread is pinned to its
> own core.
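>
> The per-thread hot loop amounts to something like this (sketch only;
> the actual benchmark allocates the areas with rseq_mempool):
>
>     #define AREA_SIZE (16 * 1024)
>
>     static void touch_area(uint8_t *area)
>     {
>             size_t i;
>
>             for (i = 0; i < AREA_SIZE; i++)
>                     area[i]++;  /* increment 16kB of 8-bit integers */
>     }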
>
> Testing configurations:
>
> 8-core/1-L3:        Use 8 cores within a single L3
> 24-core/24-L3:      Use 24 cores, 1 core per L3
> 192-core/24-L3:     Use 192 cores (all cores in the system)
> 384-thread/24-L3:   Use 384 HW threads (all HW threads in the system)
>
> Intermittent workload delays between threads: 200ms, 10ms.
>
> Hardware:
>
> CPU(s):                   384
>   On-line CPU(s) list:    0-383
> Vendor ID:                AuthenticAMD
>   Model name:             AMD EPYC 9654 96-Core Processor
>     Thread(s) per core:   2
>     Core(s) per socket:   96
>     Socket(s):            2
> Caches (sum of all):
>   L1d:                    6 MiB (192 instances)
>   L1i:                    6 MiB (192 instances)
>   L2:                     192 MiB (192 instances)
>   L3:                     768 MiB (24 instances)
>
> Each result is an average of 5 test runs. The cache-local speedup
> is calculated as: (mm_cid) / (cache-local mm_cid).
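>
> For example, for 8-core/1-L3 with the 200ms delay below:
> 19289 ns / 1336 ns ~= 14.4x.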
>
> Intermittent workload delay: 200ms
>
>                      per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
>                          (ns)      (ns)                  (ns)
> 8-core/1-L3             1374      19289                  1336            14.4x
> 24-core/24-L3           2423      26721                  1594            16.7x
> 192-core/24-L3          2291      15826                  2153             7.3x
> 384-thread/24-L3        1874      13234                  1907             6.9x
>
> Intermittent workload delay: 10ms
>
>                      per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
>                          (ns)      (ns)                  (ns)
> 8-core/1-L3               662       756                   686             1.1x
> 24-core/24-L3            1378      3648                  1035             3.5x
> 192-core/24-L3           1439     10833                  1482             7.3x
> 384-thread/24-L3         1503     10570                  1556             6.8x
>
> [ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
>   patch series with a simpler and more general approach. ]

I like the simpler and more general approach vs. the NUMA-only
approach! Attempting to reallocate the previously assigned CID seems
to go a long way.

However, this doesn't quite do L3-awareness as mentioned in [1], right?
From what I can tell, this patch improves cache locality for threads
scheduled back on the _same CPU_, but not if those threads are
scheduled on a _set of CPUs_ sharing the _same L3_ cache. So if e.g. a
thread is scheduled from CPU2 to CPU3, but those 2 CPUs share the same
L3 cache, that thread will get a completely new CID and is unlikely to
hit in the L3 cache when accessing the per-CPU data.
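
One way to get there might be to also accept a recent_cid recorded by
an LLC sibling when the current CPU has none, e.g. (untested sketch;
cpus_share_cache() exists today, while mm_cid_on_cpu(), cid_is_free()
and the allowed-mask accessor are made-up names):

    /* No recent_cid on @cpu: scan CPUs sharing the last-level cache. */
    for_each_cpu(sibling, mm_cpus_allowed(mm)) {
            if (sibling == cpu || !cpus_share_cache(cpu, sibling))
                    continue;
            cid = mm_cid_on_cpu(mm, sibling)->recent_cid;
            if (cid >= 0 && cid_is_free(mm, cid))
                    return cid; /* likely still warm in the shared L3 */
    }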

[1] https://github.com/google/tcmalloc/issues/144#issuecomment-2307739715

Maybe I missed it, or are you planning to add it in the future?

In any case, the current patch is definitely an improvement:

Acked-by: Marco Elver <elver@...gle.com>

Thanks,
-- Marco
