linux-kernel - Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

Message-ID: <3d658972-6a9f-4614-9532-d322bdd7c26b@efficios.com>
Date: Wed, 2 Oct 2024 08:45:06 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Marco Elver <elver@...gle.com>, Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, linux-kernel@...r.kernel.org,
 Valentin Schneider <vschneid@...hat.com>, Mel Gorman <mgorman@...e.de>,
 Steven Rostedt <rostedt@...dmis.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
 <bsegall@...gle.com>, Yury Norov <yury.norov@...il.com>,
 Rasmus Villemoes <linux@...musvillemoes.dk>,
 Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency
 IDs for intermittent workloads

On 2024-10-02 11:49, Marco Elver wrote:
> On Mon, 30 Sept 2024 at 21:01, Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
>>
>> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
>> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
>> a reference to the concurrency id allocated for each CPU. This reference
>> expires shortly after a 100ms delay.
>>
>> These per-CPU references keep the per-mm-cid data cache-local in
>> situations where threads are running at least once on each CPU within
>> each 100ms window, thus keeping the per-cpu reference alive.
>>
>> However, intermittent workloads behaving in bursts spaced by more than
>> 100ms on each CPU exhibit bad cache locality and degraded performance
>> compared to purely per-cpu data indexing, because concurrency IDs are
>> allocated over various CPUs and cores, therefore losing cache locality
>> of the associated data.
>>
>> Introduce the following changes to improve per-mm-cid cache locality:
>>
>> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
>>    track of which mm_cid value was last used, and use it as a hint to
>>    attempt re-allocating the same concurrency ID the next time this
>>    mm/cpu needs to allocate a concurrency ID,
>>
>> - Add a per-mm CPUs allowed mask, which keeps track of the union of
>>    CPUs allowed for all threads belonging to this mm. This cpumask is
>>    only set during the lifetime of the mm, never cleared, so it
>>    represents the union of all the CPUs allowed since the beginning of
>>    the mm lifetime. (note that the mm_cpumask() is really arch-specific
>>    and tailored to the TLB flush needs, and is thus _not_ a viable
>>    approach for this)
>>
>> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
>>    per-mm CPUs allowed mask (for fast access),
>>
>> - Add a per-mm nr_cids_used to keep track of the highest concurrency
>>    ID allocated for the mm. This is used for expanding the concurrency ID
>>    allocation within the upper bound defined by:
>>
>>      min(mm->nr_cpus_allowed, mm->mm_users)
>>
>>    When the next unused CID value reaches this threshold, stop trying
>>    to expand the cid allocation and use the first available cid value
>>    instead.
>>
>> Spreading allocation to use all the cid values within the range
>>
>>    [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
>>
>> improves cache locality while preserving mm_cid compactness within the
>> expected user limits.
>>
>> - In __mm_cid_try_get, only return cid values within the range
>>    [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
>>    prevents allocating cids above the number of allowed cpus in
>>    rare scenarios where cid allocation races with a concurrent
>>    remote-clear of the per-mm/cpu cid. This improvement is made
>>    possible by the addition of the per-mm CPUs allowed mask.
>>
>> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
>>    t->nr_cpus_allowed. This criterion was really meant to compare
>>    the number of mm->mm_users to the number of CPUs allowed for the
>>    entire mm. Therefore, the prior comparison worked fine when all
>>    threads shared the same CPUs allowed mask, but not so much in
>>    scenarios where those threads have different masks (e.g. each
>>    thread pinned to a single CPU). This improvement is made
>>    possible by the addition of the per-mm CPUs allowed mask.
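
The allocation order described in the list above (recent_cid hint first, then expansion below the min(nr_cpus_allowed, mm_users) bound, then first available cid) can be sketched as a small user-space model. This is only an illustration under stated assumptions: struct mm_model, try_take(), cid_alloc() and cid_free() are hypothetical names, a plain bool array stands in for the kernel's cid bitmap, and there is no atomicity or per-CPU state, so it is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CIDS 64	/* arbitrary capacity for this model only */

/* Hypothetical user-space model of the per-mm state; not the kernel layout. */
struct mm_model {
	bool cid_in_use[MAX_CIDS];	/* stand-in for the per-mm cid bitmap */
	int nr_cpus_allowed;		/* weight of the per-mm CPUs allowed mask */
	int mm_users;			/* number of users of this mm */
	int nr_cids_used;		/* next cid value that was never handed out */
};

/* Take a specific cid if it is in range and free. */
static bool try_take(struct mm_model *mm, int cid)
{
	if (cid < 0 || cid >= mm->nr_cpus_allowed || mm->cid_in_use[cid])
		return false;
	mm->cid_in_use[cid] = true;
	return true;
}

/* Allocate a cid, preferring recent_cid (-1 means "no hint"). */
static int cid_alloc(struct mm_model *mm, int recent_cid)
{
	int limit, cid;

	assert(mm->nr_cpus_allowed <= MAX_CIDS);
	limit = mm->nr_cpus_allowed < mm->mm_users ?
		mm->nr_cpus_allowed : mm->mm_users;

	/* 1. Re-use the cid this mm/cpu used last, if it is still free. */
	if (try_take(mm, recent_cid))
		return recent_cid;

	/* 2. Expand into never-used cids while below the upper bound. */
	while (mm->nr_cids_used < limit) {
		cid = mm->nr_cids_used++;
		if (try_take(mm, cid))
			return cid;
	}

	/* 3. Bound reached: fall back to the first available cid. */
	for (cid = 0; cid < mm->nr_cpus_allowed; cid++)
		if (try_take(mm, cid))
			return cid;
	return -1;	/* nothing free in this model */
}

/* Release a cid; the caller would remember it as its recent_cid. */
static void cid_free(struct mm_model *mm, int cid)
{
	mm->cid_in_use[cid] = false;
}
```

In this model, with nr_cpus_allowed = 4 and mm_users = 2, two allocations without a hint return 0 and 1. After cid 1 is freed, a caller passing recent_cid = 1 gets 1 back, which is the cache-locality property the hint is meant to provide.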
>>
>> * Benchmarks
>>
>> Each thread increments 16kB worth of 8-bit integers in bursts, with
>> a configurable delay between each thread's execution. Threads run
>> one after the other (no threads run concurrently). The order of
>> thread execution in the sequence is random. The thread execution
>> sequence begins again after all threads have executed. The 16kB areas
>> are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
>> (not cache-local), or cache-local mm_cid. Each thread is pinned to its
>> own core.
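
The per-thread work described above can be sketched roughly as follows. This is a simplified illustration, not the benchmark source: AREA_SIZE and burst_increment() are hypothetical names, the areas are assumed to be one contiguous buffer rather than an rseq_mempool allocation, and the index argument stands for whichever of cpu_id or mm_cid the configuration under test uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AREA_SIZE (16 * 1024)	/* 16kB per index, as in the description */

/*
 * Increment every 8-bit integer in the area selected by index.
 * The caller supplies the index (cpu_id or mm_cid in the benchmark);
 * obtaining it through rseq is outside the scope of this sketch.
 */
static void burst_increment(uint8_t *areas, int index)
{
	uint8_t *area;
	size_t i;

	assert(index >= 0);
	area = areas + (size_t)index * AREA_SIZE;
	for (i = 0; i < AREA_SIZE; i++)
		area[i]++;
}
```

The point of the comparison is which area a thread lands on after a long idle period: with a stable index, the same 16kB area stays associated with the same core and its caches.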
>>
>> Testing configurations:
>>
>> 8-core/1-L3:        Use 8 cores within a single L3
>> 24-core/24-L3:      Use 24 cores, 1 core per L3
>> 192-core/24-L3:     Use 192 cores (all cores in the system)
>> 384-thread/24-L3:   Use 384 HW threads (all HW threads in the system)
>>
>> Intermittent workload delays between threads: 200ms, 10ms.
>>
>> Hardware:
>>
>> CPU(s):                   384
>>    On-line CPU(s) list:    0-383
>> Vendor ID:                AuthenticAMD
>>    Model name:             AMD EPYC 9654 96-Core Processor
>>      Thread(s) per core:   2
>>      Core(s) per socket:   96
>>      Socket(s):            2
>> Caches (sum of all):
>>    L1d:                    6 MiB (192 instances)
>>    L1i:                    6 MiB (192 instances)
>>    L2:                     192 MiB (192 instances)
>>    L3:                     768 MiB (24 instances)
>>
>> Each result is an average of 5 test runs. The cache-local speedup
>> is calculated as: (mm_cid) / (cache-local mm_cid).
>>
>> Intermittent workload delay: 200ms
>>
>>                       per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
>>                           (ns)      (ns)                  (ns)
>> 8-core/1-L3             1374      19289                  1336            14.4x
>> 24-core/24-L3           2423      26721                  1594            16.7x
>> 192-core/24-L3          2291      15826                  2153             7.3x
>> 384-thread/24-L3        1874      13234                  1907             6.9x
>>
>> Intermittent workload delay: 10ms
>>
>>                       per-cpu     mm_cid    cache-local mm_cid    cache-local speedup
>>                           (ns)      (ns)                  (ns)
>> 8-core/1-L3               662       756                   686             1.1x
>> 24-core/24-L3            1378      3648                  1035             3.5x
>> 192-core/24-L3           1439     10833                  1482             7.3x
>> 384-thread/24-L3         1503     10570                  1556             6.8x
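
As a check on the tables above, the speedup column matches the mm_cid time divided by the cache-local mm_cid time, e.g. 19289 / 1336 is roughly 14.4 for the 8-core/1-L3 row at 200ms. A minimal helper, with cache_local_speedup() being a hypothetical name used only for this illustration:

```c
#include <assert.h>

/* Ratio of the mm_cid time to the cache-local mm_cid time (both in ns). */
static double cache_local_speedup(double mm_cid_ns, double cache_local_ns)
{
	assert(cache_local_ns > 0.0);
	return mm_cid_ns / cache_local_ns;
}
```

A value above 1 means the cache-local variant is faster than the plain mm_cid indexing for that configuration.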
>>
>> [ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
>>    patch series with a simpler and more general approach. ]
>>
>> Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
>> Acked-by: Marco Elver <elver@...gle.com>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Ingo Molnar <mingo@...hat.com>
>> Cc: Valentin Schneider <vschneid@...hat.com>
>> Cc: Mel Gorman <mgorman@...e.de>
>> Cc: Steven Rostedt <rostedt@...dmis.org>
>> Cc: Vincent Guittot <vincent.guittot@...aro.org>
>> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
>> Cc: Ben Segall <bsegall@...gle.com>
>> Cc: Dmitry Vyukov <dvyukov@...gle.com>
>> Cc: Marco Elver <elver@...gle.com>
>> Cc: Yury Norov <yury.norov@...il.com>
>> Cc: Rasmus Villemoes <linux@...musvillemoes.dk>
>> ---
>> Changes since v0:
>> - On migration, do not move the source cid to the destination cpu if the
>>    destination cpu has a recent cid value set.
>>
>> Changes since v2:
>> - Rebase on v6.11.1.
> 
> I think the versioning and changelog got confused. I see the changes
> from [1] which was already v2 are included in this one.
> 
> [1] https://lore.kernel.org/all/5cf2c0a5-7a99-4294-b316-eee07896ddf6@efficios.com/T/#u

Which means I should have tagged this series [PATCH v3]. Sorry about
that.

> 
> In any case, I'll reiterate my Ack as this looks like an improvement
> for the common case.
> 
> Acked-by: Marco Elver <elver@...gle.com>

Thanks!

Peter, should I re-send as is with a v3 tag, or is it OK to merge as posted?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com