linux-kernel - Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads

Message-ID: <0edc398e-d193-4c2d-907e-f5db93143f79@efficios.com>
Date: Thu, 12 Sep 2024 07:33:58 +0200
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Marco Elver <elver@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 linux-kernel@...r.kernel.org, Valentin Schneider <vschneid@...hat.com>,
 Mel Gorman <mgorman@...e.de>, Steven Rostedt <rostedt@...dmis.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
 <bsegall@...gle.com>, Yury Norov <yury.norov@...il.com>,
 Rasmus Villemoes <linux@...musvillemoes.dk>,
 Dmitry Vyukov <dvyukov@...gle.com>
Subject: Re: [PATCH v2 1/1] sched: Improve cache locality of RSEQ concurrency
 IDs for intermittent workloads

On 2024-09-12 12:38, Marco Elver wrote:
> On Mon, 9 Sept 2024 at 23:15, Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
>>
>> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
>> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
>> a reference to the concurrency id allocated for each CPU. This reference
>> expires shortly after a 100ms delay.
>>
>> These per-CPU references keep the per-mm-cid data cache-local in
>> situations where threads are running at least once on each CPU within
>> each 100ms window, thus keeping the per-cpu reference alive.
> 
> One orthogonal idea that I recall: If a thread from a different thread
> group (i.e. another process) was scheduled on that CPU, the CID can
> also be invalidated because the caches are likely polluted. Fixed
> values like 100ms seem rather arbitrary and it may work for one system
> but not another.

That depends on the cache usage pattern of the different thread group:
it's also possible that the other thread group does not perform that
many stores to memory before the original thread group is scheduled
back, thus keeping the cache content untouched.

The ideal metric there would probably be based on PMU counters, but
I doubt we want to go there.

[...]
> 
> I like the simpler and more general approach vs. the NUMA-only
> approach! Attempting to reallocate the previously assigned CID seems
> to go a long way.

Indeed it does!

> 
> However, this doesn't quite do L3-awareness as mentioned in [1], right?
> What I can tell is that this patch improves cache locality for threads
> scheduled back on the _same CPU_, but not if those threads are
> scheduled on a _set of CPUs_ sharing the _same L3_ cache. So if e.g. a
> thread is scheduled from CPU2 to CPU3, but those 2 CPUs share the same
> L3 cache, that thread will get a completely new CID and is unlikely to
> hit in the L3 cache when accessing the per-CPU data.
> 
> [1] https://github.com/google/tcmalloc/issues/144#issuecomment-2307739715
> 
> Maybe I missed it, or you are planning to add it in future?

In my benchmarks, I noticed that preserving cache-locality at the L1 and
L2 levels was important as well.

I would like to understand better the use-case you refer to for L3
locality. AFAIU, this implies a scenario where the scheduler migrates
a thread from CPU 2 to CPU 3 (both with the same L3), and you would
like to migrate the concurrency ID along.

When the number of threads is < number of mm allowed cpus, the
migrate hooks steal the concurrency ID from CPU 2 and move it to
CPU 3 if there is only a single thread from that mm on CPU 2, which
does what you wish.

When the number of threads is >= number of mm allowed cpus, the
migrate hook is skipped, and the concurrency ID from CPU 2 is
left in place, favoring cache locality at L1/L2 levels. In that
case it's the scheduler's decision to migrate the thread from
CPU 2 to CPU 3, so I would think improving the scheduler decisions
about migration and minimizing thread movement would be more
relevant than trying to optimize concurrency ID movement.
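
To make the two cases above concrete, here is a minimal userspace sketch
of the decision as I describe it. It is a toy model, not the kernel code:
struct toy_mm, toy_migrate_cid(), TOY_NR_CPUS and TOY_CID_UNSET are
hypothetical names made up for illustration, and the sketch assumes the
destination CPU holds no concurrency ID for that mm yet.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS	4
#define TOY_CID_UNSET	(-1)

/* Hypothetical, simplified per-mm state; not the kernel's mm_struct. */
struct toy_mm {
	int nr_threads;			/* threads sharing this mm */
	int nr_cpus_allowed;		/* CPUs this mm is allowed to run on */
	int cid[TOY_NR_CPUS];		/* per-CPU concurrency ID reference */
	int running[TOY_NR_CPUS];	/* threads of this mm on each CPU */
};

/*
 * Model of a thread migrating from CPU src to CPU dst. "running[src]"
 * is the count before the thread leaves src. Returns true if the
 * concurrency ID moved along with the thread.
 */
bool toy_migrate_cid(struct toy_mm *mm, int src, int dst)
{
	assert(src >= 0 && src < TOY_NR_CPUS);
	assert(dst >= 0 && dst < TOY_NR_CPUS);
	assert(src != dst);

	/* threads >= allowed cpus: hook skipped, ID stays on src (L1/L2). */
	if (mm->nr_threads >= mm->nr_cpus_allowed)
		return false;

	/* Only steal the ID if the migrating thread is alone on src. */
	if (mm->running[src] != 1 || mm->cid[src] == TOY_CID_UNSET)
		return false;

	/* Assumption: dst holds no ID for this mm, so a plain move is enough. */
	mm->cid[dst] = mm->cid[src];
	mm->cid[src] = TOY_CID_UNSET;
	return true;
}
```

With 2 threads on 4 allowed CPUs and a single thread of that mm on CPU 2,
calling toy_migrate_cid(&mm, 2, 3) moves the ID to CPU 3. With 4 threads
on 4 allowed CPUs the same call leaves the ID on CPU 2.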

But I may not be fully understanding your use-case.

> 
> In any case, the current patch is definitely an improvement:
> 
> Acked-by: Marco Elver <elver@...gle.com>

Thanks a lot for your feedback!

Mathieu

> 
> Thanks,
> -- Marco

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com