Message-ID: <ZtdqhmKmbVsCSAkJ@yury-ThinkPad>
Date: Tue, 3 Sep 2024 12:59:02 -0700
From: Yury Norov <yury.norov@...il.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
linux-kernel@...r.kernel.org,
Valentin Schneider <vschneid@...hat.com>,
Mel Gorman <mgorman@...e.de>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Ben Segall <bsegall@...gle.com>,
Rasmus Villemoes <linux@...musvillemoes.dk>,
Dmitry Vyukov <dvyukov@...gle.com>, Marco Elver <elver@...gle.com>
Subject: Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ
concurrency IDs for intermittent workloads
On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
> commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
> introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
> a reference to the concurrency id allocated for each CPU. This reference
> expires shortly after a 100ms delay.
>
> These per-CPU references keep the per-mm-cid data cache-local in
> situations where threads are running at least once on each CPU within
> each 100ms window, thus keeping the per-cpu reference alive.
>
> However, intermittent workloads behaving in bursts spaced by more than
> 100ms on each CPU exhibit bad cache locality and degraded performance
> compared to purely per-cpu data indexing, because concurrency IDs are
> allocated over various CPUs and cores, therefore losing cache locality
> of the associated data.
>
> Introduce the following changes to improve per-mm-cid cache locality:
>
> - Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
> track of which mm_cid value was last used, and use it as a hint to
> attempt re-allocating the same concurrency ID the next time this
> mm/cpu needs to allocate a concurrency ID,
>
> - Add a per-mm CPUs allowed mask, which keeps track of the union of
> CPUs allowed for all threads belonging to this mm. This cpumask is
> only set during the lifetime of the mm, never cleared, so it
> represents the union of all the CPUs allowed since the beginning of
> the mm lifetime. (Note that mm_cpumask() is arch-specific and
> tailored to TLB-flush needs, and is thus _not_ a viable
> approach for this)
>
> - Add a per-mm nr_cpus_allowed to keep track of the weight of the
> per-mm CPUs allowed mask (for fast access),
>
> - Add a per-mm nr_cids_used to keep track of the highest concurrency
> ID allocated for the mm. This is used for expanding the concurrency ID
> allocation within the upper bound defined by:
>
> min(mm->nr_cpus_allowed, mm->mm_users)
>
> When the next unused CID value reaches this threshold, stop trying
> to expand the cid allocation and use the first available cid value
> instead.
>
> Spreading allocation to use all the cid values within the range
>
> [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
>
> improves cache locality while preserving mm_cid compactness within the
> expected user limits.
>
> - In __mm_cid_try_get, only return cid values within the range
> [ 0, mm->nr_cpus_allowed - 1 ] rather than [ 0, nr_cpu_ids - 1 ]. This
> prevents allocating cids above the number of allowed cpus in
> rare scenarios where cid allocation races with a concurrent
> remote-clear of the per-mm/cpu cid. This improvement is made
> possible by the addition of the per-mm CPUs allowed mask.
>
> - In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
> t->nr_cpus_allowed. This criterion was really meant to compare
> the number of mm->mm_users to the number of CPUs allowed for the
> entire mm. Therefore, the prior comparison worked fine when all
> threads shared the same CPUs allowed mask, but not so much in
> scenarios where those threads have different masks (e.g. each
> thread pinned to a single CPU). This improvement is made
> possible by the addition of the per-mm CPUs allowed mask.
>
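To check my understanding of the allocation policy described above,
here is a simplified paraphrase in C. This is not the patch code: it
ignores the lazy-put and cid_lock fallback paths, and the bound is
taken straight from the changelog.

	/*
	 * Sketch of the cid allocation order: 1) retry this cpu's
	 * recent cid, 2) expand up to min(nr_cpus_allowed, mm_users),
	 * 3) take the first free cid.
	 */
	static int cid_alloc_sketch(struct mm_struct *mm, int recent_cid)
	{
		struct cpumask *cidmask = mm_cidmask(mm);
		int max_cids = min(atomic_read(&mm->nr_cpus_allowed),
				   atomic_read(&mm->mm_users));
		int cid;

		/* Reuse keeps the cid-indexed data cache-hot on this cpu. */
		if (recent_cid != MM_CID_UNSET &&
		    !cpumask_test_and_set_cpu(recent_cid, cidmask))
			return recent_cid;

		/* Spread into never-used cids while below the bound. */
		for (cid = atomic_read(&mm->nr_cids_used); cid < max_cids; ) {
			if (!atomic_try_cmpxchg(&mm->nr_cids_used, &cid, cid + 1))
				continue;
			if (!cpumask_test_and_set_cpu(cid, cidmask))
				return cid;
		}

		/* Past the bound, stay compact: first free cid wins. */
		cid = cpumask_first_zero(cidmask);
		if (cid < atomic_read(&mm->nr_cpus_allowed) &&
		    !cpumask_test_and_set_cpu(cid, cidmask))
			return cid;
		return -1;	/* caller retries */
	}
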
> * Benchmarks
>
> Each thread increments 16kB worth of 8-bit integers in bursts, with
> a configurable delay between each thread's execution. The threads run
> one after the other (no two threads run concurrently). The order of
> thread execution in the sequence is random. The thread execution
> sequence begins again after all threads have executed. The 16kB areas
> are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
> (not cache-local), or cache-local mm_cid. Each thread is pinned to its
> own core.
>
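For reference, my mental model of the measured inner loop, in
simplified C. get_index() and the pool layout are my assumptions, not
the actual harness; the real allocation goes through rseq_mempool as
described above.

	#define AREA_SIZE	16384	/* 16kB worth of 8-bit integers */

	/* One burst of a single thread; no other thread runs meanwhile. */
	static void run_burst(unsigned char *pool)
	{
		/* get_index() returns cpu_id, mm_cid, or cache-local mm_cid. */
		unsigned char *area = pool + (size_t)get_index() * AREA_SIZE;

		for (size_t i = 0; i < AREA_SIZE; i++)
			area[i]++;
	}
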
> Testing configurations:
>
> 8-core/1-L3: Use 8 cores within a single L3
> 24-core/24-L3: Use 24 cores, 1 core per L3
> 192-core/24-L3: Use 192 cores (all cores in the system)
> 384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
>
> Intermittent workload delays between threads: 200ms, 10ms.
>
> Hardware:
>
> CPU(s): 384
> On-line CPU(s) list: 0-383
> Vendor ID: AuthenticAMD
> Model name: AMD EPYC 9654 96-Core Processor
> Thread(s) per core: 2
> Core(s) per socket: 96
> Socket(s): 2
> Caches (sum of all):
>   L1d: 6 MiB (192 instances)
>   L1i: 6 MiB (192 instances)
>   L2:  192 MiB (192 instances)
>   L3:  768 MiB (24 instances)
>
> Each result is an average of 5 test runs. The cache-local speedup
> is calculated as: (mm_cid) / (cache-local mm_cid).
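> For example, the first row of the 200ms table gives
> 19289 ns / 1336 ns ~= 14.4x.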
>
> Intermittent workload delay: 200ms
>
>                    per-cpu    mm_cid   cache-local mm_cid   cache-local speedup
>                       (ns)      (ns)                 (ns)
> 8-core/1-L3           1374     19289                 1336                 14.4x
> 24-core/24-L3         2423     26721                 1594                 16.7x
> 192-core/24-L3        2291     15826                 2153                  7.3x
> 384-thread/24-L3      1874     13234                 1907                  6.9x
>
> Intermittent workload delay: 10ms
>
>                    per-cpu    mm_cid   cache-local mm_cid   cache-local speedup
>                       (ns)      (ns)                 (ns)
> 8-core/1-L3            662       756                  686                  1.1x
> 24-core/24-L3         1378      3648                 1035                  3.5x
> 192-core/24-L3        1439     10833                 1482                  7.3x
> 384-thread/24-L3      1503     10570                 1556                  6.8x
>
> [ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
> patch series with a simpler and more general approach. ]
>
> Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Valentin Schneider <vschneid@...hat.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Steven Rostedt <rostedt@...dmis.org>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@....com>
> Cc: Ben Segall <bsegall@...gle.com>
> Cc: Dmitry Vyukov <dvyukov@...gle.com>
> Cc: Marco Elver <elver@...gle.com>
> ---
> fs/exec.c | 2 +-
> include/linux/mm_types.h | 66 ++++++++++++++++++++++++++++++++++------
> kernel/fork.c | 2 +-
> kernel/sched/core.c | 7 +++--
> kernel/sched/sched.h | 47 +++++++++++++++++++---------
> 5 files changed, 97 insertions(+), 27 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0c17e59e3767..7e73b0fc1305 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1039,7 +1039,7 @@ static int exec_mmap(struct mm_struct *mm)
> active_mm = tsk->active_mm;
> tsk->active_mm = mm;
> tsk->mm = mm;
> - mm_init_cid(mm);
> + mm_init_cid(mm, tsk);
> /*
> * This prevents preemption while active_mm is being loaded and
> * it and mm are being updated, which could cause problems for
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index af3a0256fa93..7d63d27862e4 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -755,6 +755,7 @@ struct vm_area_struct {
> struct mm_cid {
> u64 time;
> int cid;
> + int recent_cid;
> };
> #endif
>
> @@ -825,6 +826,20 @@ struct mm_struct {
> * When the next mm_cid scan is due (in jiffies).
> */
> unsigned long mm_cid_next_scan;
> + /**
> + * @nr_cpus_allowed: Number of CPUs allowed for mm.
> + *
> + * Number of CPUs in the union of all the mm's
> + * threads' allowed CPU masks.
> + */
> + atomic_t nr_cpus_allowed;
> + /**
> + * @nr_cids_used: Number of used concurrency IDs.
> + *
> + * The highest concurrency ID allocated for the
> + * mm is nr_cids_used - 1.
> + */
> + atomic_t nr_cids_used;
> #endif
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* size of all page tables */
> @@ -1143,18 +1158,30 @@ static inline int mm_cid_clear_lazy_put(int cid)
> return cid & ~MM_CID_LAZY_PUT;
> }
>
> +/*
> + * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
> + */
> +static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
> +{
> + unsigned long bitmap = (unsigned long)mm;
> +
> + bitmap += offsetof(struct mm_struct, cpu_bitmap);
> + /* Skip cpu_bitmap */
> + bitmap += cpumask_size();
> + return (struct cpumask *)bitmap;
> +}
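If I read the pointer arithmetic right, the variable-size tail of
mm_struct is now laid out as:

	cpu_bitmap       cpumask_size() bytes (arch TLB-flush use)
	mm_cpus_allowed  cpumask_size() bytes (new)
	mm_cidmask       cpumask_size() bytes

which matches mm_cid_size() returning 2 * cpumask_size() below.
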
> +
> /* Accessor for struct mm_struct's cidmask. */
> static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
> {
> - unsigned long cid_bitmap = (unsigned long)mm;
> + unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);
>
> - cid_bitmap += offsetof(struct mm_struct, cpu_bitmap);
> - /* Skip cpu_bitmap */
> + /* Skip mm_cpus_allowed */
> cid_bitmap += cpumask_size();
> return (struct cpumask *)cid_bitmap;
> }
>
> -static inline void mm_init_cid(struct mm_struct *mm)
> +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
> {
> int i;
>
> @@ -1162,17 +1189,21 @@ static inline void mm_init_cid(struct mm_struct *mm)
> struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
>
> pcpu_cid->cid = MM_CID_UNSET;
> + pcpu_cid->recent_cid = MM_CID_UNSET;
> pcpu_cid->time = 0;
> }
> + atomic_set(&mm->nr_cpus_allowed, p->nr_cpus_allowed);
> + atomic_set(&mm->nr_cids_used, 0);
> + cpumask_copy(mm_cpus_allowed(mm), p->cpus_ptr);
> cpumask_clear(mm_cidmask(mm));
> }
>
> -static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
> +static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
> {
> mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
> if (!mm->pcpu_cid)
> return -ENOMEM;
> - mm_init_cid(mm);
> + mm_init_cid(mm, p);
> return 0;
> }
> #define mm_alloc_cid(...) alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
> @@ -1185,16 +1216,33 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
>
> static inline unsigned int mm_cid_size(void)
> {
> - return cpumask_size();
> + return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */
> +}
> +
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
> +{
> + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
> + int cpu, nr_set = 0;
> +
> + if (!mm)
> + return;
> + /* mm_cpus_allowed is the union of all threads' allowed CPU masks. */
> + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
> + if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
> + nr_set++;
> + }
You can do the same more nicely:

	for_each_cpu(cpu, cpumask)
		nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);

To me this is a bit simpler, and it should be faster.
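The whole helper would then boil down to (untested sketch, with the
!mm check moved ahead of the pointer arithmetic):

	static inline void mm_set_cpus_allowed(struct mm_struct *mm,
					       const struct cpumask *cpumask)
	{
		struct cpumask *mm_allowed;
		int cpu, nr_set = 0;

		if (!mm)
			return;
		mm_allowed = mm_cpus_allowed(mm);
		/* mm_cpus_allowed is the union of all threads' allowed CPU masks. */
		for_each_cpu(cpu, cpumask)
			nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
		atomic_add(nr_set, &mm->nr_cpus_allowed);
	}
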
What concerns me is that you call an atomic function in a loop, which
makes the whole procedure non-atomic. If that's OK, can you add a
comment explaining why a series of atomic ops is safe here? If not, I
believe external locking would be needed.
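If locking does turn out to be needed, a minimal sketch with a
hypothetical per-mm lock could look like:

	raw_spin_lock(&mm->cpus_allowed_lock);	/* hypothetical lock */
	for_each_cpu(cpu, cpumask)
		nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
	atomic_add(nr_set, &mm->nr_cpus_allowed);
	raw_spin_unlock(&mm->cpus_allowed_lock);
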
Thanks,
Yury
> + atomic_add(nr_set, &mm->nr_cpus_allowed);
> }
> #else /* CONFIG_SCHED_MM_CID */
> -static inline void mm_init_cid(struct mm_struct *mm) { }
> -static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
> +static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
> +static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
> static inline void mm_destroy_cid(struct mm_struct *mm) { }
> +
> static inline unsigned int mm_cid_size(void)
> {
> return 0;
> }
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
> #endif /* CONFIG_SCHED_MM_CID */
>
> struct mmu_gather;
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 99076dbe27d8..b44f545ad82c 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1298,7 +1298,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> if (init_new_context(p, mm))
> goto fail_nocontext;
>
> - if (mm_alloc_cid(mm))
> + if (mm_alloc_cid(mm, p))
> goto fail_cid;
>
> if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e84a3b7b7bb..3243e9abfefb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2784,6 +2784,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
> put_prev_task(rq, p);
>
> p->sched_class->set_cpus_allowed(p, ctx);
> + mm_set_cpus_allowed(p->mm, ctx->new_mask);
>
> if (queued)
> enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
> @@ -11784,6 +11785,7 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
> */
> if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
> return -1;
> + WRITE_ONCE(src_pcpu_cid->recent_cid, MM_CID_UNSET);
> return src_cid;
> }
>
> @@ -11825,7 +11827,7 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
> dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
> dst_cid = READ_ONCE(dst_pcpu_cid->cid);
> if (!mm_cid_is_unset(dst_cid) &&
> - atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
> + atomic_read(&mm->mm_users) >= atomic_read(&mm->nr_cpus_allowed))
> return;
> src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu);
> src_rq = cpu_rq(src_cpu);
> @@ -11843,6 +11845,7 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
> /* Move src_cid to dst cpu. */
> mm_cid_snapshot_time(dst_rq, mm);
> WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
> + WRITE_ONCE(dst_pcpu_cid->recent_cid, src_cid);
> }
>
> static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,
> @@ -12079,7 +12082,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
> * Matches barrier in sched_mm_cid_remote_clear_old().
> */
> smp_mb();
> - t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
> + t->last_mm_cid = t->mm_cid = mm_cid_get(rq, t, mm);
> }
> rseq_set_notify_resume(t);
> }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 38aeedd8a6cc..4d11dbd5847b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3311,24 +3311,40 @@ static inline void mm_cid_put(struct mm_struct *mm)
> __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> }
>
> -static inline int __mm_cid_try_get(struct mm_struct *mm)
> +static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
> {
> - struct cpumask *cpumask;
> - int cid;
> + struct cpumask *cidmask = mm_cidmask(mm);
> + struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> + int cid = __this_cpu_read(pcpu_cid->recent_cid);
>
> - cpumask = mm_cidmask(mm);
> + /* Try to re-use recent cid. This improves cache locality. */
> + if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
> + return cid;
> + /*
> + * Expand cid allocation if used cids are below the number of CPUs
> + * allowed and number of threads. Expanding cid allocation as
> + * much as possible improves cache locality.
> + */
> + cid = atomic_read(&mm->nr_cids_used);
> + while (cid < atomic_read(&mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
> + if (!atomic_try_cmpxchg(&mm->nr_cids_used, &cid, cid + 1))
> + continue;
> + if (!cpumask_test_and_set_cpu(cid, cidmask))
> + return cid;
> + }
> /*
> + * Find the first available concurrency id.
> * Retry finding first zero bit if the mask is temporarily
> * filled. This only happens during concurrent remote-clear
> * which owns a cid without holding a rq lock.
> */
> for (;;) {
> - cid = cpumask_first_zero(cpumask);
> - if (cid < nr_cpu_ids)
> + cid = cpumask_first_zero(cidmask);
> + if (cid < atomic_read(&mm->nr_cpus_allowed))
> break;
> cpu_relax();
> }
> - if (cpumask_test_and_set_cpu(cid, cpumask))
> + if (cpumask_test_and_set_cpu(cid, cidmask))
> return -1;
> return cid;
> }
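For illustration, my reading of the new bound in the retry loop: with
two threads pinned to CPUs 6 and 7, mm->nr_cpus_allowed is 2, so even
if a concurrent remote-clear temporarily fills the mask, only cid 0 or
1 can be returned; with the old nr_cpu_ids check, the first zero bit
could land anywhere up to nr_cpu_ids - 1.
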
> @@ -3345,7 +3361,8 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
> WRITE_ONCE(pcpu_cid->time, rq->clock);
> }
>
> -static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> +static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
> + struct mm_struct *mm)
> {
> int cid;
>
> @@ -3355,13 +3372,13 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> * guarantee forward progress.
> */
> if (!READ_ONCE(use_cid_lock)) {
> - cid = __mm_cid_try_get(mm);
> + cid = __mm_cid_try_get(t, mm);
> if (cid >= 0)
> goto end;
> raw_spin_lock(&cid_lock);
> } else {
> raw_spin_lock(&cid_lock);
> - cid = __mm_cid_try_get(mm);
> + cid = __mm_cid_try_get(t, mm);
> if (cid >= 0)
> goto unlock;
> }
> @@ -3381,7 +3398,7 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> * all newcoming allocations observe the use_cid_lock flag set.
> */
> do {
> - cid = __mm_cid_try_get(mm);
> + cid = __mm_cid_try_get(t, mm);
> cpu_relax();
> } while (cid < 0);
> /*
> @@ -3397,7 +3414,8 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> return cid;
> }
>
> -static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
> +static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
> + struct mm_struct *mm)
> {
> struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> struct cpumask *cpumask;
> @@ -3414,8 +3432,9 @@ static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
> if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
> __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> }
> - cid = __mm_cid_get(rq, mm);
> + cid = __mm_cid_get(rq, t, mm);
> __this_cpu_write(pcpu_cid->cid, cid);
> + __this_cpu_write(pcpu_cid->recent_cid, cid);
> return cid;
> }
>
> @@ -3467,7 +3486,7 @@ static inline void switch_mm_cid(struct rq *rq,
> prev->mm_cid = -1;
> }
> if (next->mm_cid_active)
> - next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
> + next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
> }
>
> #else
> --
> 2.39.2