Message-ID: <eee21fae-dc64-40bf-90ac-c7228ae7ef48@efficios.com>
Date: Wed, 11 Dec 2024 12:07:53 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>, Mel Gorman <mgorman@...e.de>,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc: Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched: Move task_mm_cid_work to mm delayed work
On 2024-12-11 07:27, Gabriele Monaco wrote:
>
> On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
>> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
>>> A small tweak on your proposed approach: in phase 1, get each thread
>>> to publish which mm_cid they observe, and select one thread which
>>> has observed mm_cid > 1 (possibly the largest mm_cid) as the thread
>>> that will keep running in phase 2 (in addition to the main thread).
>>>
>>> All threads other than the main thread and that selected thread exit
>>> and are joined before phase 2.
>>>
>>> So you end up in phase 2 with:
>>>
>>> - main (observed any mm_cid)
>>> - selected thread (observed mm_cid > 1, possibly largest)
>>>
>>> Then after a while, the selected thread should observe a
>>> mm_cid <= 1.
>>>
>>> This test should be skipped if there are fewer than 3 CPUs in the
>>> allowed cpumask (sched_getaffinity).
>>
>> Even better:
>>
>> For a sched_getaffinity with N cpus:
>>
>> - If N == 1 -> skip (we cannot validate anything)
>>
>> Phase 1: create N - 1 pthreads, each pinned to a CPU. The main thread
>> is also pinned to a CPU.
>>
>> Publish the mm_cids observed by each thread, including the main thread.
>>
>> Select a new leader for phase 2: a thread which has observed a nonzero
>> mm_cid. Every other thread, possibly including the main thread, issues
>> pthread_exit, and the new leader does a pthread_join on each of them.
>>
>> Then check that the new leader eventually observes mm_cid == 0.
>>
>> And it works with an allowed cpu mask that has only 2 cpus.
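To make that concrete, here is an untested sketch of such a test. It assumes
the rseq selftests helpers rseq_register_current_thread() and
rseq_current_mm_cid() from tools/testing/selftests/rseq/rseq.h (helper names
to be double-checked); the barrier-based leader hand-off, the KSFT-style skip
code 4, and the ~10 s poll budget are arbitrary illustration choices:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "rseq.h"	/* selftests helpers (assumed) */

static int slot_cpu[CPU_SETSIZE];	/* allowed CPU per slot; slot 0 is main */
static int slot_cid[CPU_SETSIZE];	/* mm_cid observed by each worker slot */
static int nr_workers, leader = -1;
static pthread_t tid[CPU_SETSIZE];
static pthread_barrier_t observed, selected;

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		abort();
}

static void *worker(void *arg)
{
	int slot = (int)(long)arg, i;

	pin_to(slot_cpu[slot]);
	if (rseq_register_current_thread())
		abort();
	/* Phase 1: publish the mm_cid seen while all N threads are alive. */
	slot_cid[slot] = rseq_current_mm_cid();
	pthread_barrier_wait(&observed);
	pthread_barrier_wait(&selected);	/* main picks the leader in between */
	if (slot != leader)
		return NULL;			/* non-leaders exit, main joins them */

	/* Phase 2: only this thread remains; mm_cid should compact to 0. */
	for (i = 0; i < 100; i++) {
		if (rseq_current_mm_cid() == 0)
			exit(EXIT_SUCCESS);
		usleep(100000);		/* leave time for task_mm_cid_work */
	}
	fprintf(stderr, "mm_cid still %d after phase 2\n", rseq_current_mm_cid());
	exit(EXIT_FAILURE);
}

int main(void)
{
	cpu_set_t allowed;
	int cpu, nr_slots = 0, i;

	if (sched_getaffinity(0, sizeof(allowed), &allowed))
		abort();
	if (CPU_COUNT(&allowed) < 2)
		return 4;	/* skip: nothing to validate on a single CPU */
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &allowed))
			slot_cpu[nr_slots++] = cpu;
	nr_workers = nr_slots - 1;	/* main takes slot 0 */
	pin_to(slot_cpu[0]);
	if (rseq_register_current_thread())
		abort();
	pthread_barrier_init(&observed, NULL, nr_workers + 1);
	pthread_barrier_init(&selected, NULL, nr_workers + 1);
	for (i = 1; i <= nr_workers; i++)
		if (pthread_create(&tid[i], NULL, worker, (void *)(long)i))
			abort();
	pthread_barrier_wait(&observed);
	/* Select a worker which observed a nonzero mm_cid as the phase 2 leader. */
	for (i = 1; i <= nr_workers; i++)
		if (slot_cid[i] > 0)
			leader = i;
	if (leader < 0) {
		fprintf(stderr, "no worker observed a nonzero mm_cid\n");
		exit(EXIT_FAILURE);
	}
	pthread_barrier_wait(&selected);
	for (i = 1; i <= nr_workers; i++)
		if (i != leader)
			pthread_join(tid[i], NULL);
	pthread_exit(NULL);	/* main exits as well; only the leader keeps running */
}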
>
> Sounds even neater, thanks for the tips, I'll try this last one out!
>
> Coming back to the implementation, I have been trying to validate my
> approach with this test, wrapped my head around it, and found out that
> the test can't actually pass on the latest upstream.
>
> When an mm_cid is lazily dropped to compact the mask, it is re-assigned
> again when the task switches in.
> The change introduced in "sched: Improve cache locality of RSEQ
> concurrency IDs for intermittent workloads" adds a recent_cid, and it
> seems that it is never unset during the test (nothing migrates).
Good point!
>
> Now, I'm still running my first version of the test, so I have a thread
> running on CPU0 with mm_cid=0 and another running on CPU127 with
> mm_cid, say, 127 (weight=2).
> In practice, the test expects 127 to be dropped (since it is > 2), but
> this is not the case: 127 could exhibit better cache locality, so it is
> selected again on the next round.
Understood.
>
> Here's where I'm in doubt: is a compact map more desirable than reusing
> the same mm_cids for cache locality?
This is a tradeoff between:
A) Preserving cache locality after a transition from many threads to few
threads, or after reducing the hamming weight of the allowed cpu mask.
B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask easy
to document and understand.
C) Allowing applications to eventually react to mm_cid compaction after a
reduction of the nr threads or allowed cpu mask: shrinking the ids back
towards 0 makes that compaction easier to track.
D) Making sure applications that periodically reduce and then increase
again the nr threads or allowed cpu mask still benefit from good
cache locality with mm_cid.
> If not, should we perhaps ignore the recent_cid if it's larger than the
> map weight?
> It seems the only way the recent_cid is unset is through migrations, but
> I'm not sure if forcing one would defeat the purpose of the test, as the
> cid could be dropped outside of task_mm_cid_work.
>
> What do you think?
Can you try this patch? (compile-tested only)
commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Date:   Wed Dec 11 11:53:01 2024 -0500

    sched: shrink mm_cid allocation with nr thread/affinity

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
 	struct cpumask *cidmask = mm_cidmask(mm);
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid = __this_cpu_read(pcpu_cid->recent_cid);
+	int cid, max_nr_cid, allowed_max_nr_cid;
 
+	/*
+	 * After shrinking the number of threads or reducing the number
+	 * of allowed cpus, reduce the value of max_nr_cid so expansion
+	 * of cid allocation will preserve cache locality if the number
+	 * of threads or allowed cpus increase again.
+	 */
+	max_nr_cid = atomic_read(&mm->max_nr_cid);
+	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
+	       max_nr_cid > allowed_max_nr_cid) {
+		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
+			break;
+	}
 	/* Try to re-use recent cid. This improves cache locality. */
-	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+	cid = __this_cpu_read(pcpu_cid->recent_cid);
+	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+	    !cpumask_test_and_set_cpu(cid, cidmask))
 		return cid;
 	/*
 	 * Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 	 * and number of threads. Expanding cid allocation as much as
 	 * possible improves cache locality.
 	 */
-	cid = atomic_read(&mm->max_nr_cid);
-	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+	while (max_nr_cid < allowed_max_nr_cid) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
 			continue;
-		if (!cpumask_test_and_set_cpu(cid, cidmask))
-			return cid;
+		if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+			return max_nr_cid;
 	}
 	/*
 	 * Find the first available concurrency id.
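
The effect of the patch is to clamp max_nr_cid down to min(nr_cpus_allowed,
mm_users) before considering recent_cid, so a recent cid that falls outside
the shrunk range is refused and allocation compacts back towards 0 once
threads (or allowed cpus) go away. Here is a stand-alone user-space model of
that policy, just to illustrate the idea (C11 atomics instead of the kernel
primitives; the numbers are made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the max_nr_cid clamp; not the kernel implementation. */
static atomic_int max_nr_cid = 8;	/* grown while 8 threads were running */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Shrink the bound to what the current thread/cpu count allows. */
static int clamp_max_nr_cid(int nr_cpus_allowed, int nr_users)
{
	int allowed = min_int(nr_cpus_allowed, nr_users);
	int cur = atomic_load(&max_nr_cid);

	while (cur > allowed) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&max_nr_cid, &cur, allowed))
			return allowed;
	}
	return cur;
}

/* A recent cid is only re-used if it still fits under the clamped bound. */
static bool may_reuse_recent(int recent_cid, int bound)
{
	return recent_cid >= 0 && recent_cid < bound;	/* >= 0 stands in for !mm_cid_is_unset() */
}

int main(void)
{
	/* Two threads left on a 128-cpu machine: the bound shrinks from 8 to 2. */
	int bound = clamp_max_nr_cid(128, 2);

	printf("bound=%d, reuse cid 7? %d, reuse cid 1? %d\n",
	       bound, may_reuse_recent(7, bound), may_reuse_recent(1, bound));
	return 0;
}

In the test scenario above, once only the leader remains, mm_users drops back
to the remaining thread count, so a stale recent cid such as 127 is refused
and the next allocation compacts towards cid 0.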
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com