Message-ID: <eee21fae-dc64-40bf-90ac-c7228ae7ef48@efficios.com>
Date: Wed, 11 Dec 2024 12:07:53 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>,
 Andrew Morton <akpm@...ux-foundation.org>, Mel Gorman <mgorman@...e.de>,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc: Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched: Move task_mm_cid_work to mm delayed work

On 2024-12-11 07:27, Gabriele Monaco wrote:
> 
> On Mon, 2024-12-09 at 10:48 -0500, Mathieu Desnoyers wrote:
>> On 2024-12-09 10:33, Mathieu Desnoyers wrote:
>>> A small tweak on your proposed approach: in phase 1, get each
>>> thread to publish which mm_cid it observes, and select one thread
>>> which has observed mm_cid > 1 (possibly the largest mm_cid) as the
>>> thread that will keep running in phase 2 (in addition to the main
>>> thread).
>>>
>>> All threads other than the main thread and that selected thread
>>> exit and are joined before phase 2.
>>>
>>> So you end up in phase 2 with:
>>>
>>> - main (observed any mm_cid)
>>> - selected thread (observed mm_cid > 1, possibly largest)
>>>
>>> Then after a while, the selected thread should observe an
>>> mm_cid <= 1.
>>>
>>> This test should be skipped if there are fewer than 3 CPUs in the
>>> allowed cpumask (sched_getaffinity).
>>
>> Even better:
>>
>> For a sched_getaffinity with N cpus:
>>
>> - If N == 1 -> skip (we cannot validate anything)
>>
>> Phase 1: create N - 1 pthreads, each pinned to a CPU. The main
>> thread is also pinned to a CPU.
>>
>> Publish the mm_cids observed by each thread, including the main
>> thread.
>>
>> Select a new leader for phase 2: a thread which has observed a
>> nonzero mm_cid. Every other thread, possibly including the main
>> thread, issues pthread_exit, and the new leader does a pthread join
>> on each of them.
>>
>> Then check that the new leader eventually observes mm_cid == 0.
>>
>> And it works with an allowed cpu mask that has only 2 cpus.
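
For reference, here is a rough, untested sketch of that scheme. It assumes
the rseq selftest helpers rseq_register_current_thread() and
rseq_current_mm_cid() from tools/testing/selftests/rseq/rseq.h; all other
names (and the 120s timeout) are made up for illustration, not taken from
an existing test:

/*
 * Rough sketch of the above scheme (untested): assumes the rseq selftest
 * helpers rseq_register_current_thread() and rseq_current_mm_cid() from
 * tools/testing/selftests/rseq/rseq.h; other names are made up.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include "rseq.h"

static int nr_cpus, cpus[CPU_SETSIZE];
static int observed_cid[CPU_SETSIZE];
static pthread_t tids[CPU_SETSIZE];
static pthread_barrier_t phase1, phase2;
static int leader = -1;

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		abort();
}

/* Phase 2: join every other thread, then expect mm_cid to shrink to 0. */
static void leader_phase2(int self)
{
	int i;

	for (i = 0; i < nr_cpus; i++)
		if (i != self)
			pthread_join(tids[i], NULL);
	for (i = 0; i < 120; i++) {
		if (rseq_current_mm_cid() == 0)
			exit(EXIT_SUCCESS);
		sleep(1);
	}
	fprintf(stderr, "mm_cid did not shrink back to 0\n");
	exit(EXIT_FAILURE);
}

static void *thread_fn(void *arg)
{
	int idx = (int)(long)arg;

	if (rseq_register_current_thread())
		abort();
	pin_to(cpus[idx]);
	/* Phase 1: publish the mm_cid this pinned thread observes. */
	observed_cid[idx] = rseq_current_mm_cid();
	pthread_barrier_wait(&phase1);	/* all observations published */
	pthread_barrier_wait(&phase2);	/* leader index is now known */
	if (idx != leader)
		pthread_exit(NULL);	/* non-leaders exit and get joined */
	leader_phase2(idx);
	return NULL;
}

int main(void)
{
	cpu_set_t allowed;
	int i;

	if (sched_getaffinity(0, sizeof(allowed), &allowed))
		abort();
	for (i = 0; i < CPU_SETSIZE; i++)
		if (CPU_ISSET(i, &allowed))
			cpus[nr_cpus++] = i;
	if (nr_cpus < 2)
		return 4;		/* skip: cannot validate anything */
	if (rseq_register_current_thread())
		abort();
	pthread_barrier_init(&phase1, NULL, nr_cpus);
	pthread_barrier_init(&phase2, NULL, nr_cpus);
	tids[0] = pthread_self();
	for (i = 1; i < nr_cpus; i++)
		if (pthread_create(&tids[i], NULL, thread_fn, (void *)(long)i))
			abort();
	/* Main takes part in phase 1 like any other thread. */
	pin_to(cpus[0]);
	observed_cid[0] = rseq_current_mm_cid();
	pthread_barrier_wait(&phase1);
	/* New leader for phase 2: a thread which observed a nonzero mm_cid. */
	for (i = 0; i < nr_cpus; i++)
		if (observed_cid[i] > 0)
			leader = i;
	if (leader < 0)
		leader = 0;
	pthread_barrier_wait(&phase2);
	if (leader != 0)
		pthread_exit(NULL);	/* hand over; the leader joins main too */
	leader_phase2(0);
	return 0;
}

The two barriers are only there so the leader index is published before
anyone decides whether to exit; with a 2-cpu affinity mask this degenerates
to main plus one worker, as noted above.
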
> 
> Sounds even neater, thanks for the tips, I'll try this last one out!
> 
> Coming back to the implementation, I have been trying to validate my
> approach with this test, wrapped my head around it, and found out that
> the test can't actually pass on the latest upstream.
> 
> When an mm_cid is lazily dropped to compact the mask, it is re-assigned
> again when switching in.
> The change introduced in "sched: Improve cache locality of RSEQ
> concurrency IDs for intermittent workloads" adds a recent_cid, and it
> seems it is never unset during the test (nothing migrates).

Good point!

> 
> Now, I'm still running my first version of the test, so I have a thread
> running on CPU0 with mm_cid=0 and another running on CPU127 with
> mm_cid, say, 127 (weight=2).
> In practice, the test expects 127 to be dropped (> 2), but this is not
> the case: since 127 could exhibit better cache locality, it is selected
> again on the next round.

Understood.

> 
> Here's where I'm in doubt: is a compact map more desirable than reusing
> the same mm_cids for cache locality?

This is a tradeoff between:

A) Preserving cache locality after a transition from many threads to few
    threads, or after reducing the hamming weight of the allowed cpu mask.

B) Making the mm_cid guarantees wrt nr threads and allowed cpu mask easy
    to document and understand.

C) Allowing applications to eventually react to mm_cid compaction after
    reduction of the nr threads or allowed cpu mask: shrinking the cid
    range back towards 0 makes that compaction easier to track.

D) Making sure applications that periodically reduce and then increase
    again the nr threads or allowed cpu mask still benefit from good
    cache locality with mm_cid.


> If not, should we perhaps ignore the recent_cid if it's larger than the
> map weight?
> It seems the only way the recent_cid gets unset is via migrations, but
> I'm not sure if forcing one would make the test pointless, as the cid
> could then be dropped outside of task_mm_cid_work.
> 
> What do you think?

Can you try this patch? (compile-tested only)

commit 500649e03c5c28443f431829732c580750657326
Author: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Date:   Wed Dec 11 11:53:01 2024 -0500

     sched: shrink mm_cid allocation with nr thread/affinity

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 76f5f53a645f..b92e79770a93 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3657,10 +3657,24 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
  {
  	struct cpumask *cidmask = mm_cidmask(mm);
  	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid = __this_cpu_read(pcpu_cid->recent_cid);
+	int cid, max_nr_cid, allowed_max_nr_cid;
  
+	/*
+	 * After shrinking the number of threads or reducing the number
+	 * of allowed cpus, reduce the value of max_nr_cid so expansion
+	 * of cid allocation will preserve cache locality if the number
+	 * of threads or allowed cpus increase again.
+	 */
+	max_nr_cid = atomic_read(&mm->max_nr_cid);
+	while ((allowed_max_nr_cid = min_t(int, READ_ONCE(mm->nr_cpus_allowed), atomic_read(&mm->mm_users))),
+		max_nr_cid > allowed_max_nr_cid) {
+		if (atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, allowed_max_nr_cid))
+			break;
+	}
  	/* Try to re-use recent cid. This improves cache locality. */
-	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+	cid = __this_cpu_read(pcpu_cid->recent_cid);
+	if (!mm_cid_is_unset(cid) && cid < max_nr_cid &&
+	    !cpumask_test_and_set_cpu(cid, cidmask))
  		return cid;
  	/*
  	 * Expand cid allocation if the maximum number of concurrency
@@ -3668,12 +3682,11 @@ static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
  	 * and number of threads. Expanding cid allocation as much as
  	 * possible improves cache locality.
  	 */
-	cid = atomic_read(&mm->max_nr_cid);
-	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
-		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+	while (max_nr_cid < allowed_max_nr_cid) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &max_nr_cid, max_nr_cid + 1))
  			continue;
-		if (!cpumask_test_and_set_cpu(cid, cidmask))
-			return cid;
+		if (!cpumask_test_and_set_cpu(max_nr_cid, cidmask))
+			return max_nr_cid;
  	}
  	/*
  	 * Find the first available concurrency id.

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

