Message-ID: <6a86f095-4f3b-46e8-8a42-51bff3d03405@efficios.com>
Date: Mon, 17 Feb 2025 14:46:10 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Gabriele Monaco <gmonaco@...hat.com>, linux-kernel@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, linux-mm@...ck.org
Cc: Ingo Molnar <mingo@...nel.org>, "Paul E. McKenney" <paulmck@...nel.org>
Subject: Re: [PATCH 1/2] sched: Compact RSEQ concurrency IDs in batches

On 2025-02-17 06:23, Gabriele Monaco wrote:
> Currently, the task_mm_cid_work function is called in a task work
> triggered by a scheduler tick to frequently compact the mm_cids of each
> process for each core. This can delay the execution of the corresponding
> thread for the entire duration of the function, negatively affecting
> response times for real-time tasks. In practice, we observe
> task_mm_cid_work adding 30-35us of latency on a 128-core system; this
> order of magnitude is meaningful under PREEMPT_RT.
> 
> Run task_mm_cid_work in batches of up to CONFIG_RSEQ_CID_SCAN_BATCH
> cpus; this bounds the delay each scan can introduce.
> Also shorten the scan by iterating over all present cpus rather than
> all possible ones.

Iterating only on present cpus is not enough with CONFIG_HOTPLUG_CPU=y,
because ACPI can dynamically add/remove CPUs from the set. If we end
up iterating only on present cpus, then we need to add a cpu hotplug
callback to handle the removal case, and I'm not sure the added
complexity is worth it here.
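
For reference, a minimal, untested sketch of what such a hotplug
callback could look like (the state name and the handler below are
hypothetical, not something this patch or the current kernel provides;
only cpuhp_setup_state_nocalls() and CPUHP_BP_PREPARE_DYN are real):

#include <linux/cpuhotplug.h>
#include <linux/init.h>

static int mm_cid_cpu_dead(unsigned int cpu)
{
	/*
	 * Hypothetical teardown handler: it would have to drop or
	 * compact away any concurrency IDs still attributed to the
	 * departing cpu, so a present-only scan cannot miss them.
	 */
	return 0;
}

static int __init mm_cid_hotplug_init(void)
{
	/* Teardown-only dynamic state: no startup callback needed. */
	return cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
					 "sched/mm_cid:dead",
					 NULL, mm_cid_cpu_dead);
}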

> 
> The task_mm_cid_work function already contains a mechanism to avoid
> running more frequently than every 100ms. Given that the function runs
> at every tick, assuming ticks every 1ms (HZ=1000 is common on distros)
> and an unfavorable scenario where only 1 in 10 ticks lands during task
> T's runtime, we can compact the CIDs for task T in about 130ms by
> setting CONFIG_RSEQ_CID_SCAN_BATCH to 10 on a 128-core machine
> (128 cpus / 10 per batch = 13 batches, one every ~10ms).
> This value also drastically reduces the task work duration and yields a
> more acceptable latency for the aforementioned machine.
> 
> Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
> Signed-off-by: Gabriele Monaco <gmonaco@...hat.com>
> ---
>   include/linux/mm_types.h |  8 ++++++++
>   init/Kconfig             | 12 ++++++++++++
>   kernel/sched/core.c      | 27 ++++++++++++++++++++++++---
>   3 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6b..1e0e491d2c5c2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -867,6 +867,13 @@ struct mm_struct {
>   		 * When the next mm_cid scan is due (in jiffies).
>   		 */
>   		unsigned long mm_cid_next_scan;
> +		/*
> +		 * @mm_cid_scan_cpu: Which cpu to start from in the next scan

Other similar comments have a "." at end of line.

> +		 *
> +		 * Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH after each scan
> +		 * save the next cpu index here (or 0 if we are done)

Suggested rewording:

Scan in batches of CONFIG_RSEQ_CID_SCAN_BATCH. This field holds
the next cpu index after each scan, or 0 if all batches are
done.


> +		 */
> +		unsigned int mm_cid_scan_cpu;
>   		/**
>   		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
>   		 *
> @@ -1249,6 +1256,7 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
>   	raw_spin_lock_init(&mm->cpus_allowed_lock);
>   	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
>   	cpumask_clear(mm_cidmask(mm));
> +	mm->mm_cid_scan_cpu = 0;
>   }
>   
>   static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b3..39f1d4c7980c0 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1813,6 +1813,18 @@ config DEBUG_RSEQ
>   
>   	  If unsure, say N.
>   
> +config RSEQ_CID_SCAN_BATCH
> +	int "Number of CPUs to scan every time we attempt mm_cid compaction"

Reword without "we".

> +	range 1 NR_CPUS
> +	default 10
> +	depends on SCHED_MM_CID
> +	help
> +	  CPUs are scanned pseudo-periodically to compact the CID of each task;
> +	  this operation can take longer on systems with many
> +	  CPUs, resulting in higher scheduling latency for the current task.
> +	  A higher value means the CID is compacted faster, but results in
> +	  higher scheduling latency.
> +
>   config CACHESTAT_SYSCALL
>   	bool "Enable cachestat() system call" if EXPERT
>   	default y
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9aecd914ac691..8d1cce4ed62c6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10536,7 +10536,7 @@ static void task_mm_cid_work(struct callback_head *work)
>   	struct task_struct *t = current;
>   	struct cpumask *cidmask;
>   	struct mm_struct *mm;
> -	int weight, cpu;
> +	int weight, cpu, from_cpu, to_cpu;
>   
>   	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
>   
> @@ -10546,6 +10546,15 @@ static void task_mm_cid_work(struct callback_head *work)
>   	mm = t->mm;
>   	if (!mm)
>   		return;
> +	cpu = from_cpu = READ_ONCE(mm->mm_cid_scan_cpu);
> +	to_cpu = from_cpu + CONFIG_RSEQ_CID_SCAN_BATCH;
> +	if (from_cpu > cpumask_last(cpu_present_mask)) {

See explanation about using possible rather than present.

> +		from_cpu = 0;
> +		to_cpu = CONFIG_RSEQ_CID_SCAN_BATCH;

If the cpu_possible_mask is sparsely populated, this will end
up doing batches that hit very few cpus. Instead, we should
count how many cpus are actually handled within each of the
for_each_cpu_from(cpu, cpu_possible_mask) loops below and break
when reaching CONFIG_RSEQ_CID_SCAN_BATCH.
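
Something along these lines (untested sketch of the suggested
structure, reusing the helpers already present in this function):

	int nr_scanned = 0;

	cpu = from_cpu;
	for_each_cpu_from(cpu, cpu_possible_mask) {
		sched_mm_cid_remote_clear_old(mm, cpu);
		/* Count cpus actually handled, not index distance. */
		if (++nr_scanned == CONFIG_RSEQ_CID_SCAN_BATCH)
			break;
	}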

> +	}
> +	if (from_cpu != 0)
> +		/* Delay scan only if we are done with all cpus. */
> +		goto cid_compact;
>   	old_scan = READ_ONCE(mm->mm_cid_next_scan);
>   	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
>   	if (!old_scan) {
> @@ -10561,17 +10570,29 @@ static void task_mm_cid_work(struct callback_head *work)
>   		return;
>   	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
>   		return;
> +
> +cid_compact:
> +	if (!try_cmpxchg(&mm->mm_cid_scan_cpu, &cpu, to_cpu))
> +		return;
>   	cidmask = mm_cidmask(mm);
>   	/* Clear cids that were not recently used. */
> -	for_each_possible_cpu(cpu)
> +	cpu = from_cpu;
> +	for_each_cpu_from(cpu, cpu_present_mask) {
> +		if (cpu == to_cpu)
> +			break;
>   		sched_mm_cid_remote_clear_old(mm, cpu);
> +	}
>   	weight = cpumask_weight(cidmask);
>   	/*
>   	 * Clear cids that are greater or equal to the cidmask weight to
>   	 * recompact it.
>   	 */
> -	for_each_possible_cpu(cpu)
> +	cpu = from_cpu;
> +	for_each_cpu_from(cpu, cpu_present_mask) {
> +		if (cpu == to_cpu)
> +			break;
>   		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
> +	}

Here, set mm->mm_cid_scan_cpu to the new next position, which is
the value of the iterator after the "for each" loop.
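
Roughly, as an untested sketch:

	/* Publish where the next batch should resume from. */
	WRITE_ONCE(mm->mm_cid_scan_cpu, cpu);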

Thanks,

Mathieu

>   }
>   
>   void init_sched_mm_cid(struct task_struct *t)


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
