Message-ID: <20250929114225.36172-1-gmonaco@redhat.com>
Date: Mon, 29 Sep 2025 13:42:21 +0200
From: Gabriele Monaco <gmonaco@...hat.com>
To: linux-kernel@...r.kernel.org,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>
Cc: Gabriele Monaco <gmonaco@...hat.com>
Subject: [PATCH v3 0/4] sched: Run task_mm_cid_work in batches to lower latency
This V3 of [1] is a continuation of [2] but uses a simpler approach.
The task_mm_cid_work runs as a task_work when returning to userspace and
causes a non-negligible scheduling latency, mostly because it iterates
over all cores.
Split the work into several batches: each call to task_mm_cid_work does
not run for all CPUs but only for a configurable number of CPUs. The
next runs pick up where the previous one left off.
The mechanism that avoids running too frequently (100ms) is enforced
only when finishing all cpus, that is when starting from 0.
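
As an illustration of the batching and rate limiting described above,
here is a minimal userspace sketch. It is not the code in the series:
struct scan_state, scan_batch() and SCAN_DELAY_MS are hypothetical names,
the CPU range is assumed dense, and time is passed in explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCAN_DELAY_MS 100 /* minimum delay between two full scans */

/* Hypothetical per-mm scan state; field names are illustrative only. */
struct scan_state {
	unsigned int next_cpu;  /* where the next batch starts */
	uint64_t next_scan_ms;  /* earliest time a new full scan may start */
};

/*
 * Visit at most @batch CPUs starting from the saved cursor.
 * Returns the number of CPUs visited (0 if rate limited).
 */
static unsigned int scan_batch(struct scan_state *st, unsigned int nr_cpus,
			       unsigned int batch, uint64_t now_ms,
			       void (*visit)(unsigned int cpu))
{
	unsigned int visited = 0;

	assert(batch > 0 && nr_cpus > 0);

	/* The rate limit applies only when a new full scan begins. */
	if (st->next_cpu == 0) {
		if (now_ms < st->next_scan_ms)
			return 0;
		st->next_scan_ms = now_ms + SCAN_DELAY_MS;
	}

	while (visited < batch && st->next_cpu < nr_cpus) {
		if (visit)
			visit(st->next_cpu);
		st->next_cpu++;
		visited++;
	}

	/* Wrap around once every CPU has been covered. */
	if (st->next_cpu >= nr_cpus)
		st->next_cpu = 0;

	return visited;
}
```

With 20 CPUs and a batch of 8, three consecutive calls visit 8, 8 and 4
CPUs; a fourth call is then refused until the delay has elapsed, since
the cursor is back at 0.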
Also improve the predictability of the scan on short-running tasks by
scheduling it from rseq_sched_switch_event, which runs on every task
switch (similar behaviour to [2]). The workaround on the tick for
long-running tasks seen in [2] is ported here as well.
The duration of a full scan now depends on the workload: workloads with
fewer threads are more likely to take longer.
Tests with cyclictest (threads with 100us period) and hackbench
(processes) on a 128-CPU machine, measuring the time to complete the
scan as well as the time between non-complete scans, showed the
following (batch size of 8: 16 iterations):
  cyclictest: delay 0-400 us,   complete scan 1.5-2 ms
  hackbench:  delay 5 us-3 ms,  complete scan 1.5-15 ms
With the observed worst case for hackbench, it would take more than 800
CPUs to reach the current 100ms limit.
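
The estimate above can be reproduced with a small calculation. This
sketch assumes that the duration of a full scan grows linearly with the
number of CPUs, which is an assumption of the example rather than a
measured property; cpus_to_reach_limit() is a hypothetical helper.

```c
#include <assert.h>

/*
 * CPUs needed for a full scan to last @limit_ms, extrapolating linearly
 * from a scan of @measured_cpus CPUs that took @measured_scan_ms.
 */
static unsigned int cpus_to_reach_limit(unsigned int measured_cpus,
					unsigned int measured_scan_ms,
					unsigned int limit_ms)
{
	assert(measured_scan_ms > 0);
	return measured_cpus * limit_ms / measured_scan_ms;
}
```

With the hackbench worst case (128 CPUs, 15 ms) and the 100 ms limit,
cpus_to_reach_limit(128, 15, 100) gives 853, in line with "more than
800".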
The problematic latency was observed on a full scan (128 CPUs), where
the call to task_mm_cid_scan lasted around 35 us; 20 us is considered a
relevant latency on this machine.
Measurements showed these durations for each call to task_mm_cid_scan:
batch size 8: 1-11 us (majority below 10)
batch size 16: 3-16 us (majority below 10)
batch size 32: 10-21 us (majority above 15)
This led to a choice of 16 as default batch size.
Patch 1 adds support for prev_sum_exec_runtime to the RT, deadline and
sched_ext classes, as is already done by fair. This is required to avoid
calling rseq_preempt on tick if the runtime is below a threshold.
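
A rough sketch of the check that prev_sum_exec_runtime makes possible:
the time a task has run since it was last scheduled in is the difference
between the two counters. struct entity_times stands in for the relevant
sched_entity fields, and both ran_long_enough() and the threshold value
are hypothetical; the series may use different names and values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold; the actual value in the series may differ. */
#define CID_TICK_MIN_RUNTIME_NS (100ULL * 1000 * 1000)

/* Minimal stand-in for the two sched_entity fields involved. */
struct entity_times {
	uint64_t sum_exec_runtime;      /* total runtime so far */
	uint64_t prev_sum_exec_runtime; /* snapshot taken when scheduled in */
};

/* True if the task ran at least the threshold since it was scheduled in. */
static bool ran_long_enough(const struct entity_times *se)
{
	uint64_t rtime;

	assert(se->sum_exec_runtime >= se->prev_sum_exec_runtime);
	rtime = se->sum_exec_runtime - se->prev_sum_exec_runtime;

	return rtime >= CID_TICK_MIN_RUNTIME_NS;
}
```

On tick, the extra work would be skipped whenever this returns false,
so only long-running tasks pay for it.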
Patch 2 schedules the task_mm_cid_work from rseq_sched_switch_event().
Patch 3 splits the work into batches.
Patch 4 adds a selftest to validate the functionality of the
task_mm_cid_work (i.e. to compact the mm_cids).
Changes since V2 [3]:
* Rebase on rseq rework [4].
* Revert to using task_work.
* Start the work in rseq_sched_switch_event().
Changes since V1 [1]:
* Use cpu_possible_mask in scan.
* Make sure batches have the same number of CPUs also if mask is sparse.
* Run the task on rseq_handle_notify_resume as in [2] but call directly.
* Schedule the work and mm_cid update on tick for long running tasks.
* Fix condition for need_scan only on first batch.
* Change RSEQ_CID_SCAN_BATCH default to be a power of 2.
* Rebase selftest on [2].
* Increase the selftest timeout on large systems.
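
To illustrate the V1 change about sparse masks, the sketch below counts
only the CPUs set in the mask towards the batch size, so holes do not
shrink a batch. It uses a plain 64-bit word instead of a cpumask, and
scan_sparse_batch() is a hypothetical helper, not code from the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Visit up to @batch set bits of @mask starting at bit @start.
 * Stores the number of CPUs visited in @visited and returns the bit at
 * which the next batch should resume, or 0 once the mask is exhausted.
 */
static unsigned int scan_sparse_batch(uint64_t mask, unsigned int start,
				      unsigned int batch, unsigned int *visited)
{
	unsigned int cpu = start;

	assert(batch > 0);
	*visited = 0;

	/* Only CPUs present in the mask count towards the batch. */
	while (cpu < 64 && *visited < batch) {
		if (mask & (UINT64_C(1) << cpu))
			(*visited)++;
		cpu++;
	}

	/* Skip trailing holes so that an exhausted mask wraps to 0. */
	while (cpu < 64 && !(mask & (UINT64_C(1) << cpu)))
		cpu++;

	return cpu < 64 ? cpu : 0;
}
```

For a mask with CPUs 0-3 and 8-11 and a batch of 3, the batches cover
CPUs {0,1,2}, {3,8,9} and {10,11} rather than being cut short by the
hole between 4 and 7.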
[1] - https://lore.kernel.org/lkml/20250217112317.258716-1-gmonaco@redhat.com
[2] - https://lore.kernel.org/lkml/20250707144824.117014-1-gmonaco@redhat.com
[3] - https://lore.kernel.org/lkml/20250716160603.138385-6-gmonaco@redhat.com
[4] - https://lore.kernel.org/lkml/20250908212737.353775467@linutronix.de
Gabriele Monaco (4):
  sched: Add prev_sum_exec_runtime support for RT, DL and SCX classes
  rseq: Schedule the mm_cid_compaction from rseq_sched_switch_event()
  sched: Compact RSEQ concurrency IDs in batches
  selftests/rseq: Add test for mm_cid compaction

 include/linux/mm_types.h                    |  26 +++
 include/linux/rseq.h                        |   3 +
 include/linux/sched.h                       |   3 +
 init/Kconfig                                |  12 ++
 kernel/sched/core.c                         |  79 ++++++-
 kernel/sched/deadline.c                     |   1 +
 kernel/sched/ext.c                          |   1 +
 kernel/sched/rt.c                           |   1 +
 kernel/sched/sched.h                        |   2 +
 tools/testing/selftests/rseq/.gitignore     |   1 +
 tools/testing/selftests/rseq/Makefile       |   2 +-
 .../selftests/rseq/mm_cid_compaction_test.c | 204 ++++++++++++++++++
 12 files changed, 324 insertions(+), 11 deletions(-)
 create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c

base-commit: 1822acbae2c9b8a6e6472809b42eab72cd7bf80c
--
2.51.0