Message-ID: <CADxym3YMaz8_YkOidJVbKYAXiFLKp4KvYopR3rJRYkiYJvenWw@mail.gmail.com>
Date: Wed, 6 Aug 2025 16:44:19 +0800
From: Menglong Dong <menglong8.dong@...il.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Menglong Dong <menglong.dong@...ux.dev>,
Steven Rostedt <rostedt@...dmis.org>, Jiri Olsa <jolsa@...nel.org>, bpf <bpf@...r.kernel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>, Eduard Zingerman <eddyz87@...il.com>,
LKML <linux-kernel@...r.kernel.org>, Network Development <netdev@...r.kernel.org>
Subject: Re: Inlining migrate_disable/enable. Was: [PATCH bpf-next v2 02/18]
x86,bpf: add bpf_global_caller for global trampoline
On Fri, Aug 1, 2025 at 12:15 AM Alexei Starovoitov
<alexei.starovoitov@...il.com> wrote:
>
> On Mon, Jul 28, 2025 at 2:20 AM Menglong Dong <menglong8.dong@...il.com> wrote:
> >
> > On Thu, Jul 17, 2025 at 6:35 AM Alexei Starovoitov
> > <alexei.starovoitov@...il.com> wrote:
> > >
> > > On Wed, Jul 16, 2025 at 11:24 AM Peter Zijlstra <peterz@...radead.org> wrote:
> > > >
> > > > On Wed, Jul 16, 2025 at 09:56:11AM -0700, Alexei Starovoitov wrote:
> > > >
> > > > > Maybe Peter has better ideas ?
> > > >
> > > > Is it possible to express runqueues::nr_pinned as an alias?
> > > >
> > > > extern unsigned int __attribute__((alias("runqueues.nr_pinned"))) this_nr_pinned;
> > > >
> > > > And use:
> > > >
> > > > __this_cpu_inc(&this_nr_pinned);
> > > >
> > > >
> > > > This syntax doesn't actually seem to work; but can we construct
> > > > something like that?
> > >
> > > Yeah. Iant is right. It's a string and not a pointer dereference.
> > > It never worked.
> > >
> > > Few options:
> > >
> > > 1.
> > > struct rq {
> > > +#ifdef CONFIG_SMP
> > > + unsigned int nr_pinned;
> > > +#endif
> > > /* runqueue lock: */
> > > raw_spinlock_t __lock;
> > >
> > > @@ -1271,9 +1274,6 @@ struct rq {
> > > struct cpuidle_state *idle_state;
> > > #endif
> > >
> > > -#ifdef CONFIG_SMP
> > > - unsigned int nr_pinned;
> > > -#endif
> > >
> > > but ugly...
> > >
> > > 2.
> > > static unsigned int nr_pinned_offset __ro_after_init __used;
> > > RUNTIME_CONST(nr_pinned_offset, nr_pinned_offset)
> > >
> > > overkill for what's needed
> > >
> > > 3.
> > > OFFSET(RQ_nr_pinned, rq, nr_pinned);
> > > then
> > > #include <generated/asm-offsets.h>
> > >
> > > imo the best.
> >
> > I had a try. The struct rq is not visible to asm-offsets.c, so we
> > can't define it in arch/xx/kernel/asm-offsets.c. Do you mean
> > to define a similar rq-offsets.c in kernel/sched/? It will be more
> > complex than way 2, and I think way 2 is
> > easier :/
>
> 2 maybe easier, but it's an overkill.
> I still think asm-offset is cleaner.
> arch/xx shouldn't be used, of course, since this nr_pinned should
> be generic for all archs.
> We can do something similar to drivers/memory/emif-asm-offsets.c
> and do that within kernel/sched/.
> rq-offsets.c as you said.
> It will generate rq-offsets.h in a build dir that can be #include-d.
>
> I thought about another alternative (as a derivative of 1):
> split nr_pinned from 'struct rq' into its own per-cpu variable,
> but I don't think that will work, since rq_has_pinned_tasks()
> doesn't always operate on this_rq().
> So the acceptable choices are realistically 1 and 3 and
> rq-offsets.c seems cleaner.
> Pls give it another try.
Generally speaking, way 3 works. The only problem is how
we handle this_rq(). I introduced the following code in
include/linux/sched.h:
struct rq;
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
#define this_rq_ptr() arch_raw_cpu_ptr(&runqueues)
this_rq_ptr() is used in migrate_enable(). I have to use
arch_raw_cpu_ptr() for it; this_cpu_ptr() can't be used here, as
it fails in this_cpu_ptr -> raw_cpu_ptr -> __verify_pcpu_ptr:
#define __verify_pcpu_ptr(ptr) \
do { \
const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
(void)__vpp_verify; \
} while (0)
struct rq is not defined here (only forward-declared), which makes
typeof((ptr) + 0) fail to compile. What can we do here?
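For illustration, here is a minimal standalone sketch of the problem and of
the byte-offset workaround (not part of the patch; rq_nr_pinned() and
RQ_NR_PINNED_OFF are made-up names):

struct rq;				/* incomplete, as in linux/sched.h */

#define RQ_NR_PINNED_OFF 2048		/* stand-in for the generated RQ_nr_pinned */

static inline unsigned int *rq_nr_pinned(struct rq *rq)
{
	/*
	 * typeof((rq) + 0), as used by __verify_pcpu_ptr(), does not compile
	 * here: pointer arithmetic needs the complete type. Byte arithmetic
	 * through void * (a GNU C extension the kernel already relies on)
	 * does not, so a build-time byte offset is enough to reach the field.
	 */
	return (unsigned int *)((void *)rq + RQ_NR_PINNED_OFF);
}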
In my testing, the performance of fentry increased from
111M/s to 121M/s with migrate_disable/enable inlined.
The whole patch follows:
-------------------------------------------------------------------------------------------
diff --git a/Kbuild b/Kbuild
index f327ca86990c..13324b4bbe23 100644
--- a/Kbuild
+++ b/Kbuild
@@ -34,13 +34,24 @@ arch/$(SRCARCH)/kernel/asm-offsets.s: $(timeconst-file) $(bounds-file)
$(offsets-file): arch/$(SRCARCH)/kernel/asm-offsets.s FORCE
$(call filechk,offsets,__ASM_OFFSETS_H__)
+# Generate rq-offsets.h
+
+rq-offsets-file := include/generated/rq-offsets.h
+
+targets += kernel/sched/rq-offsets.s
+
+kernel/sched/rq-offsets.s: $(offsets-file)
+
+$(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
+ $(call filechk,offsets,__RQ_OFFSETS_H__)
+
# Check for missing system calls
quiet_cmd_syscalls = CALL $<
cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags) $(missing_syscalls_flags)
PHONY += missing-syscalls
-missing-syscalls: scripts/checksyscalls.sh $(offsets-file)
+missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
$(call cmd,syscalls)
# Check the manual modification of atomic headers
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 1fad1c8a4c76..3a1c08a75c09 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -369,64 +369,6 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
#endif
-/*
- * Migrate-Disable and why it is undesired.
- *
- * When a preempted task becomes elegible to run under the ideal model (IOW it
- * becomes one of the M highest priority tasks), it might still have to wait
- * for the preemptee's migrate_disable() section to complete. Thereby suffering
- * a reduction in bandwidth in the exact duration of the migrate_disable()
- * section.
- *
- * Per this argument, the change from preempt_disable() to migrate_disable()
- * gets us:
- *
- * - a higher priority tasks gains reduced wake-up latency; with preempt_disable()
- * it would have had to wait for the lower priority task.
- *
- * - a lower priority tasks; which under preempt_disable() could've instantly
- * migrated away when another CPU becomes available, is now constrained
- *   by the ability to push the higher priority task away, which might itself be
- * in a migrate_disable() section, reducing it's available bandwidth.
- *
- * IOW it trades latency / moves the interference term, but it stays in the
- * system, and as long as it remains unbounded, the system is not fully
- * deterministic.
- *
- *
- * The reason we have it anyway.
- *
- * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
- * number of primitives into becoming preemptible, they would also allow
- * migration. This turns out to break a bunch of per-cpu usage. To this end,
- * all these primitives employ migirate_disable() to restore this implicit
- * assumption.
- *
- * This is a 'temporary' work-around at best. The correct solution is getting
- * rid of the above assumptions and reworking the code to employ explicit
- * per-cpu locking or short preempt-disable regions.
- *
- * The end goal must be to get rid of migrate_disable(), alternatively we need
- * a schedulability theory that does not depend on abritrary migration.
- *
- *
- * Notes on the implementation.
- *
- * The implementation is particularly tricky since existing code patterns
- * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
- * This means that it cannot use cpus_read_lock() to serialize against hotplug,
- * nor can it easily migrate itself into a pending affinity mask change on
- * migrate_enable().
- *
- *
- * Note: even non-work-conserving schedulers like semi-partitioned depends on
- * migration, so migrate_disable() is not only a problem for
- * work-conserving schedulers.
- *
- */
-extern void migrate_disable(void);
-extern void migrate_enable(void);
-
/**
* preempt_disable_nested - Disable preemption inside a normally preempt disabled section
*
@@ -471,7 +413,6 @@ static __always_inline void preempt_enable_nested(void)
DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
-DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
#ifdef CONFIG_PREEMPT_DYNAMIC
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 40d2fa90df42..365ac6d17504 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -48,6 +48,9 @@
#include <linux/uidgid_types.h>
#include <linux/tracepoint-defs.h>
#include <asm/kmap_size.h>
+#ifndef COMPILE_OFFSETS
+#include <generated/rq-offsets.h>
+#endif
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -2299,4 +2302,127 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
+#if defined(CONFIG_SMP) && !defined(COMPILE_OFFSETS)
+
+extern void __migrate_enable(void);
+
+struct rq;
+DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+#define this_rq_ptr() arch_raw_cpu_ptr(&runqueues)
+
+/*
+ * Migrate-Disable and why it is undesired.
+ *
+ * When a preempted task becomes elegible to run under the ideal model (IOW it
+ * becomes one of the M highest priority tasks), it might still have to wait
+ * for the preemptee's migrate_disable() section to complete. Thereby suffering
+ * a reduction in bandwidth in the exact duration of the migrate_disable()
+ * section.
+ *
+ * Per this argument, the change from preempt_disable() to migrate_disable()
+ * gets us:
+ *
+ * - a higher priority tasks gains reduced wake-up latency; with preempt_disable()
+ * it would have had to wait for the lower priority task.
+ *
+ * - a lower priority tasks; which under preempt_disable() could've instantly
+ * migrated away when another CPU becomes available, is now constrained
+ *   by the ability to push the higher priority task away, which might itself be
+ * in a migrate_disable() section, reducing it's available bandwidth.
+ *
+ * IOW it trades latency / moves the interference term, but it stays in the
+ * system, and as long as it remains unbounded, the system is not fully
+ * deterministic.
+ *
+ *
+ * The reason we have it anyway.
+ *
+ * PREEMPT_RT breaks a number of assumptions traditionally held. By forcing a
+ * number of primitives into becoming preemptible, they would also allow
+ * migration. This turns out to break a bunch of per-cpu usage. To this end,
+ * all these primitives employ migirate_disable() to restore this implicit
+ * assumption.
+ *
+ * This is a 'temporary' work-around at best. The correct solution is getting
+ * rid of the above assumptions and reworking the code to employ explicit
+ * per-cpu locking or short preempt-disable regions.
+ *
+ * The end goal must be to get rid of migrate_disable(), alternatively we need
+ * a schedulability theory that does not depend on abritrary migration.
+ *
+ *
+ * Notes on the implementation.
+ *
+ * The implementation is particularly tricky since existing code patterns
+ * dictate neither migrate_disable() nor migrate_enable() is allowed to block.
+ * This means that it cannot use cpus_read_lock() to serialize against hotplug,
+ * nor can it easily migrate itself into a pending affinity mask change on
+ * migrate_enable().
+ *
+ *
+ * Note: even non-work-conserving schedulers like semi-partitioned depends on
+ * migration, so migrate_disable() is not only a problem for
+ * work-conserving schedulers.
+ *
+ */
+static inline void migrate_enable(void)
+{
+ struct task_struct *p = current;
+
+#ifdef CONFIG_DEBUG_PREEMPT
+ /*
+ * Check both overflow from migrate_disable() and superfluous
+ * migrate_enable().
+ */
+ if (WARN_ON_ONCE((s16)p->migration_disabled <= 0))
+ return;
+#endif
+
+ if (p->migration_disabled > 1) {
+ p->migration_disabled--;
+ return;
+ }
+
+ /*
+ * Ensure stop_task runs either before or after this, and that
+ * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
+ */
+ guard(preempt)();
+ __migrate_enable();
+ /*
+ * Mustn't clear migration_disabled() until cpus_ptr points back at the
+ * regular cpus_mask, otherwise things that race (eg.
+ * select_fallback_rq) get confused.
+ */
+ barrier();
+ p->migration_disabled = 0;
+ (*(unsigned int *)((void *)this_rq_ptr() + RQ_nr_pinned))--;
+}
+
+static inline void migrate_disable(void)
+{
+ struct task_struct *p = current;
+
+ if (p->migration_disabled) {
+#ifdef CONFIG_DEBUG_PREEMPT
+ /*
+ *Warn about overflow half-way through the range.
+ */
+ WARN_ON_ONCE((s16)p->migration_disabled < 0);
+#endif
+ p->migration_disabled++;
+ return;
+ }
+
+ guard(preempt)();
+ (*(unsigned int *)((void *)this_rq_ptr() + RQ_nr_pinned))++;
+ p->migration_disabled = 1;
+}
+#else
+static inline void migrate_disable(void) { }
+static inline void migrate_enable(void) { }
+#endif
+
+DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
+
#endif
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 399f03e62508..75d5f145ca60 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -23853,8 +23853,7 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
BTF_SET_START(btf_id_deny)
BTF_ID_UNUSED
#ifdef CONFIG_SMP
-BTF_ID(func, migrate_disable)
-BTF_ID(func, migrate_enable)
+BTF_ID(func, __migrate_enable)
#endif
#if !defined CONFIG_PREEMPT_RCU && !defined CONFIG_TINY_RCU
BTF_ID(func, rcu_read_unlock_strict)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ec00d08d46a..b521024c99ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -119,6 +119,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+EXPORT_SYMBOL_GPL(runqueues);
#ifdef CONFIG_SCHED_PROXY_EXEC
DEFINE_STATIC_KEY_TRUE(__sched_proxy_exec);
@@ -2375,28 +2376,7 @@ static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
__do_set_cpus_allowed(p, &ac);
}
-void migrate_disable(void)
-{
- struct task_struct *p = current;
-
- if (p->migration_disabled) {
-#ifdef CONFIG_DEBUG_PREEMPT
- /*
- *Warn about overflow half-way through the range.
- */
- WARN_ON_ONCE((s16)p->migration_disabled < 0);
-#endif
- p->migration_disabled++;
- return;
- }
-
- guard(preempt)();
- this_rq()->nr_pinned++;
- p->migration_disabled = 1;
-}
-EXPORT_SYMBOL_GPL(migrate_disable);
-
-void migrate_enable(void)
+void __migrate_enable(void)
{
struct task_struct *p = current;
struct affinity_context ac = {
@@ -2404,37 +2384,10 @@ void migrate_enable(void)
.flags = SCA_MIGRATE_ENABLE,
};
-#ifdef CONFIG_DEBUG_PREEMPT
- /*
- * Check both overflow from migrate_disable() and superfluous
- * migrate_enable().
- */
- if (WARN_ON_ONCE((s16)p->migration_disabled <= 0))
- return;
-#endif
-
- if (p->migration_disabled > 1) {
- p->migration_disabled--;
- return;
- }
-
- /*
- * Ensure stop_task runs either before or after this, and that
- * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule().
- */
- guard(preempt)();
if (p->cpus_ptr != &p->cpus_mask)
__set_cpus_allowed_ptr(p, &ac);
- /*
- * Mustn't clear migration_disabled() until cpus_ptr points back at the
- * regular cpus_mask, otherwise things that race (eg.
- * select_fallback_rq) get confused.
- */
- barrier();
- p->migration_disabled = 0;
- this_rq()->nr_pinned--;
}
-EXPORT_SYMBOL_GPL(migrate_enable);
+EXPORT_SYMBOL_GPL(__migrate_enable);
static inline bool rq_has_pinned_tasks(struct rq *rq)
{
diff --git a/kernel/sched/rq-offsets.c b/kernel/sched/rq-offsets.c
new file mode 100644
index 000000000000..a23747bbe25b
--- /dev/null
+++ b/kernel/sched/rq-offsets.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#define COMPILE_OFFSETS
+#include <linux/kbuild.h>
+#include <linux/types.h>
+#include "sched.h"
+
+int main(void)
+{
+ DEFINE(RQ_nr_pinned, offsetof(struct rq, nr_pinned));
+
+ return 0;
+}
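For reference, the generated include/generated/rq-offsets.h should then end up
looking roughly like this (the offset value below is made up; the real one
depends on the kernel configuration):

#ifndef __RQ_OFFSETS_H__
#define __RQ_OFFSETS_H__
/*
 * DO NOT MODIFY.
 *
 * This file was generated by Kbuild
 */

#define RQ_nr_pinned 2752 /* offsetof(struct rq, nr_pinned) */

#endif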