Message-ID: <20260124171546.43398-1-qq570070308@gmail.com>
Date: Sun, 25 Jan 2026 01:15:43 +0800
From: Xie Yuanbin <qq570070308@...il.com>
To: peterz@...radead.org,
tglx@...nel.org,
riel@...riel.com,
segher@...nel.crashing.org,
david@...nel.org,
hpa@...or.com,
arnd@...db.de,
mingo@...hat.com,
juri.lelli@...hat.com,
vincent.guittot@...aro.org,
dietmar.eggemann@....com,
rostedt@...dmis.org,
bsegall@...gle.com,
mgorman@...e.de,
vschneid@...hat.com,
bp@...en8.de,
dave.hansen@...ux.intel.com,
luto@...nel.org,
houwenlong.hwl@...group.com
Cc: linux-kernel@...r.kernel.org,
x86@...nel.org,
Xie Yuanbin <qq570070308@...il.com>
Subject: [PATCH v6 0/3] Optimize code generation during context switching
This series optimizes the performance of context switching. It does not
modify any code logic; it only changes the inline attributes of some
functions.

It was found that finish_task_switch() is not inlined even at the -O2
optimization level. Performance testing indicates that this can lead to a
significant performance degradation when certain Spectre vulnerability
mitigations are enabled. This may be due to the following reasons:
1. In switch_mm_irqs_off(), some mitigations may clear the branch
   prediction history, or even flush the instruction cache, e.g.
   arm64_apply_bp_hardening() on arm64, BPIALL/ICIALLU on arm, and
   indirect_branch_prediction_barrier() on x86. finish_task_switch() runs
   right after switch_mm_irqs_off(), so its performance is greatly
   affected by the extra function call and branch jumps.
2. __schedule() has the __sched attribute, which places it in the
   '.sched.text' section, while finish_task_switch() does not. This puts
   them far away from each other in vmlinux, which aggravates the
   performance degradation (see the definition sketched below).
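For context, __sched is roughly the following section attribute (as
defined in include/linux/sched/debug.h in current kernels):

```c
/* Place the function into the dedicated scheduler text section. */
#define __sched __section(".sched.text")
```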
This series of patches primarily marks some functions called during
context switching as always inline to optimize performance.
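As a rough illustration of the approach (these are not the exact hunks
from the patches; the __releases() annotation follows the declaration in
kernel/sched/core.c):

```c
/* Before: a plain static function, which the compiler may keep
 * out of line even at -O2. */
static struct rq *finish_task_switch(struct task_struct *prev)
	__releases(rq->lock);

/* After: forced into its only caller, context_switch(). */
static __always_inline struct rq *finish_task_switch(struct task_struct *prev)
	__releases(rq->lock);
```

Here is the test data: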
Performance test data - time spent on calling finish_task_switch():
1. x86-64: Intel i5-8300h@...z, DDR4@...6mhz; unit: TSC cycles
| test scenario                     |   old |   new | delta           |
|-----------------------------------|-------|-------|-----------------|
| gcc 15.2                          | 27.50 | 25.45 |  -2.05 ( -7.5%) |
| gcc 15.2 + spectre_v2_user=on     | 46.75 | 25.96 | -20.79 (-44.5%) |
| clang 21.1.7                      | 27.25 | 25.45 |  -1.80 ( -6.6%) |
| clang 21.1.7 + spectre_v2_user=on | 39.50 | 26.00 | -13.50 (-34.2%) |
2. x86-64: AMD 9600x@...5Ghz, DDR5@...0mhz; unit: TSC cycles
| test scenario                     |    old |   new | delta           |
|-----------------------------------|--------|-------|-----------------|
| gcc 15.2                          |  27.51 | 27.51 |      0 (    0%) |
| gcc 15.2 + spectre_v2_user=on     | 105.21 | 67.89 | -37.32 (-35.5%) |
| clang 21.1.7                      |  27.51 | 27.51 |      0 (    0%) |
| clang 21.1.7 + spectre_v2_user=on | 104.15 | 67.52 | -36.63 (-35.2%) |
3. arm64: Raspberry Pi 3b Rev 1.2, Cortex-A53@...Ghz, unaffected by the
   Spectre v2 vulnerability; unit: cntvct_el0
| test scenario | old   | new   | delta           |
|---------------|-------|-------|-----------------|
| gcc 15.2      | 1.453 | 1.115 | -0.338 (-23.3%) |
| clang 21.1.7  | 1.532 | 1.123 | -0.409 (-26.7%) |
4. arm32: Raspberry Pi 3b Rev 1.2, Cortex-A53@...Ghz, unaffected by the
   Spectre v2 vulnerability; unit: cntvct_el0
| test scenario | old   | new   | delta           |
|---------------|-------|-------|-----------------|
| gcc 15.2      | 1.421 | 1.187 | -0.234 (-16.5%) |
| clang 21.1.7  | 1.437 | 1.200 | -0.237 (-16.5%) |
Size test data:
1. bzImage size:
| test scenario      |      old |      new | delta |
|--------------------|----------|----------|-------|
| gcc 15.2 + -Os     | 12604416 | 12604416 |     0 |
| gcc 15.2 + -O2     | 14500864 | 14500864 |     0 |
| clang 21.1.7 + -Os | 13718528 | 13718528 |     0 |
| clang 21.1.7 + -O2 | 14558208 | 14566400 |  8192 |
2. size of the .text section in vmlinux:
| test scenario      |      old |      new | delta |
|--------------------|----------|----------|-------|
| gcc 15.2 + -Os     | 16180040 | 16180616 |   576 |
| gcc 15.2 + -O2     | 19556424 | 19561352 |  4928 |
| clang 21.1.7 + -Os | 17917832 | 17918664 |   832 |
| clang 21.1.7 + -O2 | 20030856 | 20035784 |  4928 |
Test information:
1. Linux kernel source: commit d9771d0dbe18dd643760 ("Add linux-next
   specific files for 20251212") from the linux-next tree.
2. kernel config for performance test:
x86-64: `make x86_64_defconfig` first, then menuconfig setting:
CONFIG_HZ=100
CONFIG_DEBUG_ENTRY=n
CONFIG_X86_DEBUG_FPU=n
CONFIG_EXPERT=y
CONFIG_MODIFY_LDT_SYSCALL=n
CONFIG_STACKPROTECTOR=n
CONFIG_BLK_DEV_NVME=y (just for boot)
arm64: `make defconfig` first, then menuconfig setting:
CONFIG_KVM=n
CONFIG_HZ=100
CONFIG_SHADOW_CALL_STACK=y
arm32: `make multi_v7_defconfig` first, then menuconfig setting:
CONFIG_ARCH_OMAP2PLUS_TYPICAL=n
CONFIG_HIGHMEM=n
3. kernel config for size test:
`make x86_64_defconfig` first, then menuconfig setting:
CONFIG_SCHED_CORE=y
CONFIG_NO_HZ_FULL=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y (optional)
4. Compiler:
llvm: Debian clang version 21.1.7 (1) + Debian LLD 21.1.7
gcc: x86-64: gcc version 15.2.0 (Debian 15.2.0-11)
arm64/arm32: gcc version 15.2.0 (Debian 15.2.0-7) +
GNU ld (GNU Binutils for Debian) 2.45.50.20251209
5. When testing on the Raspberry Pi 3b, the CPU frequency should be fixed
   to obtain stable results. The following content was added to
   config.txt:
```config.txt
arm_boost=0
core_freq_fixed=1
arm_freq=1200
gpu_freq=250
sdram_freq=400
arm_freq_min=1200
gpu_freq_min=250
sdram_freq_min=400
```
6. cmdline configuration:
6.1 add `isolcpus=3` to obtain more stable test results (assuming the
test is run on cpu3).
6.2 optional: add `spectre_v2_user=on` on x86-64 to enable mitigations.
7. Performance testing code and operations:
kernel code:
```patch
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fd09afae72a2..40ce1b28cb27 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -485,3 +485,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common sched_test sys_sched_test
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8a4ac4841be6..5a42ec008620 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -395,6 +395,7 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common sched_test sys_sched_test
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cf84d98964b2..53f0d2e745bd 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,7 @@ asmlinkage long sys_listmount(const struct mnt_id_req __user *req,
 asmlinkage long sys_listns(const struct ns_id_req __user *req,
 			   u64 __user *ns_ids, size_t nr_ns_ids,
 			   unsigned int flags);
+asmlinkage long sys_sched_test(void);
 asmlinkage long sys_truncate(const char __user *path, long length);
 asmlinkage long sys_ftruncate(unsigned int fd, off_t length);
 #if BITS_PER_LONG == 32
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 942370b3f5d2..65023afc291b 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -860,8 +860,11 @@ __SYSCALL(__NR_file_setattr, sys_file_setattr)
 #define __NR_listns 470
 __SYSCALL(__NR_listns, sys_listns)

+#define __NR_sched_test 471
+__SYSCALL(__NR_sched_test, sys_sched_test)
+
 #undef __NR_syscalls
-#define __NR_syscalls 471
+#define __NR_syscalls 472

 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..f53a423c8600 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5191,6 +5191,31 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 	calculate_sigpending();
 }

+static DEFINE_PER_CPU(uint64_t, total_time);
+
+static __always_inline uint64_t test_gettime(void)
+{
+#ifdef CONFIG_X86_64
+	register uint64_t rax __asm__("rax");
+	register uint64_t rdx __asm__("rdx");
+
+	__asm__ __volatile__ ("rdtsc" : "=a"(rax), "=d"(rdx));
+	return rax | (rdx << 32);
+#elif defined(CONFIG_ARM64)
+	uint64_t ret;
+
+	__asm__ __volatile__ ("mrs %0, cntvct_el0" : "=r"(ret));
+	return ret;
+#elif defined(CONFIG_ARM)
+	uint64_t ret;
+
+	__asm__ __volatile__ ("mrrc p15, 1, %Q0, %R0, c14" : "=r" (ret));
+	return ret;
+#else
+#error "Not supported"
+#endif
+}
+
 /*
  * context_switch - switch to the new MM and the new thread's register state.
  */
@@ -5256,7 +5281,15 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	switch_to(prev, next, prev);
 	barrier();

-	return finish_task_switch(prev);
+	{
+		uint64_t end_time;
+		// volatile: force start_time to be allocated on the stack
+		__volatile__ uint64_t start_time = test_gettime();
+		rq = finish_task_switch(prev);
+		end_time = test_gettime();
+		raw_cpu_add(total_time, end_time - start_time);
+	}
+	return rq;
 }

 /*
@@ -10827,3 +10860,32 @@ void sched_change_end(struct sched_change_ctx *ctx)
 			p->sched_class->prio_changed(rq, p, ctx->prio);
 	}
 }
+
+static struct task_struct *wait_task;
+#define PRINT_PERIOD (1U << 20)
+static DEFINE_PER_CPU(uint32_t, total_count);
+
+SYSCALL_DEFINE0(sched_test)
+{
+	preempt_disable();
+	while (1) {
+		if (likely(wait_task))
+			wake_up_process(wait_task);
+		wait_task = current;
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		__schedule(SM_NONE);
+		if (unlikely(raw_cpu_inc_return(total_count) == PRINT_PERIOD)) {
+			const uint64_t total = raw_cpu_read(total_time);
+			uint64_t tmp_h, tmp_l;
+
+			tmp_h = total * 100000;
+			do_div(tmp_h, (uint32_t)PRINT_PERIOD);
+			tmp_l = do_div(tmp_h, (uint32_t)100000);
+
+			pr_emerg("cpu[%d]: total cost time %llu in %u tests, %llu.%05llu per test\n", raw_smp_processor_id(), total, PRINT_PERIOD, tmp_h, tmp_l);
+			raw_cpu_write(total_time, 0);
+			raw_cpu_write(total_count, 0);
+		}
+	}
+	return 0;
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index e74868be513c..2a2d8d44cb3f 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -411,3 +411,4 @@
468 common file_getattr sys_file_getattr
469 common file_setattr sys_file_setattr
470 common listns sys_listns
+471 common sched_test sys_sched_test
```
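For clarity, the reporting path above multiplies the running total by
100000 before dividing, so the average prints with five decimal places.
A minimal user-space sketch of the same fixed-point computation (the
numbers are made up):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t total = 27262976;  /* hypothetical cycle sum */
	const uint32_t count = 1U << 20;  /* PRINT_PERIOD */
	/* Equivalent to: tmp_h = total * 100000; do_div(tmp_h, count);
	 * tmp_l = do_div(tmp_h, 100000); */
	uint64_t scaled = total * 100000 / count;

	printf("%llu.%05llu per test\n",
	       (unsigned long long)(scaled / 100000),
	       (unsigned long long)(scaled % 100000));
	return 0;
}
```

For the values above this prints "26.00000 per test".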
User-mode test program code:
```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	cpu_set_t mask;

	if (fork())
		sleep(1);
	CPU_ZERO(&mask);
	CPU_SET(3, &mask); // Assume that cpu3 exists
	assert(sched_setaffinity(0, sizeof(mask), &mask) == 0);
	syscall(471); // the sched_test syscall added by the patch above
	// unreachable
	return 0;
}
```
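A typical build and run, assuming the program is saved as sched_test.c
(the file name is arbitrary):

```
gcc -O2 -o sched_test sched_test.c
./sched_test
```

The fork() creates the second task of the ping-pong pair, so a single
invocation is enough.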
Test operation:
1. Apply the above kernel patch and build the kernel.
2. Add `isolcpus=3` to the kernel cmdline and boot.
3. Run the above user program.
4. Wait for the kernel report (an example line is shown below).
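With the pr_emerg() format in the patch above, each report line looks
like this (hypothetical numbers):

```
cpu[3]: total cost time 29360128 in 1048576 tests, 28.00000 per test
```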
v5->v6: https://lore.kernel.org/20251214190907.184793-1-qq570070308@gmail.com
- Based on tglx's suggestion, move '#define enter_....' under the
inline function in patch [1/3].
- Based on tglx's suggestion, correct the description error
in patch [1/3].
- Rebase to the latest linux-next source.
v4->v5: https://lore.kernel.org/20251123121827.1304-1-qq570070308@gmail.com
- Rebase to the latest linux-next source.
- Improve the test code and retest.
- Add the test of AMD 9600x and Raspberry Pi 3b.
v3->v4: https://lore.kernel.org/20251113105227.57650-1-qq570070308@gmail.com
- Improve the commit message
v2->v3: https://lore.kernel.org/20251108172346.263590-1-qq570070308@gmail.com
- Fix build error in patch 1
- Simply add the __always_inline attribute to the existing functions,
  instead of adding separate always-inline versions of them
v1->v2: https://lore.kernel.org/20251024182628.68921-1-qq570070308@gmail.com
- Make raw_spin_rq_unlock() inline
- Make __balance_callbacks() inline
- Add comments for always inline functions
- Add performance test data
Xie Yuanbin (3):
x86/mm/tlb: Make enter_lazy_tlb() always inline on x86
sched: Make raw_spin_rq_unlock() inline
sched/core: Make finish_task_switch() and its subfunctions always
inline
arch/arm/include/asm/mmu_context.h | 2 +-
arch/riscv/include/asm/sync_core.h | 2 +-
arch/s390/include/asm/mmu_context.h | 2 +-
arch/sparc/include/asm/mmu_context_64.h | 2 +-
arch/x86/include/asm/mmu_context.h | 23 +++++++++++++++++-
arch/x86/include/asm/sync_core.h | 2 +-
arch/x86/mm/tlb.c | 21 -----------------
include/linux/perf_event.h | 2 +-
include/linux/sched/mm.h | 10 ++++----
include/linux/tick.h | 4 ++--
include/linux/vtime.h | 8 +++----
kernel/sched/core.c | 17 +++++---------
kernel/sched/sched.h | 31 ++++++++++++++-----------
13 files changed, 62 insertions(+), 64 deletions(-)
--
2.51.0