Message-ID: <a86b6bbd-c0ed-40dc-899f-ba162332c80a@linux.ibm.com>
Date: Fri, 9 Jan 2026 16:53:04 +0530
From: Shrikanth Hegde <sshegde@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: juri.lelli@...hat.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, vschneid@...hat.com, clrkwllms@...nel.org,
        linux-kernel@...r.kernel.org, linux-rt-devel@...ts.linux.dev,
        Linus Torvalds <torvalds@...ux-foundation.org>, mingo@...nel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH] sched: Further restrict the preemption modes

Hi Peter,

On 12/19/25 3:45 PM, Peter Zijlstra wrote:
> 
> [ with 6.18 being an LTS release, it might be a good time for this ]
> 
> The introduction of PREEMPT_LAZY was for multiple reasons:
> 
>    - PREEMPT_RT suffered from over-scheduling, hurting performance compared to
>      !PREEMPT_RT.
> 
>    - the introduction of (more) features that rely on preemption; like
>      folio_zero_user() which can do large memset() without preemption checks.
> 
>      (Xen already had a horrible hack to deal with long running hypercalls)
> 
>    - the endless and uncontrolled sprinkling of cond_resched() -- mostly cargo
>      cult or in response to poor to replicate workloads.
> 
> By moving to a model that is fundamentally preemptable these things become
> manageable and avoid needing to introduce more horrible hacks.
> 
> Since this is a requirement; limit PREEMPT_NONE to architectures that do not
> support preemption at all. Further limit PREEMPT_VOLUNTARY to those
> architectures that do not yet have PREEMPT_LAZY support (with the eventual goal
> to make this the empty set and completely remove voluntary preemption and
> cond_resched() -- notably VOLUNTARY is already limited to !ARCH_NO_PREEMPT.)
> 
> This leaves up-to-date architectures (arm64, loongarch, powerpc, riscv, s390,
> x86) with only two preemption models: full and lazy (like PREEMPT_RT).
> 
> While Lazy has been the recommended setting for a while, not all distributions
> have managed to make the switch yet. Force things along. Keep the patch minimal
> in case of hard to address regressions that might pop up.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> ---
>   kernel/Kconfig.preempt |    3 +++
>   kernel/sched/core.c    |    2 +-
>   kernel/sched/debug.c   |    2 +-
>   3 files changed, 5 insertions(+), 2 deletions(-)
> 
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -16,11 +16,13 @@ config ARCH_HAS_PREEMPT_LAZY
>   
>   choice
>   	prompt "Preemption Model"
> +	default PREEMPT_LAZY if ARCH_HAS_PREEMPT_LAZY
>   	default PREEMPT_NONE
>   
>   config PREEMPT_NONE
>   	bool "No Forced Preemption (Server)"
>   	depends on !PREEMPT_RT
> +	depends on ARCH_NO_PREEMPT
>   	select PREEMPT_NONE_BUILD if !PREEMPT_DYNAMIC
>   	help
>   	  This is the traditional Linux preemption model, geared towards
> @@ -35,6 +37,7 @@ config PREEMPT_NONE
>   
>   config PREEMPT_VOLUNTARY
>   	bool "Voluntary Kernel Preemption (Desktop)"
> +	depends on !ARCH_HAS_PREEMPT_LAZY
>   	depends on !ARCH_NO_PREEMPT
>   	depends on !PREEMPT_RT
>   	select PREEMPT_VOLUNTARY_BUILD if !PREEMPT_DYNAMIC
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7553,7 +7553,7 @@ int preempt_dynamic_mode = preempt_dynam
>   
>   int sched_dynamic_mode(const char *str)
>   {
> -# ifndef CONFIG_PREEMPT_RT
> +# if !(defined(CONFIG_PREEMPT_RT) || defined(CONFIG_ARCH_HAS_PREEMPT_LAZY))
>   	if (!strcmp(str, "none"))
>   		return preempt_dynamic_none;
>   
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -243,7 +243,7 @@ static ssize_t sched_dynamic_write(struc
>   
>   static int sched_dynamic_show(struct seq_file *m, void *v)
>   {
> -	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
> +	int i = (IS_ENABLED(CONFIG_PREEMPT_RT) || IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY)) * 2;
>   	int j;
>   
>   	/* Count entries in NULL terminated preempt_modes */

Maybe only change the default to LAZY, but keep the other options
available via dynamic switching?

- When the default changes to lazy, the scheduling pattern can change
and may affect workloads. Having the ability to dynamically switch back
to none/voluntary would help one figure out where a workload is
regressing. We could document the cases where regression is expected.
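
For reference, on kernels built with PREEMPT_DYNAMIC the model can
already be inspected and switched at runtime through debugfs (a sketch;
it assumes debugfs is mounted at /sys/kernel/debug, and the set of
accepted modes depends on the kernel's configuration):

```shell
# List the available preemption modes; the active one is shown in
# parentheses, e.g. "none voluntary (full) lazy" on a lazy-capable arch.
cat /sys/kernel/debug/sched/preempt

# Switch modes at runtime (needs root). The same names are accepted at
# boot time via the preempt= kernel command line parameter.
echo none > /sys/kernel/debug/sched/preempt
```

With this patch applied, "none" and "voluntary" would no longer be
accepted here on ARCH_HAS_PREEMPT_LAZY architectures, which is exactly
the escape hatch being asked about above.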

- With preempt=full/lazy we will likely never see softlockups. How are
we going to find the longer kernel paths (some may be by design, some
may be bugs) other than by observing workload regressions?


Also, is the softlockup code of any use with preempt=full/lazy?


