Message-ID: <C9D3DC1A-CBF5-4AB3-B500-C932A6868B13@oracle.com>
Date: Thu, 20 Nov 2025 07:37:34 +0000
From: Prakash Sangappa <prakash.sangappa@...cle.com>
To: Thomas Gleixner <tglx@...utronix.de>
CC: LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
	"Paul E. McKenney" <paulmck@...nel.org>,
	Boqun Feng <boqun.feng@...il.com>, Jonathan Corbet <corbet@....net>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Arnd Bergmann <arnd@...db.de>,
	"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>
Subject: Re: [patch V3 07/12] rseq: Implement syscall entry work for time
slice extensions
> On Nov 19, 2025, at 7:25 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
>
> On Wed, Nov 19 2025 at 00:20, Prakash Sangappa wrote:
>>> On Oct 29, 2025, at 6:22 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
>>> + if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
>>> + force_sig(SIGSEGV);
>>> +}
>>
>> I have been trying to get our Database team to implement changes to
>> use the slice extension API. They run into the case where a system
>> call is made within the slice extension window, and the process dies
>> with SIGSEGV.
>
> Good. Works as designed.
>
>> Apparently it will be hard to guarantee that no system call is made
>> within the slice extension window, due to layering.
>
> Why do I have a smell of rotten onions in my nose right now?
>
>> For the DB use case, it is fine to terminate the slice extension if a
>> system call is made, but the process getting killed will not work.
>
> That's not a question of being fine or not.
>
> The point is that on PREEMPT_NONE/VOLUNTARY that arbitrary syscall can
> consume tons of CPU cycles until it either schedules out voluntarily or
> reaches __exit_to_user_mode_loop(), which is defeating the whole
> mechanism. The timer does not help in that case because once the task is
> in the kernel it won't be preempted on return from interrupt.
>
> sys_rseq_slice_yield() is time bound, which is why it was implemented
> that way.
>
> I was absolutely right when I asked to tie this mechanism to
> PREEMPT_LAZY|FULL in the first place. That would nicely avoid the whole
> problem.
>
> Something like the uncompiled and untested below should work. Though I
> hate it with a passion.
That works. It addresses the DB issue.
> + * Grudgingly support onion layer applications which cannot
> + * guarantee that rseq_slice_yield() is used to yield the CPU for
> + * terminating a grant. This is a NOP on PREEMPT_FULL/LAZY because
> + * enabling preemption above already scheduled, but required for
> + * PREEMPT_NONE/VOLUNTARY to prevent that the slice is further
> + * expanded up to the point where the syscall code schedules
> + * voluntarily or reaches exit_to_user_mode_loop().
> */
> - if (put_user(0U, &curr->rseq.usrptr->slice_ctrl.all) || syscall != __NR_rseq_slice_yield)
> - force_sig(SIGSEGV);
> + if (syscall != __NR_rseq_slice_yield)
> + cond_resched();
> }
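
For context, the DB-side changes follow roughly the pattern sketched below. This is only a sketch against my reading of the V3 ABI: the slice control word sits in the registered struct rseq (the kernel code above clears rseq.usrptr->slice_ctrl.all) and a granted extension is terminated with the dedicated rseq_slice_yield() syscall. The SLICE_EXT_REQUEST/GRANTED bit names, the syscall number and the slice_ctrl pointer below are placeholders, not the actual uapi definitions.

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Placeholder bit layout and syscall number; the real definitions come
 * from the patched uapi headers. These are NOT the actual ABI values.
 */
#define SLICE_EXT_REQUEST	0x1U
#define SLICE_EXT_GRANTED	0x2U
#ifndef __NR_rseq_slice_yield
#define __NR_rseq_slice_yield	468
#endif

/*
 * Points at the slice control word inside this thread's registered
 * struct rseq; how it is located is ABI detail omitted here.
 */
static volatile uint32_t *slice_ctrl;

static void run_short_critical_section(void (*fn)(void))
{
	/* Ask the kernel to defer preemption across the critical section. */
	__atomic_store_n(slice_ctrl, SLICE_EXT_REQUEST, __ATOMIC_RELAXED);

	fn();		/* must not enter the kernel, see the discussion above */

	/*
	 * Clear the control word and, if the kernel actually granted an
	 * extension, hand the CPU back right away via the time-bound
	 * yield syscall instead of letting some arbitrary later syscall
	 * terminate the grant.
	 */
	if (__atomic_exchange_n(slice_ctrl, 0U, __ATOMIC_RELAXED) & SLICE_EXT_GRANTED)
		syscall(__NR_rseq_slice_yield);
}

With the change above, if such a critical section does end up entering the kernel through a lower layer, the grant is simply terminated via cond_resched() instead of the task being killed.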
With this change, here are the Swingbench performance results I received from our Database team.

https://www.dominicgiles.com/swingbench/

Kernel: based on rseq/slice v3 + the above change.
System: 2-socket AMD.
Config: cached DB, i.e. DB files cached on tmpfs.

Response from the Database performance engineer:
Overall the results are very positive and consistent with the earlier findings; we see a clear benefit from the optimization when running the same tests as before.
• The sgrant figure in /sys/kernel/debug/rseq/stats increases with the DB-side optimization enabled, while it stays flat when disabled. I believe this indicates that both the kernel-side code and the DB-side triggers are working as expected (a trivial way to check this counter is sketched after the numbers below).
• Due to the highly contended nature of the workload these tests produce highly erratic results, but the optimization shows improved performance across three runs each with and without the time slice extension.
• Swingbench throughput with the time slice optimization:
  • Run 1: 50,008.10
  • Run 2: 59,160.60
  • Run 3: 67,342.70
• Swingbench throughput without the time slice optimization:
  • Run 1: 36,422.80
  • Run 2: 33,186.00
  • Run 3: 44,309.80
• The application performs 55% better on average with the optimization.
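
For completeness, the sgrant observation above was made by simply dumping /sys/kernel/debug/rseq/stats before and after a run. Below is the kind of trivial check that can be used; the exact file format is whatever the debug code prints, so nothing is parsed here.

#include <stdio.h>
#include <stdlib.h>

/*
 * Dump the rseq debug statistics (including the sgrant counter); run it
 * before and after a benchmark and compare the counters.
 */
int main(void)
{
	FILE *f = fopen("/sys/kernel/debug/rseq/stats", "r");
	char line[256];

	if (!f) {
		perror("fopen /sys/kernel/debug/rseq/stats");
		return EXIT_FAILURE;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return EXIT_SUCCESS;
}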
-Prakash